Whether you’re thinking up a new open source project or building your product roadmap at work, there’s usually a period when you’re doing research. It may start with identifying a problem and then move into investigating what solutions (if any) already exist, how they stack up, and what the broader market looks like.
At GitHub, we’re lucky to have an in-house team of researchers and analysts, and to partner with external academic researchers, too. Where our in-house team often tackles shorter-duration studies whose insights can quickly be reflected in our products, our academic partners typically conduct larger studies that often underscore key data points that inform our work and product strategy.
Today, we want to share three separate academic research studies that directly informed how we have built and evolved our community forum tool, GitHub Discussions.
Originally the brainchild of our internal product teams, GitHub Discussions was designed to help developers and open source communities shift conversations without a clear, actionable outcome from GitHub Issues to a more open-ended community forum tool—right in GitHub.
Developers and open source maintainers on GitHub told us that issues were becoming a catch-all for conversations, which made it difficult to get work done, especially as teams and communities continued to grow. We also knew (through our own research) that context switching between forum tools, chat applications, and GitHub was making it harder to get work done and more difficult to centralize OSS communities in one place.
With that, we shipped a beta of GitHub Discussions in 2020 as a way to continue testing our hypotheses on a larger scale.
While that was happening, a research team led by Dr. Hideaki Hata, an associate professor on the Faculty of Engineering at Shinshu University in Japan, reached out to us about conducting a study of GitHub Discussions early adopters.
Dr. Hata had previously done research on GitHub, covering the GitHub Sponsors program and the intersection between academic papers and GitHub, among other things. But this new study was focused on determining if our initial hypotheses around GitHub Discussions were right (that is, that it would create a better place for community conversations).
In the course of his study, Dr. Hata found that GitHub Discussions did make it simpler to separate community conversations from work in Issues. He and his team also found GitHub Discussions weren’t just a place for conversations, but a place where communities and teams would plan upcoming work—and that produced a positive result on community productivity rates.
A big reason we built GitHub Discussions was to make life easier for maintainers—both in engaging and cultivating their communities directly on GitHub and separating out conversations from the daily work.
Early on, we tasked ourselves with building a community insights dashboard that would quickly surface community health metrics to maintainers. Our goal was to make it easier for maintainers to understand how active their communities are, and identify ways to attract, teach, and retain contributors.
Our research team conducted quantitative and qualitative research activities to understand what a maintainer would actually need to improve their workflows. The first version of our dashboard featured basic metrics—contribution activity, GitHub Discussions page views, daily contributors, and new contributors—that we thought everyone would need. This was a start, but we knew we needed more.
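To make two of those basic metrics concrete, here is a minimal sketch of how daily contributors and new contributors could be derived from a log of contribution events. It is an illustration only, not how the GitHub dashboard is implemented: the `(day, contributor)` event shape, the function name `daily_and_new_contributors`, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def daily_and_new_contributors(events):
    """Summarize hypothetical (day, contributor) events per day.

    "daily" counts distinct contributors active on that day.
    "new" counts contributors seen for the first time on that day.
    """
    by_day = defaultdict(set)
    for day, who in events:
        by_day[day].add(who)  # a set ignores repeat activity on the same day

    seen = set()
    summary = {}
    for day in sorted(by_day):
        active = by_day[day]
        newcomers = active - seen  # never active on any earlier day
        summary[day] = {"daily": len(active), "new": len(newcomers)}
        seen |= active
    return summary


# Made-up sample data, for illustration only.
events = [
    (date(2022, 1, 1), "ana"),
    (date(2022, 1, 1), "ben"),
    (date(2022, 1, 2), "ana"),
    (date(2022, 1, 2), "cy"),
    (date(2022, 1, 2), "cy"),
]

summary = daily_and_new_contributors(events)
for day, counts in summary.items():
    print(day, counts)
```

In this sample, the second day has two daily contributors but only one new contributor, since one of them was already active the day before.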
That quickly became apparent when we partnered with Dr. Denae Ford, a senior researcher at Microsoft Research, and Mariam Guizani, a computer science Ph.D. student at Oregon State University interning at Microsoft Research. Their goal was to determine what improvements would make it easier to “attract and retain [contributors and] newcomers” to open source projects. In a published academic paper, alongside researchers Tom Zimmermann and Anita Sarma, they highlighted a series of maintainer-tested recommendations and presented their findings at the 2022 International Conference on Software Engineering.
Guizani’s recommendations came with design mockups of new dashboard features, including a graph showing new contributor activity, overall community activity rates, contributor retention trends, and ways for maintainers to quickly see and call attention to “rising contributors” who are active within an OSS project.
Early on, a handful of maintainers using GitHub Discussions told us they were overloaded by answering the same questions multiple times. For anyone who has used a community forum, duplicate content—or near-duplicate content—is a common experience.
But here’s the funny part: as our internal research teams sat down to conduct qualitative interviews with contributors, we didn’t hear this was a core problem for a majority of users. This could have been because of who we spoke to in our sample of contributors. We knew there were some issues with duplicate discussions, but based on our interviews we had no idea how big a problem they were.
And that’s part of the challenge of conducting qualitative research on a platform like GitHub with millions of developers. Not everyone has even close to the same experience, so even if a comparatively small group of people isn’t affected by a particular problem, well, that doesn’t mean there isn’t a problem. We saw this up close when academic researchers looked into the prevalence and impact of duplicate discussions among developers using GitHub Discussions.
In a large-scale study, Márcia Lima, a Brazilian Ph.D. candidate at Universidade Federal do Amazonas (UFAM) and faculty member at Universidade do Estado do Amazonas (UEA), and Dr. Igor Steinmacher, an Assistant Professor at Northern Arizona University, alongside Tayana Conte and Bruno Gadelha, proposed an approach based on a Sentence-BERT model to detect related, and even duplicate, discussion posts in OSS communities.
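The general idea behind this kind of detection can be sketched in a few lines: turn each post into a vector, compare every pair by cosine similarity, and flag the pairs that score above a threshold. The sketch below is not the researchers’ implementation. The `embed` function is a toy bag-of-words stand-in for a Sentence-BERT encoder, and the function names, the sample posts, and the `related_at` and `duplicate_at` thresholds are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    A real pipeline would call a Sentence-BERT encoder here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_related(posts, related_at=0.5, duplicate_at=0.9):
    """Label post pairs as "duplicate" or "related" by similarity score.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    vectors = {post_id: embed(text) for post_id, text in posts.items()}
    results = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= duplicate_at:
            results.append((a, b, "duplicate"))
        elif score >= related_at:
            results.append((a, b, "related"))
    return results


# Made-up discussion titles, for illustration only.
posts = {
    1: "How do I configure the cache directory?",
    2: "How do I configure the cache directory?",
    3: "Where can I configure the cache directory on Windows?",
    4: "Release schedule for version 2",
}

pairs = find_related(posts)
for a, b, label in pairs:
    print(a, b, label)
```

A bag-of-words vector only catches posts that share wording. The appeal of a sentence embedding model is that it can also score paraphrased posts as similar, which matters when people reword the same question to get attention.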
The result? In a single project alone, we found 151 related discussion posts; among them 81 were clear duplicates. Developers were duplicating their posts to emphasize their need for help—and some even triplicated their posts on the same day, which was surprising. All of this was leading to a negative maintainer and developer experience.
What we learned from Márcia’s study was that developers were duplicating their posts to increase their odds of getting attention and answers. We’re working now to ship new features that enable maintainers and GitHub Discussions administrators to quickly identify and mark duplicate discussions and questions, and we plan to follow those with features that automate the process overall.
Whether we’re doing product research here at GitHub or with academic partners, our work always revolves around building the best developer experiences possible. If you’re using GitHub Discussions, let us know what you think!
And if you’re interested in learning more about how GitHub partners with academic researchers, check out some additional studies academic researchers have conducted on GitHub:
- [Microsoft Research] Towards Mining OSS Skills from GitHub Activity
- [Cornell University] GitHub Sponsors: Exploring a New Way to Contribute to Open Source
- [Cornell University] FixMe: A GitHub Bot for Detecting and Monitoring On-Hold Self-Admitted Technical Debt