Making open source data more available
Data gives us insight into how people build software, and the activities of open source communities on GitHub represent one of the richest datasets ever created of people working together at scale.
In 2012, the community-led project GitHub Archive launched, providing a glimpse into the ways people build software on GitHub. Today, we’re delighted to announce that, in collaboration with Google, we are releasing a collection of additional BigQuery tables to expand on the data from that project [1].
This 3TB+ dataset is the largest released source of GitHub activity to date. It contains activity data for more than 2.8 million open source GitHub repositories, including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
With this new dataset, a simple query can show which Go packages are most commonly used or which US schools have the most open source contributors, or find all of the things that should never happen.
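To give a feel for what a "most commonly used Go packages" analysis involves, here is a minimal local sketch in Python. It is not the query used against the BigQuery tables. It assumes you already have the text of some Go files in memory (the `SAMPLE_FILES` list is made up for illustration) and it uses regular expressions to pull out import paths, counting each package once per file. The function names are hypothetical, and the patterns are deliberately simple, so they will miss edge cases such as parentheses inside comments.

```python
import re
from collections import Counter

# Single-line form: import "fmt"  or  import alias "fmt"
SINGLE_IMPORT = re.compile(r'^\s*import\s+(?:[\w.]+\s+)?"([^"]+)"', re.MULTILINE)
# Grouped form: import ( ... ), captured up to the first closing parenthesis
IMPORT_BLOCK = re.compile(r'^\s*import\s*\((.*?)\)', re.MULTILINE | re.DOTALL)
# Any quoted package path inside a grouped block
QUOTED_PATH = re.compile(r'"([^"]+)"')


def go_imports(source: str) -> list[str]:
    """Return the import paths found in one Go source file."""
    packages = SINGLE_IMPORT.findall(source)
    for block in IMPORT_BLOCK.findall(source):
        packages.extend(QUOTED_PATH.findall(block))
    return packages


def most_common_packages(files: list[str], top_n: int = 3) -> list[tuple[str, int]]:
    """Count how many files import each package and return the top entries."""
    counts: Counter[str] = Counter()
    for content in files:
        # Deduplicate so a package counts once per file.
        counts.update(sorted(set(go_imports(content))))
    return counts.most_common(top_n)


# Made-up sample contents standing in for rows of file text.
SAMPLE_FILES = [
    'package main\n\nimport (\n\t"fmt"\n\t"os"\n)\n',
    'package a\n\nimport "fmt"\n',
    'package b\n\nimport (\n\t"fmt"\n\tlog "github.com/sirupsen/logrus"\n)\n',
]

if __name__ == "__main__":
    for package, file_count in most_common_packages(SAMPLE_FILES):
        print(f"{package}\t{file_count}")
```

On the full dataset, the same idea would be expressed as a SQL query with regular-expression functions rather than by downloading file contents.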
Just as books capture thoughts and ideas, software encodes human knowledge in a machine-readable form. This dataset is a great start toward documenting the open source community’s vast repository of knowledge, but there’s more to be done. Over the coming months, you can expect to hear from us on how we hope to make open source data even more available, portable, and useful.
Whether you’re a researcher studying open source communities, an organization looking to monitor the health of your open source projects, or curious about the latest trends in software development, go check out the new dataset hosted on Google Cloud to analyze one of the largest datasets of people collaborating on the planet.
1. If you’d like to hear more about the data release, check out this episode of The Changelog.