
In our 2018 Octoverse report, we noticed machine learning and data science were popular topics on GitHub. tensorflow/tensorflow was one of the most contributed-to projects, pytorch/pytorch was one of the fastest-growing projects, and Python was the third most popular language on GitHub. We decided to dig a little deeper into the state of machine learning and data science on GitHub.
We pulled data on contributions between January 1, 2018 and December 31, 2018. Contributions could include pushing code, opening an issue or pull request, commenting on an issue or pull request, or reviewing a pull request. For the most imported packages, we used data from the dependency graph, which includes all public repositories and any private repositories that have opted in to the dependency graph.
We looked at contributors to repositories tagged with the “machine-learning” topic, and ranked the most common primary languages of the repositories. Python is the most common language among machine learning repositories and is the third most common language on GitHub overall. However, not all machine learning happens in Python: some of the most common languages on GitHub are also common languages for machine learning projects. C++, JavaScript, Java, C#, Shell, and TypeScript are all in the top 10 languages on GitHub and the top 10 for machine learning projects. Julia, R, and Scala all appear in the top 10 for machine learning projects but not for GitHub overall. Julia and R are both languages commonly used by data scientists, and Scala is becoming increasingly common when interacting with big data systems like Apache Spark.
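To make the comparison concrete, here is a minimal sketch of how a topic-filtered language ranking could be set against an overall ranking. The repository names and data are hypothetical, and counting repositories per primary language is an assumption about the method, not a description of our actual pipeline. The sample uses a top 2 instead of a top 10 so the difference is visible on a handful of rows.

```python
from collections import Counter

# Hypothetical sample: (repository, primary language, topics). Not real GitHub data.
repos = [
    ("example/a", "Python", {"machine-learning"}),
    ("example/b", "Python", {"machine-learning"}),
    ("example/c", "Python", {"machine-learning"}),
    ("example/d", "C++", {"machine-learning"}),
    ("example/e", "C++", {"machine-learning"}),
    ("example/f", "Julia", {"machine-learning"}),
    ("example/g", "JavaScript", {"web"}),
    ("example/h", "JavaScript", {"web"}),
    ("example/i", "JavaScript", {"web"}),
    ("example/j", "JavaScript", {"web"}),
]


def rank_languages(repos, topic=None, top_n=10):
    """Return languages ordered by repository count, optionally filtered by topic."""
    counts = Counter(
        language
        for _, language, topics in repos
        if topic is None or topic in topics
    )
    return [language for language, _ in counts.most_common(top_n)]


# Rank within the machine-learning topic and across all repositories.
ml_top = rank_languages(repos, topic="machine-learning", top_n=2)
overall_top = rank_languages(repos, top_n=2)

# Languages that make the machine learning ranking but not the overall one.
ml_only = [language for language in ml_top if language not in overall_top]
print(ml_top, overall_top, ml_only)
```

In this toy data, C++ plays the role that Julia, R, and Scala play in the real rankings: it appears in the topic-specific list but not in the overall one.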
We pulled data from the dependency graph to calculate the percentage of projects with machine learning or data science topics that import popular Python packages. The list above shows the top ten packages imported by these projects. Here’s what we found:
The rest of the top ten are utility packages: six is a Python 2 and 3 compatibility library, and python-dateutil and pytz are packages for working with dates.
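The percentage calculation behind this kind of ranking can be sketched in a few lines. The project names and import sets below are made up for illustration, and the `import_percentages` helper is hypothetical; it is not the dependency graph's API.

```python
from collections import Counter

# Hypothetical dependency data: project -> set of imported packages.
project_imports = {
    "example/project-1": {"numpy", "pandas", "six"},
    "example/project-2": {"numpy", "scipy"},
    "example/project-3": {"numpy", "pytz", "python-dateutil"},
    "example/project-4": {"requests"},
}


def import_percentages(project_imports):
    """Return the percentage of projects importing each package, highest first."""
    total = len(project_imports)
    # Each project contributes at most once per package because imports are sets.
    counts = Counter(pkg for pkgs in project_imports.values() for pkg in pkgs)
    return {pkg: 100.0 * n / total for pkg, n in counts.most_common()}


percentages = import_percentages(project_imports)
for package, percent in percentages.items():
    print(f"{package}: {percent:.0f}%")
```

Storing each project's imports as a set means a package imported in many files of the same project still counts once, which matches a "percentage of projects" measure.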
We also looked at which open source projects with the “machine-learning” topic had the most contributors in 2018. TensorFlow was by far the most popular, with more than five times as many contributors as the second most popular project, scikit-learn. Two projects, explosion/spaCy and RasaHQ/rasa_nlu, are focused on natural language processing problems. Another four projects, CMU-Perceptual-Computing-Lab/openpose, thtrieu/darkflow, ageitgey/face_recognition, and tesseract-ocr/tesseract, are focused on image processing. The Julia language source code was also one of the most contributed-to projects in 2018.
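For readers who want to try a similar ranking on their own data, here is a rough sketch of counting distinct contributors per repository, using the contribution types and 2018 date window described earlier. The event records, event type names, and usernames are all hypothetical; real activity data will be shaped differently.

```python
from collections import defaultdict
from datetime import date

# Assumed names for the contribution types described in the methodology.
CONTRIBUTION_TYPES = {"push", "issue_opened", "pull_request_opened", "comment", "review"}
START, END = date(2018, 1, 1), date(2018, 12, 31)

# Hypothetical events: (repository, user, event type, date).
events = [
    ("example/repo-a", "alice", "push", date(2018, 3, 1)),
    ("example/repo-a", "bob", "comment", date(2018, 6, 15)),
    ("example/repo-a", "alice", "review", date(2018, 7, 2)),
    ("example/repo-b", "carol", "push", date(2018, 12, 31)),
    ("example/repo-b", "dave", "push", date(2019, 1, 1)),  # outside the window
    ("example/repo-b", "erin", "star", date(2018, 5, 5)),  # not a contribution type
]


def contributors_per_repo(events):
    """Rank repositories by number of distinct contributors in the window."""
    contributors = defaultdict(set)
    for repo, user, kind, day in events:
        if kind in CONTRIBUTION_TYPES and START <= day <= END:
            contributors[repo].add(user)
    return sorted(
        ((repo, len(users)) for repo, users in contributors.items()),
        key=lambda item: item[1],
        reverse=True,
    )


ranking = contributors_per_repo(events)
print(ranking)
```

Using a set per repository means someone who pushes, comments, and reviews in the same project is counted as one contributor.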
We love seeing the amazing projects you’ve built using machine learning. If you want to explore more of these projects on GitHub, check out the Explore page.