In our 2018 Octoverse report, we noticed machine learning and data science were popular topics on GitHub. tensorflow/tensorflow was one of the most contributed-to projects, pytorch/pytorch was one of the fastest growing projects, and Python was the third most popular language on GitHub. We decided to dig a little deeper into the state of machine learning and data science on GitHub.
We pulled data on contributions between January 1, 2018 and December 31, 2018. Contributions could include pushing code, opening an issue or pull request, commenting on an issue or pull request, or reviewing a pull request. For the most imported packages, we used data from the dependency graph, which includes all public repositories and any private repositories that have opted in to the dependency graph.
Popular machine learning and data science packages
We pulled data from the dependency graph to calculate the percentage of projects with machine learning or data science topics that import popular Python packages. The list above shows the top ten packages imported by these projects. Here’s what we found:
- NumPy, a package with support for mathematical operations on multidimensional data, was the most imported package, used in nearly three-quarters of machine learning and data science projects.
- SciPy, a package for scientific computation, pandas, a package for managing datasets, and matplotlib, a visualization library, are each used in over 40% of machine learning and data science projects.
- Scikit-learn, a popular machine learning package containing implementations of a large number of machine learning algorithms, is used by nearly 40% of projects.
- TensorFlow, a package for working with neural nets, is used in nearly a quarter of projects.
The rest of the top ten are utility packages: six is a Python 2 and 3 compatibility library, and python-dateutil and pytz are packages for working with dates.
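To make the percentage calculation concrete, here is a minimal sketch of how the share of projects importing each package could be computed. It is not the query we ran against the dependency graph: the `project_imports` mapping, the project names, and the `import_share` helper are all hypothetical, and the sample data is invented for illustration.

```python
from collections import Counter

# Hypothetical sample data: each project maps to the set of packages it imports.
# Real data would come from the dependency graph; these values are made up.
project_imports = {
    "project-a": {"numpy", "pandas", "matplotlib"},
    "project-b": {"numpy", "scipy"},
    "project-c": {"tensorflow", "numpy", "six"},
    "project-d": {"scikit-learn", "pandas"},
}


def import_share(project_imports, top_n=10):
    """Return (package, percent of projects importing it), most imported first."""
    total = len(project_imports)
    if total == 0:
        return []
    counts = Counter()
    for packages in project_imports.values():
        # Count each package at most once per project.
        counts.update(set(packages))
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, 100.0 * n / total) for name, n in ranked[:top_n]]


shares = import_share(project_imports)
for name, percent in shares:
    print(f"{name}: {percent:.0f}% of projects")
```

Counting each package once per project is what makes the result a share of projects rather than a count of import statements.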
Popular machine learning projects
We also looked at which open source projects with the “machine-learning” label had the most contributors in 2018. TensorFlow was by far the most popular, with more than five times as many contributors as the second most popular project, scikit-learn. Two projects, explosion/spaCy and RasaHQ/rasa_nlu, are focused on natural language processing problems. Another four projects, CMU-Perceptual-Computing-Lab/openpose, thtrieu/darkflow, ageitgey/face_recognition, and tesseract-ocr/tesseract, are focused on image processing. The Julia language source code was also one of the most contributed-to projects in 2018.
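As a rough illustration of the ranking, the sketch below counts distinct contributors per project within a date window, using the contribution types described earlier (pushing code, opening or commenting on an issue or pull request, or reviewing a pull request). The event tuples, the type names in `CONTRIBUTION_TYPES`, and the `contributor_counts` helper are assumptions made for this example, not our actual data model.

```python
from collections import defaultdict
from datetime import date

# Hypothetical labels for the contribution types counted in this post.
CONTRIBUTION_TYPES = {
    "push",
    "issue_opened",
    "pull_request_opened",
    "issue_comment",
    "pull_request_comment",
    "pull_request_review",
}

# Hypothetical sample events: (repository, user, event type, date).
events = [
    ("org/repo-a", "alice", "push", date(2018, 3, 1)),
    ("org/repo-a", "bob", "issue_comment", date(2018, 6, 10)),
    ("org/repo-a", "alice", "pull_request_review", date(2018, 7, 2)),
    ("org/repo-a", "carol", "push", date(2019, 1, 5)),   # outside the window
    ("org/repo-a", "dave", "star", date(2018, 5, 5)),    # not a contribution
    ("org/repo-b", "alice", "issue_opened", date(2018, 12, 31)),
]


def contributor_counts(events, start, end):
    """Rank repositories by distinct contributors between start and end, inclusive."""
    contributors = defaultdict(set)
    for repo, user, kind, day in events:
        if kind in CONTRIBUTION_TYPES and start <= day <= end:
            # A set keeps each user once per repository, however active they were.
            contributors[repo].add(user)
    return sorted(
        ((repo, len(users)) for repo, users in contributors.items()),
        key=lambda item: (-item[1], item[0]),
    )


ranking = contributor_counts(events, date(2018, 1, 1), date(2018, 12, 31))
for repo, count in ranking:
    print(f"{repo}: {count} contributors")
```

Because users are stored in a set, someone who pushes code and also reviews pull requests in the same project is counted as one contributor.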
We love seeing the amazing projects you’ve built using machine learning. If you want to explore more of these projects on GitHub, check out the Explore page.