The State of the Octoverse: machine learning

Thomas Elliott

In our 2018 Octoverse report, we noticed that machine learning and data science were popular topics on GitHub. tensorflow/tensorflow was one of the most contributed-to projects, pytorch/pytorch was one of the fastest-growing projects, and Python was the third most popular language on GitHub. We decided to dig a little deeper into the state of machine learning and data science on GitHub.

We pulled data on contributions between January 1, 2018 and December 31, 2018. Contributions could include pushing code, opening an issue or pull request, commenting on an issue or pull request, or reviewing a pull request. For the most imported packages, we used data from the dependency graph, which includes all public repositories and any private repositories that have opted in to the dependency graph.
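As a rough illustration of the contribution definition above, the sketch below filters a handful of made-up events down to those that fall inside the 2018 window and match one of the listed contribution types. The event labels, user names, and repository names are hypothetical, and it assumes that both end dates are inclusive and that the listed types are the complete set; the post does not confirm either.

```python
from datetime import date

# Contribution types named in the text. The string labels are hypothetical.
CONTRIBUTION_TYPES = {
    "push",
    "issue_opened",
    "pull_request_opened",
    "issue_comment",
    "pull_request_comment",
    "pull_request_review",
}

# Assumption: both ends of the window are inclusive.
WINDOW_START = date(2018, 1, 1)
WINDOW_END = date(2018, 12, 31)

# Hypothetical events: (user, repository, event type, date).
EVENTS = [
    ("ana", "example/alpha", "push", date(2018, 3, 14)),
    ("bo", "example/alpha", "issue_opened", date(2018, 12, 31)),
    ("cy", "example/alpha", "star", date(2018, 6, 1)),   # not a listed type
    ("di", "example/alpha", "push", date(2019, 1, 1)),   # outside the window
    ("ana", "example/alpha", "pull_request_review", date(2018, 7, 2)),
]


def is_contribution(event_type, when):
    """True if the event is a listed contribution type inside the window."""
    return event_type in CONTRIBUTION_TYPES and WINDOW_START <= when <= WINDOW_END


def contributors(events, repository):
    """Distinct users with at least one qualifying contribution to a repository."""
    return {
        user
        for user, repo, event_type, when in events
        if repo == repository and is_contribution(event_type, when)
    }


alpha_contributors = contributors(EVENTS, "example/alpha")
```

Counting distinct users rather than raw events means a person who pushes code and also reviews a pull request is counted once.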

Programming languages

Top machine learning languages on GitHub for 2018

We looked at contributors to repositories tagged with the “machine-learning” topic, and ranked the most common primary languages of the repositories. Python is the most common language among machine learning repositories and is the third most common language on GitHub overall. However, not all machine learning happens in Python: some of the most common languages on GitHub are also common languages for machine learning projects. C++, JavaScript, Java, C#, Shell, and TypeScript are all in the top 10 languages on GitHub and the top 10 for machine learning projects. Julia, R, and Scala all appear in the top 10 for machine learning projects but not for GitHub overall. Julia and R are both languages commonly used by data scientists, and Scala is becoming increasingly common when interacting with big data systems like Apache Spark.
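One way to picture the ranking described above is the sketch below, which groups topic-tagged repositories by primary language and orders the languages by number of distinct contributors. The sample repositories and users are invented, and ranking by distinct contributors (rather than by repository count) is an assumption about the method, not something the post spells out.

```python
from collections import defaultdict

# Hypothetical records: (repository, primary language, topics, contributors).
# None of these values come from GitHub's data; they only show the shape.
REPOS = [
    ("example/alpha", "Python", {"machine-learning"}, {"ana", "bo", "cy"}),
    ("example/beta", "C++", {"machine-learning", "gpu"}, {"bo", "di"}),
    ("example/gamma", "Python", {"machine-learning"}, {"cy", "ed"}),
    ("example/delta", "Julia", {"machine-learning"}, {"fay"}),
    ("example/epsilon", "JavaScript", {"web"}, {"gil", "hal"}),
]


def rank_languages(repos, topic="machine-learning"):
    """Rank primary languages by distinct contributors to repos tagged with `topic`."""
    contributors_by_language = defaultdict(set)
    for _name, language, topics, contributors in repos:
        if topic in topics:
            # Set union so a user active in two repos is counted once per language.
            contributors_by_language[language] |= contributors
    counts = {lang: len(people) for lang, people in contributors_by_language.items()}
    # Descending by contributor count, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_languages(REPOS)
```

Running the same function without the topic filter would give the overall language ranking, which is what makes the comparison between the two top-10 lists possible.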

Top packages imported by machine learning projects on GitHub for 2018

We pulled data from the dependency graph to calculate the percentage of projects with machine learning or data science topics that import popular Python packages. The list above shows the top ten packages imported by these projects. Here’s what we found:

  • NumPy, a package with support for mathematical operations on multidimensional data, was the most imported package, used in nearly three-quarters of machine learning and data science projects.
  • SciPy (a package for scientific computation), pandas (a package for managing datasets), and matplotlib (a visualization library) are each used in over 40% of machine learning and data science projects.
  • Scikit-learn, a popular machine learning package containing implementations of a large number of machine learning algorithms, is used by nearly 40% of projects.
  • TensorFlow, a package for working with neural nets, is used in nearly a quarter of projects.

The rest of the top ten are utility packages: six is a Python 2 and 3 compatibility library, and python-dateutil and pytz are packages for working with dates.
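The percentages above can be thought of as "projects importing a package, divided by all projects considered". The sketch below works through that arithmetic on invented data; the project names and dependency sets are hypothetical and do not reflect the dependency graph's actual format or contents.

```python
from collections import Counter

# Hypothetical dependency data: project name -> set of imported packages.
PROJECT_DEPENDENCIES = {
    "example/alpha": {"numpy", "pandas", "six"},
    "example/beta": {"numpy", "scipy"},
    "example/gamma": {"numpy", "tensorflow", "pytz"},
    "example/delta": {"pandas"},
}


def import_share(project_dependencies):
    """Return (package, percent of projects importing it), most imported first."""
    total = len(project_dependencies)
    counts = Counter()
    for packages in project_dependencies.values():
        # Each value is a set, so a project counts at most once per package.
        counts.update(packages)
    shares = {pkg: 100.0 * n / total for pkg, n in counts.items()}
    # Descending by share, then alphabetically to break ties.
    return sorted(shares.items(), key=lambda item: (-item[1], item[0]))


shares = import_share(PROJECT_DEPENDENCIES)
```

In this toy data set numpy appears in three of four projects, so its share is 75%. The percentages do not sum to 100 because a single project can import several packages.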

Top machine learning projects on GitHub for 2018

We also looked at which open source projects with the “machine-learning” topic had the most contributors in 2018. TensorFlow was by far the most popular, with more than five times as many contributors as the second most popular project, scikit-learn. Two projects, explosion/spaCy and RasaHQ/rasa_nlu, are focused on natural language processing problems. Another four projects, CMU-Perceptual-Computing-Lab/openpose, thtrieu/darkflow, ageitgey/face_recognition, and tesseract-ocr/tesseract, are focused on image processing. The Julia language source code was also one of the most contributed-to projects in 2018.

We love seeing the amazing projects you’ve built using machine learning. If you want to explore more of these projects on GitHub, check out the Explore page.