Trends in Developers

I tried to measure relative trends across developer communities: for example, whether Ruby developers are more active than Java developers, how the communities compare on different metrics, and whether we can draw any useful conclusions from that.

Before I begin, an important confession: I am not a data scientist, nor do I have much education in statistics. I have just tried to analyze the trends using standard tools and techniques. My inspiration was this:

The best way to get started in data science is to DO data science!

First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.

Getting Started with Data Science by Hilary Mason

I wanted to try some real-world data that I had collected myself, so I turned to the GitHub APIs and collected the following metrics for an arbitrary sample of some 2,600 projects in the languages Java, JavaScript, Ruby and Python:

  • Watchers – People who follow the project but do not contribute code to it
  • Forks – People who have created a fork and are making their own contributions to the project

Both definitions assume the ideal case: watchers only follow the project, while people who fork contribute code (though in many cases people make no changes to their forks).

I first wrote a Python script to pull data from GitHub on some projects for each language and store it in CSV files. The Search API came in handy here.

This creates one dataset per language with columns in this order: project name, project owner, number of watchers, number of forks, and whether the repository is itself a fork. Since GitHub doesn’t list projects by language without a search keyword, the script iterates over every letter of the alphabet as the keyword to get a list of 2,600 repositories.
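The original script isn't shown here, but a minimal sketch of the collection step might look like the following. It assumes today's GitHub Search API endpoint and field names (`watchers_count`, `forks_count`, `fork`), which may differ from what was available when the post was written, and it assumes 100 results per letter, which would account for the 2,600 total. The helper names `search_urls`, `to_row` and `write_csv` are made up for illustration, and the sketch only builds URLs and formats rows; it does not make any network calls.

```python
import csv
import io
import string
from urllib.parse import urlencode

# Assumption: the current Search API endpoint, not necessarily the one used in the post.
SEARCH_URL = "https://api.github.com/search/repositories"


def search_urls(language, per_page=100):
    """Build one search URL per letter of the alphabet (26 * 100 = 2,600 results)."""
    urls = []
    for letter in string.ascii_lowercase:
        query = urlencode({"q": f"{letter} language:{language}", "per_page": per_page})
        urls.append(f"{SEARCH_URL}?{query}")
    return urls


def to_row(item):
    """Flatten one repository object into the column order described above."""
    return [
        item["name"],
        item["owner"]["login"],
        item["watchers_count"],  # note: in today's API this field reports stars
        item["forks_count"],
        item["fork"],
    ]


def write_csv(items, handle):
    """Write one CSV row per repository to an open file-like object."""
    writer = csv.writer(handle)
    for item in items:
        writer.writerow(to_row(item))


# Usage with a hand-written repository object instead of a live API response.
sample = {"name": "demo", "owner": {"login": "alice"},
          "watchers_count": 5, "forks_count": 2, "fork": False}
buffer = io.StringIO()
write_csv([sample], buffer)
print(search_urls("java")[0])
print(buffer.getvalue().strip())
```

Fetching each URL and passing the `items` list of the JSON response to `write_csv` would produce one CSV file per language.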

Next, to compare Java and JavaScript by number of forks, I sorted each dataset by its 4th column (the number of forks) and plotted one against the other. I then fitted a linear model with a zero intercept to find the best-fit line. These computations were done in R.
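To make the fitting step concrete, here is a small sketch in Python rather than R (in R, a zero-intercept fit is usually written as `lm(y ~ 0 + x)`). For a line through the origin, the least-squares slope is sum(x*y) / sum(x*x). The function names and the fork counts below are invented for illustration and are not the post's data.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b for the model y = b * x (intercept fixed at zero)."""
    if len(xs) != len(ys):
        raise ValueError("both series need the same number of points")
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        raise ValueError("x values are all zero")
    return sum(x * y for x, y in zip(xs, ys)) / denominator


def compare(counts_a, counts_b):
    """Sort both series, pair them rank by rank, and fit b = slope * a."""
    return slope_through_origin(sorted(counts_a), sorted(counts_b))


# Made-up fork counts, chosen so that each sorted pair satisfies y = 3 * x.
java = [1, 4, 2, 10]
javascript = [3, 30, 6, 12]
print(compare(java, javascript))  # 3.0
```

Because both series are sorted before pairing, the slope compares projects of the same rank in each language, not the same project.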

Java vs. JavaScript

[Figure: Java vs. JavaScript by Forks]

It yielded a linear model:

javascript = 3.609 * java

Thus the tendency to fork is greater among JavaScript developers than among Java developers: for projects at the same rank, the JavaScript project has roughly 3.6 times as many forks. JavaScript folks seem more ready than Java folks to take a project and make changes or contribute to it.

I did the same computation for the number of watchers by sorting on column 3 of the original dataset and making the corresponding changes to the code.
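Switching from forks to watchers only changes which column is read. A minimal sketch of that step, assuming the CSV layout described earlier (the constant and function names are hypothetical, and the two rows are invented):

```python
import csv
import io

# Zero-based indexes of the 3rd (watchers) and 4th (forks) CSV columns.
WATCHERS, FORKS = 2, 3


def sorted_column(handle, column):
    """Read one numeric column from an open CSV file and return it sorted."""
    return sorted(int(row[column]) for row in csv.reader(handle) if row)


# Two invented rows in the layout: name, owner, watchers, forks, is_fork.
sample = "demo,alice,5,2,False\nother,bob,1,7,True\n"
print(sorted_column(io.StringIO(sample), WATCHERS))  # [1, 5]
print(sorted_column(io.StringIO(sample), FORKS))     # [2, 7]
```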

[Figure: Java vs. JavaScript by Watchers]

This yielded a linear model:

javascript = 7.22 * java

This result suggests that a lot more JavaScript developers follow other JavaScript projects than Java developers follow Java projects.

Java vs. Python

Running the same comparison by number of forks for Java vs. Python, here is what I got:

[Figure: Java vs. Python by Forks]

The linear model came out as:

python = 0.7672 * java

Here the trend line leans towards Java, which suggests that Java developers fork repositories, and try to make changes or contribute, more than Python developers do.

On doing the same analysis for watchers,

[Figure: Java vs. Python by Watchers]

This result was a bit surprising. The linear model came out as:

python = 1.423 * java

Thus projects in the Python community have more watchers than comparable projects in the Java community. Python people seem to prefer following a project, while Java devs prefer forking it and contributing to it.

Python vs. Ruby

The result for the number of forks for Python vs. Ruby is:

[Figure: Python vs. Ruby by Forks]

This gives a linear model:

ruby = 2.743 * python

The trend line leans towards the y-axis, i.e. Ruby, which suggests that Ruby devs fork a repo much more often than Python developers do. Rubyists like to jump in and make changes right away.

For the number of watchers:

[Figure: Python vs. Ruby by Watchers]

This gave us:

ruby = 2.329 * python

I don’t need to say much. Rubyists follow others’ projects more than the corresponding Python devs.

Conclusion

By comparing such metrics side by side, we can pick out trends between programming-language communities: for example, Java developers fork more than Python developers, while Python developers follow others’ projects more, and Rubyists both fork and follow projects more than Python people.

It’s a nice way to learn about and analyze trends in developer communities. More can be done; this just scratches the surface.

5 thoughts on “Trends in Developers”

  1. Just using Github skews things a bit in the Ruby direction I would think. If you were to take Bitbucket repos into account, add them to the Github repos and come up with an average, I believe you’d have a more accurate metric.

    1. Hmm.. Thanks for that! You are right, the Ruby community prefers to use Github, and so it skews the results a bit in that direction. This was just an experiment to see different metrics, but using a different source like Bitbucket as well will improve the results.

  2. Fork is not yet a contribution. Might be a fork for a dirty hack. Measure pull requests for contribution and maybe number of committers for a repo.

    1. Yes, I do agree with that, and just as I wrote in the main post, I assume it to be a contribution. Mostly forks are for making your own changes to the code, whether you push them upstream or not. Many times, people just fork a project and leave it as it is. Just to keep it simple, I assume a fork to be a score of how much the community hacks on others’ code.

      The correct approach would perhaps be to check the authors of each commit and count those as the contributions, but that would be too complicated for the purpose.

      1. I disagree that it would be too complicated… It’s necessary for an accurate result-set. Your data set is very biased. People fork repositories for the purpose of showing what they’re interested in as well. Recruiters track this on github, so it may be that rubyists exploit this more than people in other communities, rather than that rubyists contribute more.

        If you want to measure which community contributes more, you have to actually measure contributions. Measuring anything else is, well… Measuring something other than contributions.
