I tried to measure the relative trends across developer communities: for example, whether Ruby people are more active than Java developers, how the communities compare on different metrics, and whether any interesting conclusions can be drawn from that.
Before I begin, an important confession: I am in no way a data scientist, nor do I have much education in statistics. I have simply tried to analyze the trends using standard tools and techniques. My inspiration was something like:
The best way to get started in data science is to DO data science!
First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.
- Watchers – People who are following the project but are not contributing code to it
- Forks – People who have created a fork and are making their own contributions to the project
Both of my assumptions describe the ideal case: I assume watchers are following the project, while people who fork are contributing code (though in many cases, people never make any changes to their forks).
I first wrote a Python script to pull data from GitHub on some projects for each language and store it in CSV files. The Search API came in handy here.
This creates a dataset for each language with columns in the order: name of the project, owner of the project, number of watchers, number of forks, and whether the repository is itself a fork. Since GitHub doesn't list projects by language without a search keyword, the script iterates over every letter of the alphabet to collect a list of 2,600 repositories.
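My original script isn't shown here, but the idea can be sketched roughly as below, using only the standard library. The function names (`repo_to_row`, `fetch_repos`, `dump_language`) are my own illustrative choices, not the actual script, and real use would need API-token handling and rate-limit care that I've left out:

```python
import csv
import json
import string
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"

def repo_to_row(repo):
    """Flatten one repository object from the Search API response into
    the CSV column order used above: name, owner, watchers, forks,
    and whether the repository is itself a fork."""
    return [
        repo["name"],
        repo["owner"]["login"],
        repo["watchers_count"],
        repo["forks_count"],
        repo["fork"],
    ]

def fetch_repos(language, keyword):
    """Fetch one page (up to 100 results) of repositories matching a
    keyword, restricted to one language via the `language:` qualifier."""
    query = urllib.parse.urlencode(
        {"q": f"{keyword} language:{language}", "per_page": 100}
    )
    with urllib.request.urlopen(f"{API}?{query}") as resp:
        return json.load(resp)["items"]

def dump_language(language, path):
    """Iterate over every letter of the alphabet as the search keyword
    and append all results for one language to a single CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for letter in string.ascii_lowercase:
            for repo in fetch_repos(language, letter):
                writer.writerow(repo_to_row(repo))

# Usage (hits the live GitHub API, so expect rate limits):
#   for lang in ("ruby", "java", "python"):
#       dump_language(lang, f"{lang}.csv")
```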
It yielded a linear model:
I did the same computation for the number of watchers by sorting the original dataset by column 3 and making the appropriate changes to the code.
This yielded a linear model:
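The sort-and-fit step can be sketched like this. It assumes the i-th ranked count from one language's CSV is paired with the i-th ranked count from the other (how the points were actually paired isn't spelled out above), and the slope is an ordinary least-squares fit through the origin, matching the zero-intercept models quoted below:

```python
import csv

def top_counts(path, column, n=100):
    """Read one language's CSV, sort descending by the given 1-indexed
    column (3 = watchers, 4 = forks), and return the top-n counts."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    rows.sort(key=lambda r: int(r[column - 1]), reverse=True)
    return [int(r[column - 1]) for r in rows[:n]]

def fit_through_origin(x, y):
    """Least-squares slope b for the zero-intercept model y = b * x,
    i.e. b = sum(x*y) / sum(x*x)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Usage, e.g. for the forks comparison (column 4) of Java vs. Python:
#   slope = fit_through_origin(top_counts("java.csv", 4),
#                              top_counts("python.csv", 4))
```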
Java vs. Python
After trying to compute it by the number of forks for Java vs. Python here is what I got:
The linear model came out as:
python = 0.7672 * java
Thus the trend-line is oriented more towards Java, suggesting that Java developers fork repositories, and try to make changes or contribute, more often than Python devs.
On doing the same analysis for watchers,
This result was a bit surprising, the linear model came out as:
python = 1.423 * java
Thus the Python community has more watchers of other projects than the Java community does. Python people prefer to follow a project, while Java devs prefer to fork it and contribute to it.
Python vs. Ruby
The results of the number of forks for Python vs. Ruby is:
This gives a linear model:
ruby = 2.743 * python
The trend-line is oriented more towards the y-axis, i.e. Ruby, suggesting that Ruby devs fork a repo much more than Python developers do. Rubyists like to jump in and make changes right away.
For the number of watchers:
This gave us:
ruby = 2.329 * python
I don’t need to say much here: Rubyists follow others’ projects more than the corresponding Python devs do.
By comparing different aspects side by side, we can spot interesting trends between programming-language communities: Java developers like to fork more than Python developers, Python developers like to follow others’ projects more, and Rubyists both fork and follow projects more than Python people do.
It’s a nice way to learn about and analyze trends in developer communities. Much more can be done; this just scratches the surface.