Project Inspiration - We have all seen projects that try to classify quality based on some statistics in order to achieve an objective. This is a similar project, but with the intent of building a system that can help developers manage their projects efficiently by giving high priority to certain commits.
Objective - Build a system that can classify GitHub commits on the basis of their quality using an unsupervised method (K-medoids for clustering, with a random forest for analysis)
Result / Outcome -

The algorithm divided the commits into three categories: Cluster 1, Cluster 2, and Cluster 3.
Cluster 1 represents low-quality commits, Cluster 2 represents mid-quality commits, and Cluster 3 represents high-quality commits.
This figure shows how the algorithm classified more than 300,000 commits.
Below are tables generated to show the properties of each cluster.
Performance metrics
The performance metrics table is generated by the random forest algorithm. It tells us how the root mean squared error, explained variance score, median absolute error, and mean absolute error change for the set of commits in each cluster.
For example, we can see how the root mean squared error, explained variance score, median absolute error, and mean absolute error have all decreased for Cluster 2 and Cluster 3.
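As a rough sketch, per-cluster metrics like these could be produced by fitting a random forest on each cluster's commits and scoring its predictions. The column layout and the toy data below are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: fit a random forest per cluster and report the four
# error metrics from the table. Feature names are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_squared_error, explained_variance_score,
                             median_absolute_error, mean_absolute_error)

rng = np.random.default_rng(0)
# Toy stand-in for the commit feature table: readability, entropy, files changed
X = rng.random((300, 3))
y = X @ np.array([0.5, 1.5, -0.2]) + rng.normal(0, 0.1, 300)
clusters = rng.integers(1, 4, 300)  # pretend cluster labels 1..3

def cluster_metrics(X, y, clusters, k):
    """Fit a forest on one cluster's commits and compute its error metrics."""
    mask = clusters == k
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[mask], y[mask])
    pred = model.predict(X[mask])
    return {
        "rmse": mean_squared_error(y[mask], pred) ** 0.5,
        "explained_variance": explained_variance_score(y[mask], pred),
        "median_abs_error": median_absolute_error(y[mask], pred),
        "mean_abs_error": mean_absolute_error(y[mask], pred),
    }

for k in (1, 2, 3):
    print(k, cluster_metrics(X, y, clusters, k))
```

The same fitted forests also expose a feature_importances_ attribute, which is the kind of output that a variable-importance table is built from.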
Importance of different variables
This table tells us how important each variable is for each cluster. From this table, we reach the conclusion that high-quality commits (Cluster 3) have to be both well-written (Readability) and quite informative (Entropy). It also tells us that the number of files changed is not very relevant to whether a commit is considered high quality.
Cluster metrics

This table tells us the basic makeup of each cluster.
Correlation metrics
This table tells us the correlation between metrics such as the number of files changed, entropy, readability, and clusters.
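A correlation table like this can be produced with pandas' DataFrame.corr; the column names and values below are illustrative assumptions, not the project's data.

```python
# Hypothetical sketch of the correlation table (toy values, assumed columns).
import pandas as pd

df = pd.DataFrame({
    "files_changed": [2, 1, 5, 3, 8],
    "entropy": [3.1, 2.2, 4.0, 3.5, 4.4],
    "readability": [65.0, 80.2, 40.1, 55.3, 30.8],
    "cluster": [1, 1, 2, 2, 3],
})
corr = df.corr()  # pairwise Pearson correlations between all columns
print(corr.round(2))
```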
Whole Journey
Here I will walk through the steps taken to reach the objective.
Data Collection
For data collection, we used the commit history of 6 large open-source projects on GitHub. These 6 open-source projects are -
WordPress
CDT
Tomcat
PHP
Mysql
GnuCash
At first, I tried using Beautiful Soup and Selenium to scrape the project history, but that failed. I then used a much simpler method: I cloned each project to my PC and used the git command line to extract all commits into text files.
You can check out these commit text files here
For a bit more context, here is an example of what the commit data looked like
commit d98c376045d941a0bc06e0b2415328948763c132
Author: Mat Booth <mat.booth@gmail.com>
Date: Wed Jul 28 14:32:04 2021 +0100
Bug 562000 - Remove dependency to com.ibm.icu from CDT DSF PDA example
Switch to JRE implementations:
* com.ibm.icu.text.MessageFormat -> java.text.MessageFormat
Signed-off-by: Mat Booth <mat.booth@gmail.com>
Change-Id: I2c7eae20e197d0871694b09ec375dacb940a942a
2 files changed, 6 insertions(+), 9 deletions(-)
To make this data usable, I wrote a Python script to convert the data from the text files into xlsx/CSV files, and here was the result.
To make sure my Python script worked correctly, I checked the values of the first 5 rows and the last 5 rows against their respective source data.
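The conversion step can be sketched roughly as below. The field layout is inferred from the sample commit shown earlier; the actual script may differ.

```python
# Sketch of the text-to-CSV conversion, stdlib only. The parsing rules are
# inferred from the sample commit format above and are an approximation.
import csv
import re

SAMPLE = """commit d98c376045d941a0bc06e0b2415328948763c132
Author: Mat Booth <mat.booth@gmail.com>
Date: Wed Jul 28 14:32:04 2021 +0100

    Bug 562000 - Remove dependency to com.ibm.icu from CDT DSF PDA example

 2 files changed, 6 insertions(+), 9 deletions(-)
"""

STAT = re.compile(
    r"(\d+) files? changed(?:, (\d+) insertions?\(\+\))?(?:, (\d+) deletions?\(-\))?"
)

def parse_commits(text):
    """Split raw `git log --stat`-style text into one dict per commit."""
    rows = []
    for chunk in re.split(r"(?m)^commit ", text):
        if not chunk.strip():
            continue
        lines = chunk.splitlines()
        row = {"commit": lines[0].strip(), "author": "", "date": "",
               "message": "", "files": 0, "insertions": 0, "deletions": 0}
        msg = []
        for line in lines[1:]:
            if line.startswith("Author:"):
                row["author"] = line[7:].strip()
            elif line.startswith("Date:"):
                row["date"] = line[5:].strip()
            elif (m := STAT.search(line)):
                row["files"] = int(m.group(1))
                row["insertions"] = int(m.group(2) or 0)
                row["deletions"] = int(m.group(3) or 0)
            elif line.strip():
                msg.append(line.strip())
        row["message"] = " ".join(msg)
        rows.append(row)
    return rows

rows = parse_commits(SAMPLE)
with open("commits.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```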
Data Cleaning
In this step, I did three things -
1. Dropped the Date, Author, and Commit ID columns
2. Dropped rows with NULL commit messages
3. Performed a detailed EDA and removed all outliers
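In pandas, the three steps above might look roughly like this; the column names and the 1.5×IQR outlier rule are assumptions, not necessarily what the project used.

```python
# Sketch of the three cleaning steps on an assumed schema (toy data).
import pandas as pd

df = pd.DataFrame({
    "commit": ["a1", "b2", "c3", "d4"],
    "author": ["x"] * 4,
    "date": ["2021-07-28"] * 4,
    "message": ["fix bug", None, "add test", "refactor"],
    "insertions": [6, 2, 4, 900],
    "deletions": [9, 1, 3, 2],
    "files_changed": [2, 1, 1, 50],
})

# 1. Drop identifying columns that carry no quality signal
df = df.drop(columns=["date", "author", "commit"])
# 2. Drop rows whose commit message is NULL
df = df.dropna(subset=["message"])
# 3. Remove rows outside 1.5 * IQR on each numeric column (one common rule)
for col in ["insertions", "deletions", "files_changed"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)
```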
Feature engineering
After data cleaning, I was left with 4 columns: commit message, number of insertions, number of deletions, and number of files changed.
The commit message column was not directly usable, so after much research I decided to add two derived columns: readability and entropy. The readability column is based on the Flesch-Kincaid concept, which scores a text on the basis of how difficult it is to understand.
For more info, here are the readability metrics:
100.00–90.00 - Very easy to read. Easily understood by an average 11-year-old student.
90.0–80.0 - Easy to read. Conversational English for consumers.
80.0–70.0 - Fairly easy to read.
70.0–60.0 - Easily understood by 13- to 15-year-old students.
60.0–50.0 - Fairly difficult to read.
50.0–30.0 - Difficult to read; college level.
30.0–10.0 - Very difficult to read.
10.0–0.0 - Extremely difficult to read.
The entropy column is based on the concept of entropy, which tells us how informative a given text is. For this, I used an open-source Python script that calculates the entropy of a given text.
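Both engineered features can be sketched with the standard library alone. The project used a Flesch-Kincaid implementation and an open-source entropy script, so the crude vowel-group syllable heuristic below is only illustrative, not the actual code.

```python
# Approximate versions of the two engineered features, stdlib only.
import math
import re
from collections import Counter

def syllables(word):
    """Rough syllable count: runs of vowels, never fewer than one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text (see the
    score ranges listed above)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    n_words = max(1, len(words))
    n_syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)

def shannon_entropy(text):
    """Shannon entropy (bits per character) of the text's character
    distribution; more varied text scores higher."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

msg = "Bug 562000 - Remove dependency to com.ibm.icu from CDT DSF PDA example."
print(flesch_reading_ease(msg), shannon_entropy(msg))
```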
EDA
Only the best-looking results are shown -
Solution Generation
This is the part where I applied algorithms to our dataset. I used K-medoids for clustering and a random forest algorithm to get some interesting insights (most of these insights were already shared in the Result section).
Finding the number of clusters / K -
Using the elbow method, we find the number of clusters that will work best for our project.
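The elbow procedure can be sketched as follows. scikit-learn's KMeans stands in here for the project's K-medoids implementation; the procedure is the same either way: plot the within-cluster error against k and look for the bend.

```python
# Elbow-method sketch on toy data with three obvious groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6)])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia drops steeply up to the true number of groups, then flattens:
# that bend is the "elbow".
print([round(i, 1) for i in inertias])
```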
I tried using PCA to reduce my data to 3 dimensions to get some interesting insights, but it was a failed attempt.
Clustering -
In the end, I used a conventional approach and applied the K-medoids algorithm to my dataset, and here are the best results -
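For illustration, here is a minimal K-medoids loop in pure NumPy. The project likely used a library implementation (such as scikit-learn-extra's KMedoids); this alternating assign/update sketch is only meant to show the idea.

```python
# Minimal K-medoids sketch: alternate between assigning points to their
# nearest medoid and moving each medoid to the member point that minimises
# the total distance within its cluster.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Precompute all pairwise distances (fine for small toy data)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                # Pick the member with the smallest summed distance to the rest
                new[j] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break  # converged
        medoids = new
    return medoids, np.argmin(dist[:, medoids], axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 2, 4)])
medoids, labels = k_medoids(X, k=3)
print(np.bincount(labels, minlength=3))
```

Unlike K-means, the cluster centres here are always real data points (medoids), which makes the result easier to interpret: each cluster can be summarised by an actual example commit.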
Clusters-wise EDA
These are insights not shown in the Result section.