Investigating Python's history from the shell

Who are the top 10 commit authors in CPython's history? What percent of all of CPython's commits were made by the top 10 commit authors? And can I figure this out in a single bash line?

The answer to the third question is yes (it's a very long line). Let's look at how we'd build that line to get the answers to (1) and (2). I'll actually cheat here and use multiple lines for readability, but at the very end I'll show how it can be done in a single line.

Parsing through a git repo's history is an excellent way to learn both git and bash. Git's interface is designed to be shell-friendly, so the same tricks involved in parsing git's output apply to parsing files and the output of any other program.

Let's look at pulling a few interesting stats out of CPython's history:

  1. How many commits are in CPython's history?
  2. How many different people have committed to CPython?
  3. Who are the top 10 committers?
  4. What % of all commits were made by the top 10 committers?
In [1]:
%cd ~/code/cpython
/Users/Kyle/code/cpython

How many commits are in CPython's history?

The all-important git rev-list prints all commit hashes reachable from a given commit. wc -l counts the number of lines in stdin so combining the two prints the total number of commits.

In [2]:
%%bash
git rev-list HEAD | wc -l
  106683

How many different people have committed to CPython?

We can use git show to extract information about a particular commit. The output of git show opens in a paged view akin to less, but the --no-pager flag instructs git to just send its output to stdout. The --no-patch flag omits the commit's diff and displays just the commit information.

In [3]:
%%bash
git --no-pager show --no-patch HEAD
commit b146568dfcbcd7409c724f8917e4f77433dd56e4
Author: Serhiy Storchaka <storchaka@gmail.com>
Date:   Sat Mar 21 15:53:28 2020 +0200

    bpo-39652: Truncate the column name after '[' only if PARSE_COLNAMES is set. (GH-18942)

git show also provides a --format option for specific information about a commit. The %an specifier prints the author's name.

In [4]:
%%bash
git --no-pager show --no-patch --format=%an HEAD
Serhiy Storchaka

git rev-list can be piped into a loop to git show commit information one-by-one.

In [5]:
%%bash
git rev-list HEAD -n 10 |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done
Serhiy Storchaka
Serhiy Storchaka
Victor Stinner
Victor Stinner
Victor Stinner
Victor Stinner
Victor Stinner
Hai Shi
amaajemyfren
Victor Stinner

Now, we have the last 10 commit authors, but we want the number of unique commit authors. The uniq command sounds promising but it only removes duplicates on adjacent lines, so we first need to sort stdin and then pipe the output of that into uniq to get unique authors.

In [6]:
%%bash
git rev-list HEAD -n 10 |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq
Hai Shi
Serhiy Storchaka
Victor Stinner
amaajemyfren
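
If Python is more familiar than the shell, the adjacent-lines behaviour of uniq can be illustrated with itertools.groupby, which also only collapses runs of equal neighbours. This is just a sketch and not part of the pipeline: the authors list is copied from the ten-commit output above, and the helper name uniq is made up for the illustration.

```python
from itertools import groupby

# Authors of the 10 most recent commits, in the order printed earlier.
authors = [
    "Serhiy Storchaka", "Serhiy Storchaka",
    "Victor Stinner", "Victor Stinner", "Victor Stinner",
    "Victor Stinner", "Victor Stinner",
    "Hai Shi", "amaajemyfren", "Victor Stinner",
]


def uniq(lines):
    """Collapse runs of adjacent duplicates, like the uniq command."""
    return [key for key, _group in groupby(lines)]


# Without sorting, the trailing "Victor Stinner" survives as a second entry.
unsorted_result = uniq(authors)

# Sorting first brings all duplicates together, so each name appears once.
sorted_result = uniq(sorted(authors))

print(len(unsorted_result), unsorted_result)
print(len(sorted_result), sorted_result)
```

The unsorted call reports five names because the last commit's author is separated from the earlier run; the sorted call reports the four distinct authors.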

Again, we can pipe the output of this to wc -l to count the number of lines in stdin, giving us the number of unique commit authors.

In [7]:
%%bash
git rev-list HEAD -n 10 |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort | uniq | wc -l
       4

Removing -n 10 from git rev-list runs this over the entire history.

In [8]:
%%bash
git rev-list HEAD |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq |
wc -l
    1292

Who are the top 10 committers?

With only a slight modification to the above, we can count commits per author: adding a -c flag to uniq prefixes each line with its number of occurrences.

In [9]:
%%bash
git rev-list HEAD -n 10 |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c
   1 Hai Shi
   2 Serhiy Storchaka
   6 Victor Stinner
   1 amaajemyfren

To get the top 10 authors in order, we can re-sort the output by piping the output of uniq -c back into sort. For this sort, we'll need to pass a -r flag to sort in reverse order (printing the largest numbers first) and a -n flag to do a numeric sort.

In [10]:
%%bash
git rev-list HEAD -n 10 |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c |
sort -rn
   6 Victor Stinner
   2 Serhiy Storchaka
   1 amaajemyfren
   1 Hai Shi
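
For comparison, here is roughly the same count-and-rank step in Python. It's a sketch on the ten sample authors rather than a replacement for the pipeline: collections.Counter stands in for sort | uniq -c, and a sort on the count stands in for sort -rn. The order of authors with equal counts may differ from what sort prints.

```python
from collections import Counter

# The ten sample authors from the output shown earlier.
authors = (
    ["Serhiy Storchaka"] * 2
    + ["Victor Stinner"] * 5
    + ["Hai Shi", "amaajemyfren", "Victor Stinner"]
)

# Counter plays the role of `sort | uniq -c`: one count per distinct line.
counts = Counter(authors)

# Sorting by count, largest first, plays the role of `sort -rn`.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for name, count in ranked:
    print(f"{count:4d} {name}")
```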

Finally we can run this over the entire history and pipe the output to head to grab just the first 10 lines.

Note: over the full history this takes a long time, mostly because the loop launches a separate git show process for every commit. I recommend redirecting the output to a file and running it in the background by appending >> top_authors.txt & to the last line. I've done this separately and will use the top_authors.txt file from here on out.

In [11]:
%%bash
git rev-list HEAD |
while read commit
do
    git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c |
sort -rn |
head
11194 Guido van Rossum
6110 Victor Stinner
5837 Benjamin Peterson
5677 Georg Brandl
5465 Fred Drake
4159 Raymond Hettinger
4006 Serhiy Storchaka
3766 Antoine Pitrou
2978 Jack Jansen
2765 Martin v. Löwis

What % of all commits come from the top 10 commit authors?

So far we've used only a few commands combined with the power of piping and looping. We're about to look at a lot more commands that do very specific things. We'll build up slowly and look at the output of each intermediate step, but it's not very important what any individual command accomplishes. More important is understanding how data flows through pipes and that commands exist for doing any kind of data manipulation. Like any programming language, learning the specific pieces comes with repeated practice.

To get the percentage of commits from the top authors, we need to add up their commit counts and divide by the total number of commits. To get an individual author's commit count, we can use cut to extract it from the author line: -d ' ' specifies that fields are delimited by spaces and -f 1 selects the first field (that is, everything before the first space).

In [12]:
%%bash
cat top_authors.txt |
while read author
do
    echo $author |
    cut -d ' ' -f 1
done
11194
6110
5837
5677
5465
4159
4006
3766
2978
2765
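
One subtlety worth sketching: uniq -c right-aligns its counts, so a line can start with spaces, and cut would then see an empty first field. In the loop above, read and the unquoted echo $author trim and collapse that whitespace before cut runs. The Python below imitates both cases; the sample line and the helper name first_field are assumptions made for the illustration.

```python
# A line as `uniq -c` might print it, with the count right-aligned.
raw_line = "   6 Victor Stinner"


def first_field(line, delimiter=" "):
    """Return the text before the first delimiter, like cut -d ' ' -f 1."""
    return line.split(delimiter)[0]


# Splitting the raw line on single spaces yields an empty first field.
padded = first_field(raw_line)

# Collapsing runs of whitespace first exposes the count as field 1.
normalized = " ".join(raw_line.split())
count = first_field(normalized)

print(repr(padded), repr(count))
```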

Now, to get each author's contribution as a percentage, we can run git rev-list HEAD | wc -l and assign its output to a variable. Then, we can use awk to take each author's commit count and append *100 / $num_commits to generate an expression for each author's commit percentage.

In [13]:
%%bash
num_commits=`git rev-list HEAD | wc -l`

cat top_authors.txt |
while read author
do
    echo $author |
    cut -d ' ' -f 1 |
    awk -v num_commits="$num_commits" '{print $0 "*100 /" num_commits}'
done
11194*100 /  106683
6110*100 /  106683
5837*100 /  106683
5677*100 /  106683
5465*100 /  106683
4159*100 /  106683
4006*100 /  106683
3766*100 /  106683
2978*100 /  106683
2765*100 /  106683

To evaluate each expression, we can pipe it into bc, an arbitrary-precision calculator (the name is usually expanded as "basic calculator"). The -l flag loads the math library, which also sets the scale to 20 decimal places.

In [14]:
%%bash
num_commits=`git rev-list HEAD | wc -l`

cat top_authors.txt |
while read author
do
    echo $author |
    cut -d ' ' -f 1 |
    awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
    bc -l
done
10.49276829485485035103
5.72724801514768051142
5.47134969957725223325
5.32137266481070086143
5.12265309374502029376
3.89846554746304472127
3.75505000796752997197
3.53008445581770291424
2.79144755959243740802
2.59179063205946589428
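
To see where those 20 decimal places come from, here is a small Python imitation of the division bc performs for the first author. It assumes bc truncates (rather than rounds) at the configured scale; the helper name bc_divide is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, getcontext

getcontext().prec = 50  # plenty of working digits before truncating


def bc_divide(numerator, denominator, scale=20):
    """Divide, then truncate the quotient to `scale` decimal places."""
    quotient = Decimal(numerator) / Decimal(denominator)
    return quotient.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_DOWN)


num_commits = 106683  # total from `git rev-list HEAD | wc -l`
guido = bc_divide(11194 * 100, num_commits)
print(guido)
```

The printed value matches the first line of the bc output above, which is consistent with truncation: rounding would have changed the final digit.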

We're almost there. The last step is to sum all of these lines to get the total commit percentage. We can use paste to join all of the lines in stdin into one. The -s flag tells paste to join the lines of its input serially into a single line (rather than pasting lines from several files side by side), the -d+ option uses + as the separator between them, and - directs paste to read from stdin as opposed to a file.

In [15]:
%%bash
num_commits=`git rev-list HEAD | wc -l`

cat top_authors.txt |
while read author
do
    echo $author |
    cut -d ' ' -f 1 |
    awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
    bc -l
done |
paste -s -d+ -
10.49276829485485035103+5.72724801514768051142+5.47134969957725223325+5.32137266481070086143+5.12265309374502029376+3.89846554746304472127+3.75505000796752997197+3.53008445581770291424+2.79144755959243740802+2.59179063205946589428
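
In Python terms, paste -s -d+ - behaves much like str.join with "+" as the separator. The sketch below rebuilds the same expression from the ten percentages printed earlier and also sums them directly with decimal.Decimal, which is what evaluating the expression amounts to.

```python
from decimal import Decimal

# The per-author percentages printed by bc, one per line.
lines = [
    "10.49276829485485035103",
    "5.72724801514768051142",
    "5.47134969957725223325",
    "5.32137266481070086143",
    "5.12265309374502029376",
    "3.89846554746304472127",
    "3.75505000796752997197",
    "3.53008445581770291424",
    "2.79144755959243740802",
    "2.59179063205946589428",
]

# Joining with "+" mirrors `paste -s -d+ -`: all lines, one separator.
expression = "+".join(lines)

# Summing the values directly gives the result of evaluating the expression.
total = sum(Decimal(line) for line in lines)

print(expression)
print(total)
```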

All that's left is to pipe this back into bc to evaluate the expression.

In [16]:
%%bash
num_commits=`git rev-list HEAD | wc -l`

cat top_authors.txt |
while read author
do
    echo $author |
    cut -d ' ' -f 1 |
    awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
    bc -l
done |
paste -s -d+ - |
bc -l
48.70222997103568516067
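
As a sanity check on that number, the same arithmetic can be done in a few lines of Python by adding the counts first and dividing once. The counts and the total are copied from the outputs above; the variable names are mine.

```python
# Commit counts for the top 10 authors, as listed in top_authors.txt.
top_counts = [11194, 6110, 5837, 5677, 5465, 4159, 4006, 3766, 2978, 2765]
num_commits = 106683  # total from `git rev-list HEAD | wc -l`

# Add the counts first, then divide once by the total.
top_total = sum(top_counts)
percentage = top_total * 100 / num_commits

print(f"{top_total} of {num_commits} commits = {percentage:.2f}%")
```

Dividing once at the end avoids a separate division per author and agrees with the bc result to the precision of a float.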

And we're done!

The Python way

Python's subprocess module makes it easy to execute arbitrary commands and read from stdout. This can be useful for writing cross-platform scripts but it also allows you to use higher-level data structures like collections.Counter for ease and readability. Take the following equivalent script to find the top 10 commit authors.

In [17]:
import subprocess
from collections import Counter
from pathlib import Path

cpython_repo = Path.home() / "code" / "cpython"
rev_list = subprocess.run(
    ["git", "rev-list", "HEAD"],
    cwd=cpython_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)

commit_hashes = rev_list.stdout.split("\n")[:-1]
commit_authors = Counter()

for commit_hash in commit_hashes:
    commit_author = subprocess.run(
        [
            "git", "--no-pager", "show", commit_hash,
            "--oneline", "--no-patch", "--format=%an"
        ],
        cwd=cpython_repo,
        stdout=subprocess.PIPE,
        encoding="utf-8",
    )

    commit_authors.update([commit_author.stdout.strip()])

for commit_author in commit_authors.most_common(10):
    print(commit_author)
('Guido van Rossum', 11194)
('Victor Stinner', 6110)
('Benjamin Peterson', 5837)
('Georg Brandl', 5677)
('Fred Drake', 5465)
('Raymond Hettinger', 4159)
('Serhiy Storchaka', 4006)
('Antoine Pitrou', 3766)
('Jack Jansen', 2978)
('Martin v. Löwis', 2765)
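
The script above launches one git process per commit, which is slow for a history this long. One possible variation (a sketch, not the script used for the output above) asks git log to print every author name in a single process and counts the lines. The function names count_authors and top_authors and the example path are hypothetical.

```python
import subprocess
from collections import Counter


def count_authors(log_output):
    """Count author names, given text with one name per line."""
    return Counter(line for line in log_output.splitlines() if line)


def top_authors(repo, n=10):
    """Return the n most frequent commit authors reachable from HEAD."""
    log = subprocess.run(
        ["git", "log", "--format=%an", "HEAD"],
        cwd=repo,
        stdout=subprocess.PIPE,
        encoding="utf-8",
        check=True,
    )
    return count_authors(log.stdout).most_common(n)


# Example (assumes a local clone at this path):
# print(top_authors("/path/to/cpython"))
```

Keeping the counting separate from the git call makes it easy to try count_authors on a few lines of sample text before pointing top_authors at a real repository.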

Parting thoughts

I've been heavily influenced by Gary Bernhardt's content and this post in particular is inspired by his talk on The Unix Chainsaw. Gary's website, Destroy All Software, contains a ton of great screencasts on a wide range of topics and watching him work is a mesmerizing demonstration of vim and unix mastery.

I've also found this post on Text Processing in the Unix Shell to be a great introduction and resource. For git wisdom, I always recommend the Missing Semester of Your CS Education lecture on git, which gives a fantastic overview of git's internals.

Finally, as promised, here is the script above smashed into a single bash line.

num_commits=$(git rev-list HEAD | wc -l); cat top_authors.txt | while read author; do echo $author | cut -d ' ' -f 1 | awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' | bc -l; done | paste -s -d+ - | bc -l