Investigating Python's history from the shell
Who are the top 10 commit authors in CPython's history? What percent of all of CPython's commits were made by the top 10 commit authors? And can I figure this out in a single bash line?
The answer to the third question is yes (it's a very long line). Let's look at how we'd build that line to get the answers to (1) and (2). I'll actually cheat here and use multiple lines for readability, but at the very end I'll show how it can be done in a single line.
Parsing through a git repo's history is an excellent way to learn both git and bash. Git's interface is designed to be shell-friendly, so the same tricks used to parse git's output apply to parsing files and the output of any other program.
Let's look at pulling a few interesting stats out of CPython's history:
- How many commits are in CPython's history?
- How many different people have committed to CPython?
- Who are the top 10 committers?
- What % of all commits were made by the top 10 committers?
%cd ~/code/cpython
How many commits are in CPython's history?
The all-important git rev-list prints all commit hashes reachable from a given commit. wc -l counts the number of lines in stdin, so combining the two prints the total number of commits.
%%bash
git rev-list HEAD | wc -l
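As a cross-check, git can also do the counting itself with rev-list --count. The sketch below builds a throwaway repository so it runs without a CPython checkout. The temporary directory, the author identity, and the commit messages are all made up for illustration.

```shell
# Build a throwaway repo so the example does not depend on a CPython checkout.
# The repo path, identity, and commit messages are illustrative assumptions.
demo_repo=$(mktemp -d)
git -C "$demo_repo" init -q
for msg in first second third
do
    git -C "$demo_repo" -c user.name='Demo Author' -c user.email='demo@example.com' \
        commit -q --allow-empty -m "$msg"
done

# Count commits by counting the lines that rev-list prints...
piped_count=$(git -C "$demo_repo" rev-list HEAD | wc -l)
# ...and by asking git to count them directly.
direct_count=$(git -C "$demo_repo" rev-list --count HEAD)

echo "wc -l: $piped_count, --count: $direct_count"
rm -rf "$demo_repo"
```

Both numbers should agree. Some versions of wc pad their output with spaces, which is harmless here but worth remembering when comparing strings.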
How many different people have committed to CPython?
We can use git show to extract information about a particular commit. The output of git show opens in a paged view akin to less, but the --no-pager flag instructs git to just send its output to stdout. The --no-patch flag omits the commit's diff and displays just the commit information.
%%bash
git --no-pager show --no-patch HEAD
git show also provides a --format option for printing specific information about a commit. The %an specifier prints the author's name.
%%bash
git --no-pager show --no-patch --format=%an HEAD
git rev-list can be piped into a loop that runs git show on each commit, printing the commit information one by one.
%%bash
git rev-list HEAD -n 10 |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done
Now we have the last 10 commit authors, but we want the number of unique commit authors. The uniq command sounds promising, but it only removes duplicates on adjacent lines, so we first need to sort stdin and then pipe the output of that into uniq to get unique authors.
%%bash
git rev-list HEAD -n 10 |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq
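To see why the sort matters, here is a small sketch on made-up input. The author names are placeholders rather than real CPython data, and sample_authors is a helper I've invented for the demonstration.

```shell
# Sample author names, deliberately unsorted with non-adjacent duplicates.
# The names are placeholders, not real CPython data.
sample_authors() {
    printf '%s\n' 'Ada' 'Grace' 'Ada' 'Linus' 'Grace' 'Ada'
}

# uniq alone only collapses adjacent duplicates, so nothing is removed here.
echo "uniq only:"
sample_authors | uniq

# Sorting first makes duplicates adjacent, so uniq leaves one line per author.
echo "sort then uniq:"
sample_authors | sort | uniq
```

The first pipeline prints all six lines unchanged, while the second prints each name once. sort -u is a common shorthand for sort | uniq when no counts are needed.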
Again, we can pipe the output of this to wc -l to count the number of lines in stdin, giving us the number of unique commit authors.
%%bash
git rev-list HEAD -n 10 |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort | uniq | wc -l
Removing the -n 10 from git rev-list runs this over the entire history.
%%bash
git rev-list HEAD |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq |
wc -l
Who are the top 10 committers?
With only a slight modification to the above, we can count commits per author: adding the -c flag to uniq prefixes each line with the number of times it occurs.
%%bash
git rev-list HEAD -n 10 |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c
To get the top 10 authors in order, we can re-sort the output by piping the output of uniq -c back into sort. For this sort, we'll need to pass a -r flag to sort in reverse order (printing the largest numbers first) and a -n flag to do a numeric sort.
%%bash
git rev-list HEAD -n 10 |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c |
sort -rn
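The counting and ranking steps can be tried on their own with made-up input. In this sketch, sample_authors and rank_authors are hypothetical helpers, and the final awk step is my own addition that strips the padding uniq -c puts in front of each count.

```shell
# Placeholder author names standing in for the output of the git show loop.
sample_authors() {
    printf '%s\n' 'Ada' 'Grace' 'Ada' 'Linus' 'Grace' 'Ada'
}

# Count occurrences of each name, then order by count, largest first.
# awk normalizes the padding that uniq -c puts before each count.
rank_authors() {
    sort | uniq -c | sort -rn |
        awk '{count = $1; $1 = ""; sub(/^ /, ""); print count, $0}'
}

sample_authors | rank_authors
```

Without -n, sort would compare the counts as text, so a count of 10 would sort before a count of 9.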
Finally, we can run this over the entire history and pipe the output to head to grab just the first 10 lines.
Note: this operation takes a long time. Most of that time likely goes to starting a separate git process for every commit, and the first sort also has to process one line per commit. I recommend redirecting the output to a file and running this in the background by appending >> top_authors.txt & to the command. I've done this separately and will use the top_authors.txt file from here on out.
%%bash
git rev-list HEAD |
while read commit
do
git --no-pager show --no-patch --format='%an' $commit
done |
sort |
uniq -c |
sort -rn |
head
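The loop above starts one git process per commit. As an alternative sketch, git log can print every author name from a single process, which I'd expect to be much faster on a large history. The example uses a throwaway repository with made-up authors so it runs anywhere git is installed.

```shell
# Throwaway repo with made-up authors; nothing here is real CPython data.
demo_repo=$(mktemp -d)
git -C "$demo_repo" init -q
for author in Ada Grace Ada Linus Ada Grace
do
    git -C "$demo_repo" -c user.name="$author" -c user.email="$author@example.com" \
        commit -q --allow-empty -m "commit by $author"
done

# One git process prints every author name; the rest of the pipeline is unchanged.
top_authors=$(git -C "$demo_repo" log --format='%an' | sort | uniq -c | sort -rn | head)
echo "$top_authors"
rm -rf "$demo_repo"
```

git shortlog -sn HEAD produces a similar per-author summary in one command, though its output format differs slightly from uniq -c.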
What % of all commits come from the top 10 commit authors?
So far we've used only a few commands combined with the power of piping and looping. We're about to look at a lot more commands that do very specific things. We'll build up slowly and look at the output of each intermediate step, but it's not very important what any individual command accomplishes. More important is understanding how data flows through pipes and that commands exist for doing any kind of data manipulation. Like any programming language, learning the specific pieces comes with repeated practice.
To get the percentage of commits from the top authors, we need to combine all of their commit counts and divide by the total number of commits. To get an individual author's commit count, we can use cut to extract the commit count from the author line. The -d ' ' option specifies that the line is delimited by spaces, and -f 1 specifies that we want the first column (that is, everything before the first space).
%%bash
cat top_authors.txt |
while read author
do
echo $author |
cut -d ' ' -f 1
done
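One subtlety is that uniq -c pads its counts with leading spaces, so cut only sees the count as the first field because read trims that padding first. The sketch below shows the difference on a made-up line. The count and name are placeholders.

```shell
# A line shaped like uniq -c output: padding, a count, then the name.
# The count and name are placeholders, not real CPython figures.
line='   42 Ada Lovelace'

# With the padding intact, the first space-delimited field is empty.
raw_field=$(printf '%s\n' "$line" | cut -d ' ' -f 1)

# read strips leading whitespace, so the count becomes the first field.
trimmed_field=$(printf '%s\n' "$line" |
    while read -r author; do echo "$author" | cut -d ' ' -f 1; done)

# awk splits on runs of whitespace, so it needs no trimming step.
awk_field=$(printf '%s\n' "$line" | awk '{print $1}')

echo "raw='$raw_field' trimmed='$trimmed_field' awk='$awk_field'"
```

If the loop is ever dropped, awk '{print $1}' is a drop-in way to pull the count without worrying about the padding.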
Now, to get each author's contribution as a percentage, we can run git rev-list HEAD | wc -l and assign its output to a variable. Then, we can use awk to take each author's commit count and append *100 / $num_commits to generate an expression for each author's commit percentage.
%%bash
num_commits=`git rev-list HEAD | wc -l`
cat top_authors.txt |
while read author
do
echo $author |
cut -d ' ' -f 1 |
awk -v num_commits="$num_commits" '{print $0 "*100 /" num_commits}'
done
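To isolate what the awk step does, here is a sketch with hypothetical numbers (250 commits by one author out of 1000 in total). It also shows, as an alternative, that awk could evaluate the arithmetic itself.

```shell
# Hypothetical numbers: 250 commits by one author out of 1000 in total.
num_commits=1000
author_count=250

# -v copies the shell variable into an awk variable of the same name.
# $0 is the input line, and adjacent strings are concatenated.
expression=$(echo "$author_count" |
    awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}')
echo "$expression"

# awk can also evaluate the arithmetic instead of printing an expression.
percentage=$(echo "$author_count" |
    awk -v num_commits="$num_commits" '{printf "%.2f\n", $0 * 100 / num_commits}')
echo "$percentage"
```

The first echo prints the text 250*100/1000, which is still just a string at this point. The second prints 25.00.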
To evaluate this expression, we can pipe it into bc (the name is usually expanded as "basic calculator"). The -l flag loads bc's math library, which also sets the scale to 20 decimal places so that the division is not truncated to an integer.
%%bash
num_commits=`git rev-list HEAD | wc -l`
cat top_authors.txt |
while read author
do
echo $author |
cut -d ' ' -f 1 |
awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
bc -l
done
We're almost there. The next step is to sum all of these lines to get the total commit percentage. We can use paste to join all of the lines in stdin into one. The -s flag tells paste to join the lines of a single input serially into one line (as opposed to pasting two files together side by side), the -d+ option joins the lines with a +, and - directs paste to read from stdin as opposed to a file.
%%bash
num_commits=`git rev-list HEAD | wc -l`
cat top_authors.txt |
while read author
do
echo $author |
cut -d ' ' -f 1 |
awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
bc -l
done |
paste -s -d+ -
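The paste step is easy to try on its own. In this sketch the three percentages are placeholders, and percentages is a helper I've made up to stand in for the loop's output.

```shell
# Three placeholder percentages, one per line, as the loop above would emit them.
percentages() {
    printf '%s\n' '12.5' '7.25' '3'
}

# -s joins all lines of the input into one line, -d+ uses + as the separator,
# and - tells paste to read standard input.
percentages | paste -s -d+ -
```

This prints 12.5+7.25+3, a single expression that a calculator can evaluate.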
The last step is simply to pipe this back into bc to evaluate the expression.
%%bash
num_commits=`git rev-list HEAD | wc -l`
cat top_authors.txt |
while read author
do
echo $author |
cut -d ' ' -f 1 |
awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' |
bc -l
done |
paste -s -d+ - |
bc -l
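For comparison, the same total could be computed in a single awk pass instead of the cut, bc, and paste chain. This is only a sketch of an alternative and not the approach used above. The sample data and the total of 2000 commits are invented.

```shell
# Placeholder data in the shape of top_authors.txt (uniq -c output).
# Neither the counts nor the names are real CPython figures.
sample_top_authors() {
    printf '%s\n' '    300 Ada Lovelace' '    150 Grace Hopper' '     50 Linus Torvalds'
}

# Hypothetical total; in the post this comes from git rev-list HEAD | wc -l.
num_commits=2000

# Sum the first column, then print the share of the total once input ends.
total_percentage=$(sample_top_authors |
    awk -v num_commits="$num_commits" \
        '{sum += $1} END {printf "%.2f\n", sum * 100 / num_commits}')
echo "$total_percentage"
```

Here the three counts add up to 500, so the script prints 25.00. The longer pipeline is still the better teaching tool, since each stage can be inspected separately.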
And we're done!
The Python way
Python's subprocess module makes it easy to execute arbitrary commands and read from stdout. This can be useful for writing cross-platform scripts, but it also allows you to use higher-level data structures like collections.Counter for ease and readability. Take the following equivalent script to find the top 10 commit authors.
import subprocess
from collections import Counter
from pathlib import Path

cpython_repo = Path.home() / "code" / "cpython"

# List every commit hash reachable from HEAD, one per line.
rev_list = subprocess.run(
    ["git", "rev-list", "HEAD"],
    cwd=cpython_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
commit_hashes = rev_list.stdout.split("\n")[:-1]

# Tally one author name per commit.
commit_authors = Counter()
for commit_hash in commit_hashes:
    commit_author = subprocess.run(
        [
            "git", "--no-pager", "show", commit_hash,
            "--oneline", "--no-patch", "--format=%an"
        ],
        cwd=cpython_repo,
        stdout=subprocess.PIPE,
        encoding="utf-8",
    )
    commit_authors.update([commit_author.stdout.strip()])

for commit_author in commit_authors.most_common(10):
    print(commit_author)
Parting thoughts
I've been heavily influenced by Gary Bernhardt's content and this post in particular is inspired by his talk on The Unix Chainsaw. Gary's website, Destroy All Software, contains a ton of great screencasts on a wide range of topics and watching him work is a mesmerizing demonstration of vim and unix mastery.
I've also found this post on Text Processing in the Unix Shell to be a great introduction and resource. For git wisdom, I always recommend the Missing Semester of Your CS Education lecture on git, which gives a fantastic overview of git's internals.
Finally, as promised, here is the script above smashed into a single bash line.
num_commits=$(git rev-list HEAD | wc -l); cat top_authors.txt | while read author; do echo $author | cut -d ' ' -f 1 | awk -v num_commits="$num_commits" '{print $0 "*100/" num_commits}' | bc -l; done | paste -s -d+ - | bc -l