Visualizing the growth of a codebase with git and tokei
tokei is a neat tool for providing statistics on a codebase. Let's look at it run over the source code of git.
import json
import subprocess
from pathlib import Path
git_repo = Path.home() / "code" / "git"
tokei_run = subprocess.run(
    ["tokei"],
    cwd=git_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
print(tokei_run.stdout)
It's rather amazing that the source code contains nearly 300,000 lines of C code and just as many lines of shell scripts. It's also interesting to consider how that's changed over time. We can look at this by analyzing each version of git and plotting its growth.
Let's start by looking at the tags in git's history with the git tag command.
git_tag = subprocess.run(
    ["git", "tag"],
    cwd=git_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
tags = git_tag.stdout.split("\n")
The raw tag output has a couple of things that we'd like to fix. One is that it includes a number of tags for gitgui versions, which we're not interested in.
tags[:3]
The other is that git prints tags in plain string order, comparing them character by character, which means that the tag v2.19 comes before v2.2 even though it's a later version number.
tags[619:624]
git tag provides two handy switches to resolve this. The first is --list pattern, which only lists tags that match the given pattern. The pattern is a shell-style glob rather than a regular expression, so v[0-9].[0-9]*.0 matches tags that start with a v, followed by a single digit, a period, another digit, any run of characters (including none), and a final .0.
The second handy switch is --sort=version:refname, which sorts the tags as version numbers so that v2.19 comes after v2.2.
git_tag = subprocess.run(
    [
        "git", "tag", "--list", "v[0-9].[0-9]*.0",
        "--sort=version:refname",
    ],
    cwd=git_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
# the output ends with a newline, so split("\n") would leave
# an empty string as the last element; splitlines() avoids that
tags = git_tag.stdout.splitlines()
print(tags)
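To see the glob pattern and the version-aware ordering at work without a git checkout, here is a minimal sketch in plain Python. The tag names in sample_tags are a hypothetical sample, version_key is a helper made up for this illustration, and Python's fnmatch is only an approximation of git's own pattern matching, so treat it as a way to build intuition rather than as what git does internally.

```python
import fnmatch
import re

# Hypothetical sample of tag names, chosen to exercise the pattern.
sample_tags = [
    "gitgui-0.6.0", "v1.0.0", "v2.19.0", "v2.2.0", "v2.2.1", "v2.20.0-rc0",
]

pattern = "v[0-9].[0-9]*.0"

# Shell-style globbing: "." is a literal period and "*" matches
# any run of characters, including an empty one.
releases = [t for t in sample_tags if fnmatch.fnmatchcase(t, pattern)]


def version_key(tag):
    """Split a tag into integer components so that 2.19 sorts after 2.2."""
    return [int(part) for part in re.findall(r"\d+", tag)]


print(sorted(releases))                   # plain string order
print(sorted(releases, key=version_key))  # version-aware order
```

The gitgui tag, the v2.2.1 patch release, and the release candidate are all filtered out, because none of them starts with a v and ends in .0 in the way the pattern requires.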
Now that we've got all versions captured in order, let's run tokei over v1.0.0 and see what git looked like at the time.
subprocess.run(
    ["git", "checkout", tags[0]],
    cwd=git_repo,
    encoding="utf-8",
    check=True,
)
tokei = subprocess.run(
    ["tokei"],
    cwd=git_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
print(tokei.stdout)
26,000 lines of C and nearly 15,000 lines of shell scripts are nothing to sneeze at, but let's look again at the statistics of git today.
subprocess.run(
    ["git", "checkout", tags[-1]],
    cwd=git_repo,
    encoding="utf-8",
    check=True,
)
tokei = subprocess.run(
    ["tokei"],
    cwd=git_repo,
    stdout=subprocess.PIPE,
    encoding="utf-8",
)
print(tokei.stdout)
At 280,000 lines of C and 260,000 lines of shell scripts, git has grown to over ten times the size of its original release in 2005!
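As a quick sanity check on that comparison, the growth factors can be worked out from the rounded figures quoted above. This is only arithmetic on those rounded numbers (with "nearly 15,000" taken as 15,000), not a fresh measurement, and the variable names are made up for the example.

```python
# Rounded line counts quoted in the text, not exact tokei output.
c_then, sh_then = 26_000, 15_000
c_now, sh_now = 280_000, 260_000

c_growth = c_now / c_then
sh_growth = sh_now / sh_then
total_growth = (c_now + sh_now) / (c_then + sh_then)

print(f"C: {c_growth:.1f}x, shell: {sh_growth:.1f}x, combined: {total_growth:.1f}x")
```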
Finally, let's use seaborn to visualize git's growth over time. We'll check out each version one by one, run tokei --output json, and store the number of lines of C code in each version.
line_lens = []
for tag in tags:
    # move the working tree to this release; stop if the checkout fails
    subprocess.run(
        ["git", "checkout", tag],
        cwd=git_repo,
        encoding="utf-8",
        check=True,
    )
    tokei = subprocess.run(
        ["tokei", "--output", "json"],
        cwd=git_repo,
        stdout=subprocess.PIPE,
        encoding="utf-8",
    )
    # look up the statistics for C and keep its count of code lines
    tokei_json = json.loads(tokei.stdout)
    c_lines = tokei_json["C"]["code"]
    line_lens.append(c_lines)
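The lookup in the loop assumes that every tag's report contains a "C" entry. That holds for git, but a slightly more defensive version is sketched below for repositories where a language might be missing from early releases. The payload in sample_output is hand-written with made-up numbers in the shape the loop relies on (language name mapped to an object with a "code" field), and code_lines is a hypothetical helper, not part of tokei. The exact JSON layout may differ between tokei versions, so it is worth checking against the real output first.

```python
import json

# Hand-written, made-up payload in the shape the loop above relies on.
sample_output = json.dumps({
    "C": {"blanks": 10, "code": 120, "comments": 30},
    "Shell": {"blanks": 5, "code": 40, "comments": 8},
})


def code_lines(tokei_stdout, language):
    """Return the "code" count for one language, or 0 if it is absent."""
    report = json.loads(tokei_stdout)
    return report.get(language, {}).get("code", 0)


print(code_lines(sample_output, "C"))
print(code_lines(sample_output, "Rust"))
```

Returning 0 for a missing language keeps the list of counts aligned with the list of tags, which matters once both are passed to the plot.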
We'll plot the tags on the x-axis (only displaying every 5th tag for space) and the number of lines of C code on the y-axis to see the rate at which git's codebase has grown by version.
import seaborn as sns
# scatterplot returns the matplotlib Axes it drew on
ax = sns.scatterplot(x=tags, y=line_lens)
for ind, label in enumerate(ax.get_xticklabels()):
    # keep every 5th label and hide the rest
    label.set_visible(ind % 5 == 0)
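The every-5th-label rule can also be checked on its own, without drawing anything. The sketch below uses a hypothetical helper, thin_labels, that blanks out the labels the plot would hide. It only computes the label text, and how that text is applied to the axes is left out.

```python
def thin_labels(labels, step=5):
    """Keep every `step`-th label and replace the others with an empty string."""
    return [label if i % step == 0 else "" for i, label in enumerate(labels)]


# Hypothetical tag names, just to show which positions survive.
example = [f"v2.{n}.0" for n in range(12)]
print(thin_labels(example))
```

With twelve labels and a step of 5, only the labels at positions 0, 5, and 10 remain.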