Grok Git Repos
Introduction
Git is often perceived as a necessary evil, a tool to be tolerated rather than mastered.
But treating your primary work tool as a mystery is a vulnerability. When the abstraction leaks — and it always does — the developer who understands the internals can surgically repair the damage. In contrast, those who rely on surface-level knowledge often leave behind a sloppy timeline. A history riddled with vague messages and botched merges signals a lack of care that code quality alone cannot hide.
Unfortunately, we often learn git through folklore — memorizing obscure commands to save us when things go wrong, without understanding why they work. We rely on luck rather than logic.
To escape this ritualistic dependency, we must strip away the interface. To truly grok git, we will ignore the high-level commands and build a repository from the ground up, proving that the magic is actually just simple, elegant engineering.
Creating a Git Repository from Scratch
Let's create a folder for us to work in.
Now, we have a folder. A folder isn't a git repository, as we can verify.
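The two steps so far can be sketched as below. The scratch parent directory from mktemp is an assumption added here so the example does not land inside an existing repository; the folder name grok comes from the text, and inside_repo is only an illustrative variable.

```shell
# Start in a fresh scratch directory (an assumption, to avoid nesting inside a repo).
cd "$(mktemp -d)"

# Create the working folder and enter it.
mkdir grok
cd grok

# Ask git for the status; outside a repository this fails with
# "fatal: not a git repository (or any of the parent directories): .git".
if git status >/dev/null 2>&1; then
    inside_repo=yes
else
    inside_repo=no
fi
echo "inside a repository: $inside_repo"
```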
As the status subcommand tells us, neither the directory we just created nor any of its parent directories is a git repository.
If that wasn't the case for you, it is because you created the folder grok as a subdirectory of a git repository.
The error mentions a .git folder. It turns out that when we speak of a git repository, we are in most cases referring to this folder[1]. So, in theory, creating a git repository should be simple enough, right? Just create a .git folder and we are good to go?
Ohh, apparently not so.
If we were to look up the definition of a repository in the git glossary, we would see the following.
A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism.
Ahh, simple. Seems like we need to have some refs and an object database with all the objects refs can reach. Also, there might be meta data from porcelains?
Although the definition may seem complex, the underlying concept is straightforward. Rather than summarizing it briefly, we will demonstrate its simplicity through a practical, step-by-step example.
First, we need an object database. Well, database sounds a bit hard, we could just make a folder for it instead in our git repository.
Now, we also need a collection of refs. No issue, just make a folder for them.
What are heads? Don't worry, we'll get to it. Now we have the bare minimum for a repository though, right?
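Put together, the directory-only attempts so far look roughly like this. It is a sketch run in a scratch directory (an assumption), and the recognised variable exists only for illustration.

```shell
cd "$(mktemp -d)"

mkdir .git                  # on its own, an empty .git is not accepted as a repository
mkdir .git/objects          # the object "database" is just a directory
mkdir -p .git/refs/heads    # the collection of refs, with a heads folder inside

# Record whether git recognises the directory yet (illustrative variable).
if git status >/dev/null 2>&1; then recognised=yes; else recognised=no; fi
echo "recognised as a repository: $recognised"
```

As long as the scratch directory is not nested inside another repository, git should still refuse at this point.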
Well, let's try to find out!
It appears we are missing a critical component.
Now, is this finally a repository? Let's see what git thinks the status is.
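Besides the two folders, git's repository detection also looks for a HEAD file. A minimal sketch, assuming HEAD is written as a symbolic ref to a branch called master (the branch name is an assumption made here, and the branch does not have to exist yet):

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads

# HEAD names the current branch; that branch has no commits yet.
echo 'ref: refs/heads/master' > .git/HEAD

# git now accepts the directory and reports on an empty repository.
git status

# rev-parse confirms which directory git treats as the repository.
git_dir=$(git rev-parse --git-dir)
echo "git directory: $git_dir"
```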
Seems like it is! …So now we have a git repository, that was easy.
Committing to a Repository
Creating Objects
While creating a git repository manually is an interesting exercise, it is of little value unless we can populate it with files.
We likely will want to fill this repository with text files. For the occasion, you can use fortune to generate something, but if you use what I got, you'll be able to compare your hashes to the ones in the article.
"If a listener nods his head when you're explaining your program, wake him up.
That is very wise indeed. We want to add this to the repository. But how?
Well, one way to add it is the following, using the git hash-object subcommand.
$ echo "If a listener nods his head when you're explaining your program, wake him up." | git hash-object --stdin -w
665e95f1674e9466cb429bdfebaf1b8792ef0eec
Okay, two questions now. What is 665e95f1674e9466cb429bdfebaf1b8792ef0eec, and what did hash-object just do?
We can inspect the git repository to investigate. Examining the directory tree structure provides some insight.
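One way to do that inspection, sketched in a scratch repository built by hand (the HEAD target and the object_path variable are assumptions for illustration), is to list every file under .git with find:

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
echo "$hash"

# Every regular file now under .git: HEAD plus one new object file.
find .git -type f | sort

# The object file sits in a folder named after the first 2 hash characters.
object_path=".git/objects/$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"
ls -l "$object_path"
```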
Here, we see that our object “database” in the objects folder has seen some change. There is a folder, 66, with a file that has an equally vexing name, 5e95f1674e9466cb429bdfebaf1b8792ef0eec.
I wonder what is in this file. Let's try to take a look at it.
The file does not contain plain text; it appears to be in a binary format. We can use the file command to determine its type.
Apparently, it's some data compressed with zlib. That means it's likely DEFLATE compressed (RFC 1951) in a zlib wrapper (RFC 1950)[2].
There is actually a neat hack we can use to view this data[3].
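A sketch of that hack, following the description in the footnote. It assumes GNU gzip, which writes out the decompressed data before it notices the missing gzip trailer; the octal escapes spell out the bytes 1f 8b 08 followed by five zero bytes, and tr is used only to make the NUL byte in the output visible as a space.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
obj=".git/objects/$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"

# Prepend the first 8 bytes of a gzip header (magic number, method, zeroed fields).
# The 2-byte zlib header then lands where gzip expects its last 2 header bytes.
# gzip complains about the missing trailer, so its stderr is discarded.
decoded=$(printf '\037\213\010\000\000\000\000\000' \
    | cat - "$obj" \
    | gzip -dc 2>/dev/null \
    | tr '\000' ' ')
echo "$decoded"
```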
We have located our string. The output begins with blob, which specifies the git object’s file type, followed by 78, indicating the blob’s size in bytes[4].
Objects in git have 3 primary types:
- blob
- tree
- commit
When we ran the hash-object subcommand, we created a blob. As for the name, it is the SHA-1 hash (RFC 3174) of the object's contents together with a small header of metadata[5]. The actual output of the command was this hash, written as 40 hexadecimal characters.
The name of the directory in the object database is the first two characters of the hash, that is 66, and the object file's name is the remaining 38 characters.
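To make the naming less mysterious, the hash can be recomputed by hand. The sketch below assumes sha1sum from GNU coreutils is available; it hashes the header we saw in the decompressed data (the type, a space, the size, and a NUL byte) followed by the content, and compares the result with what git itself reports. The variable names are illustrative.

```shell
quote="If a listener nods his head when you're explaining your program, wake him up."

# echo appends a newline, so the stored content is the sentence plus "\n".
size=$(printf '%s\n' "$quote" | wc -c | tr -d ' ')

# Hash "blob <size>" + NUL byte, followed by the content itself.
manual=$(printf 'blob %s\0%s\n' "$size" "$quote" | sha1sum | cut -d ' ' -f 1)

# Ask git for the hash of the same content without writing anything (no -w).
from_git=$(printf '%s\n' "$quote" | git hash-object --stdin)

echo "size:     $size"
echo "manual:   $manual"
echo "from git: $from_git"
```

Both hashes should be identical, which is why the same content always ends up in the same object file.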
Manually decompressing objects for inspection is inefficient. Instead, we can use the hash 665e95f1674e9466cb429bdfebaf1b8792ef0eec to inspect the newly created object.
Inspecting Objects
We do this with the git cat-file subcommand. To see the file type of an object, we use the type flag -t.
That means that the type (-t) of the object we made is a blob, as we saw by inspecting it. What if we wanna see the contents of the blob? We use -p for print.
This procedure is significantly more convenient than the manual decompression hack used earlier.
Also, another useful flag to know is -s, for size.
This matches the 78 observed earlier, confirming the object size in bytes.
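The three cat-file invocations can be replayed together as a sketch in a scratch repository (the HEAD target and variable names are assumptions for illustration):

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)

type=$(git cat-file -t "$hash")      # object type
content=$(git cat-file -p "$hash")   # pretty-printed contents
size=$(git cat-file -s "$hash")      # size in bytes

echo "type: $type"
echo "size: $size"
echo "content: $content"
```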
Index
Okay, enough about the blob. It seems like we have added an object – with some text – into the git repository now. We might wonder if this is reflected in the git status of the repo.
Adding an object to the database stores the data, but it does not tell git where that file belongs in your project. This is the purpose of the index (or staging area): to build the tree structure for the next commit. Without being registered in the index, our blob is essentially a “loose” object. If it remains unreferenced, it will eventually be removed by git gc (garbage collection).
To actually commit our object, we must first register it in the index.
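A sketch of that registration step with git update-index, using the separate-parameter form of --cacheinfo. The scratch repository setup and the ls-files check at the end are additions for illustration.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)

# Register the blob in the index under the name truth.txt, as a normal file.
git update-index --add --cacheinfo 100644 "$hash" truth.txt

# Show what the index now contains: mode, object hash, stage number, path.
staged=$(git ls-files --stage)
echo "$staged"
```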
The long string is the hash of the blob object we created, and truth.txt is a fitting filename.
What about 100644? Looking at man git update-index we see that…
--cacheinfo <mode>,<object>,<path>,
--cacheinfo <mode> <object> <path>
Directly insert the specified info into the index.
For backward compatibility, you can also give these
three arguments as three separate parameters, but
new users are encouraged to use a single-parameter
form.
Although the documentation suggests the single-parameter form for new users, we have used the separate parameter form here for clarity. The value 100644 is the mode of the file. In git, this mode is a 6-digit octal number that defines the file type and permissions, similar to standard UNIX file modes but restricted to a few specific values.
But crucially, for blobs in git, we only have 3 modes available[6]:
- 100644 a normal file
- 100755 an executable file
- 120000 a symbolic link
Let's take a look at the repo structure again.
Here we notice that there now is an index file. So all we did was add some index file? …well, slow down. Let's do a sanity check with git status.
Aha!
So we did something, we staged (added) the truth.txt file… but we also have unstaged changes saying we deleted it?
Well, yes. We have told git that the blob object we created is in the working-tree, and it has the actual blob for this in its object database… but apparently can't find the file we told it about. This makes sense, since we just created the blob from stdin and didn't actually add any files.
Because the file exists in the index but is missing from the actual working directory, git correctly assumes it has been removed.
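The same situation can be seen compactly with the short status format, where the first column describes the index and the second the working tree. This is a sketch in a scratch repository; the HEAD target and the short variable are assumptions for illustration.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$hash" truth.txt

# "A" in the first column: added to the index.
# "D" in the second column: missing from the working tree.
short=$(git status --short)
echo "$short"
```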
Hmm, while we are at it, let's take a look at .git/index, just to see what that is about.
This confirms it is not a text file. We can verify its type using the file command.
The output identifies it as a Git index, version 2 with 1 entries. This single entry corresponds to the file we just added. To see what this actually looks like on disk, we can inspect the raw binary data.
(I removed the offset from the output)
The first 4 bytes, DIRC, are the “magic number”, a constant value used to identify the file format (short for “Directory Cache”). The next 4 bytes hold the version number as a big-endian integer, so the 02 is preceded by three zero bytes[7], and the 4 bytes after that hold the number of entries. The rest of the file contains the entry data, including the filename truth.txt.
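The header fields can also be pulled out individually. The sketch below uses head and od, with od's -j (skip) and -N (count) options selecting byte ranges; it assumes git writes a version 2 index, which is what the file command reported above. The variable names are illustrative.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$hash" truth.txt

# Bytes 0-3: the signature.
magic=$(head -c 4 .git/index)

# Bytes 4-7: the version, printed as four unsigned bytes.
version=$(od -An -tu1 -j4 -N4 .git/index)

# Bytes 8-11: the number of index entries, also four unsigned bytes.
entries=$(od -An -tu1 -j8 -N4 .git/index)

echo "magic:   $magic"
echo "version: $version"
echo "entries: $entries"
```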
If we were to add another file, patterns would likely emerge in the binary structure. Investigating how the index looks in a larger project is an excellent way to learn more, so I have included it in the Exercises section below.
Tree Objects
Back on track, we've now made the index thingy, and we'll have to write it with the write-tree subcommand.
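A sketch of that step in a scratch repository (setup and variable names are assumptions for illustration). write-tree prints the hash of the object it creates, which we can immediately ask cat-file about:

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$hash" truth.txt

# Turn the current index into an object in the database.
tree=$(git write-tree)
echo "$tree"

tree_type=$(git cat-file -t "$tree")
tree_size=$(git cat-file -s "$tree")
echo "type: $tree_type, size: $tree_size bytes"
```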
git status does not seem to have changed, but if we check the repo's file structure, we'll see that something did change.
We notice the a6325f064bac723691f20c0b1ed2bea82a1728fd sha-1 hash refers to a new object. Interesting, wonder what this is.
To thoroughly understand this object, we will inspect it manually by examining the raw file. This is the “hard way,” but it offers the most insight into the underlying data structure.
Okay, that's zlib compressed data. We know that, so let's use the same hack to figure out what is inside of this object.
This object is not a blob, but a tree — one of the three primary git object types. The number 37100644 initially appears confusing, but it is actually two separate values concatenated: the size (37 bytes) and the file mode (100644). Following this are the filename and a binary sequence f^gNfB.
The sequence f^gNfB matches data we previously observed in the hexdump of the .git/index file. Identifying its specific purpose is left as an exercise for the reader.
Let's look at the contents the canonical way.
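The canonical way is cat-file again, this time pointed at the tree. A sketch in a scratch repository (setup and variable names are assumptions for illustration):

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

hash=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$hash" truth.txt
tree=$(git write-tree)

# Each line of a pretty-printed tree is: mode, object type, hash, then a tab and the name.
listing=$(git cat-file -p "$tree")
echo "$listing"
```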
An Experiment with Tree Objects
An important concept to explore is how a tree object handles multiple files. To demonstrate this, we will create another blob.
And add it to our index file.
Now, let's look at the repository. We have 3 objects in our objects database.
Let's dump the index binary.
Nothing too spectacular here: we see both amogus.txt and truth.txt. We also see TREE at the end of the file, which must be our tree object.
However, what happens when we run write-tree?
Let's dump the index again.
The data following the TREE identifier has increased in size. We will now inspect the contents of this newly written tree.
Now we seem to have two files inside of the tree.
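The whole experiment can be replayed as a sketch. The content of the second blob is not shown in the text, so the string used for amogus.txt below is a placeholder (an assumption), which means the resulting tree hash will differ from the one in the article.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

truth=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$truth" truth.txt
first_tree=$(git write-tree)

# Second blob; this content is a placeholder, not the article's original.
amogus=$(echo "placeholder content for the second file" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$amogus" amogus.txt

# Writing the index again produces a new tree; the first one stays in the database.
second_tree=$(git write-tree)
names=$(git ls-tree --name-only "$second_tree")
echo "$names"
```

Tree entries are stored sorted by name, so amogus.txt is listed before truth.txt.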
Time to Commit
So now, it's time to create a commit. Above, when we ran write-tree, we created the tree object aee76412ed220742aeaf02ca1c50519bcea013e1. This includes both of the blobs we made.
So how do we commit that? It's actually really simple, we just use git commit-tree.
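A sketch of the commit step. The identity variables, the commit message, and the content of amogus.txt are all placeholders (assumptions), so the commit hash will not match the article's; commit-tree reads the message from standard input here.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # HEAD target is an assumption

truth=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
amogus=$(echo "placeholder content for the second file" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$truth" truth.txt
git update-index --add --cacheinfo 100644 "$amogus" amogus.txt
tree=$(git write-tree)

# Placeholder identity so commit-tree works without any git configuration.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'

# Wrap the tree in a commit object; the hash of the new commit is printed.
commit=$(echo 'Initial commit' | git commit-tree "$tree")

commit_type=$(git cat-file -t "$commit")
echo "type: $commit_type"
git cat-file -p "$commit"
```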
If we now look at the .git repository, we will see a new object, 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 created. We can verify the type of this new object.
Unsurprisingly, it's a commit object. If we look at what is inside, we see something that may look familiar to people who regularly use git.
A fun thing to do now is to try and run git log.
Seems like our branch is broken, huh? Let's fix that. Let's quickly remind ourselves of the contents of .git/HEAD.
This doesn't refer to anything. We can easily fix this, let's make a new branch. But what git subcommand will we use this time? None actually, as it turns out, branches are just files in .git/refs/heads/ that contain the sha-1 hash of some commit. The name of the file becomes the name of the branch.
Now we just need to switch to that branch. Instead of doing git switch main we can just change the reference in the repository directory .git.
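Both file edits can be sketched as follows. The initial HEAD target (master), the identity variables, the commit message, and the content of amogus.txt are placeholders (assumptions); the branch name main comes from the text.

```shell
cd "$(mktemp -d)"
mkdir -p .git/objects .git/refs/heads
echo 'ref: refs/heads/master' > .git/HEAD   # initial HEAD target is an assumption

truth=$(echo "If a listener nods his head when you're explaining your program, wake him up." \
    | git hash-object --stdin -w)
amogus=$(echo "placeholder content for the second file" | git hash-object --stdin -w)
git update-index --add --cacheinfo 100644 "$truth" truth.txt
git update-index --add --cacheinfo 100644 "$amogus" amogus.txt
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'
commit=$(echo 'Initial commit' | git commit-tree "$(git write-tree)")

# HEAD points at a branch with no commits, so git log has nothing to show.
git log >/dev/null 2>&1 || echo "git log fails before the branch exists"

# A branch is a file holding a commit hash; the file name is the branch name.
echo "$commit" > .git/refs/heads/main

# "Switching" branches is rewriting HEAD to name that file.
echo 'ref: refs/heads/main' > .git/HEAD

git log --format='%H %s'
```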
If you're following along, and you have some PS1 git branch feature, you may have noticed something incredible just after running that command.
First, if we run git status now, we see that there are no longer any changes to be committed.
Also, we can now run git log!
Something else that's cool is that we can run git log --format=raw and see output similar to what we got from git cat-file -p on the commit object's SHA-1.
But what about those files that are deleted? We can solve that with git checkout like this.
Now if we run git status, it reports a clean working tree.
With this, we have successfully created a commit from scratch.
Conclusions
Starting from an empty directory, we have successfully initialized a repository, staged files, created a branch, and committed our work — all without using the standard git commands.
Is this efficient? Certainly not.
However, by bypassing the porcelain commands, we have exposed the elegant simplicity of git’s plumbing. Understanding how objects, trees, and references interact transforms the system from a black box into a logical machine. This deeper understanding will help you internalize the tool, making complex operations intuitive rather than terrifying.
There are still open questions: how does a pull work? How do we manually rebase? That is a topic for another day — or perhaps a therapy session. For now, we will leave them as an exercise for the curious reader.
Exercises
- What does the hexdump -C look like of the .git/index if we add another blob with hash-object and update-index?
  - Can you discern any patterns from this? What more can you learn about the index file format?
- What does the zlib compressed data look like inside of the tree object after we add another blob?
- What does the .git/index look like after running update-index with a tree blob in the object database?
- What does the .git/index look like after running update-index and write-tree with a tree blob in the object database?
- Can you figure out what the mysterious f^gNfB means?
- What does the hexdump -C look like of the .git/index inside a larger project?
  - Can you discern any patterns from this? What more can you learn about the index file format?
- Does changing the size of the blob from 78 to some other number inside of the zlib compressed data influence git cat-file -s? You will need to decompress and recompress the data from and into the proper zlib format[8].
Footnotes
We don't have to put our git repo in .git. We could use the GIT_DIR environment variable, or the --git-dir=<path> flag.
The zlib wrapper (RFC 1950), unlike the gzip wrapper (RFC 1952), doesn't store the file name and other file system information, which is fine, considering how git manages this elsewhere.
From this unix.stackexchange answer. Here, we concatenate the gzip magic number and compression method, and concatenate (the actual reason for cat existing) this with the file. We then pipe it into gzip, which can now understand and decompress it. Still, we didn't finish the file with the 8 byte footer, so gzip gets confused, but that doesn't matter, we get to see the data regardless.
If you're interested in finding out how the hash is generated, start here.
Read more here: https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects
The mortal enemy of C has many names: empty bytes, null bytes, nop bytes.
One approach could be to use the hack, with additional padding at the end, to extract the file. Then, after changing the number, compress it again and remove the prepended and appended gzip bytes.
