Grok Git Repos


Introduction

Git is often perceived as a necessary evil, a tool to be tolerated rather than mastered.

But treating your primary work tool as a mystery is a vulnerability. When the abstraction leaks — and it always does — the developer who understands the internals can surgically repair the damage. In contrast, those who rely on surface-level knowledge often leave behind a sloppy timeline. A history riddled with vague messages and botched merges signals a lack of care that code quality alone cannot hide.

Unfortunately, we often learn git through folklore — memorizing obscure commands to save us when things go wrong, without understanding why they work. We rely on luck rather than logic.

To escape this ritualistic dependency, we must strip away the interface. To truly grok git, we will ignore the high-level commands and build a repository from the ground up, proving that the magic is actually just simple, elegant engineering.

Creating a Git Repository from Scratch

Let's create a folder for us to work in.

$ mkdir grok
$ cd grok

Now, we have a folder. A folder isn't a git repository, as we can verify.

$ git status
fatal: not a git repository (or any of the parent directories): .git

As the status subcommand tells us, neither the directory we just created nor any of its parent directories is a git repository.

If that wasn't the case for you, that would be because you created the folder grok as a subdirectory of a git repository.

The error seems to mention a .git folder. Turns out, when speaking of a git repository, we are actually referring to this folder in most cases[1]. So, in theory, it should be simple enough to create a git repository, right? Just create a .git folder and we are good to go?

$ mkdir .git
$ git status
fatal: not a git repository (or any of the parent directories): .git

Ohh, apparently not so.

If we were to look up the definition of a repository in the git glossary, we would see the following.

A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism.

Ahh, simple. Seems like we need to have some refs and an object database with all the objects refs can reach. Also, there might be meta data from porcelains?

Although the definition may seem complex, the underlying concept is straightforward. Rather than summarizing it briefly, we will demonstrate its simplicity through a practical, step-by-step example.

First, we need an object database. Well, database sounds a bit hard, we could just make a folder for it instead in our git repository.

$ mkdir .git/objects

Now, we also need a collection of refs. No issue, just make a folder for them.

$ mkdir -p .git/refs/heads

What are heads? Don't worry, we'll get to it. Now we have the bare minimum for a repository though, right?

Well, let's try to find out!

$ git status
fatal: not a git repository (or any of the parent directories): .git

It appears we are missing a critical component: the HEAD file, which records what is currently checked out.

$ echo "ref: refs/" > .git/HEAD

Now, is this finally a repository? Let's see what git thinks the status is.

$ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Seems like it is! …So now we have a git repository, that was easy.
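For readers who prefer a script over typing the commands, here is a minimal Python sketch of the same three steps. The helper name make_skeleton is made up for illustration, and the script only reproduces the layout in a temporary directory; it does not ask git whether it accepts the result.

```python
import tempfile
from pathlib import Path


def make_skeleton(root: Path) -> Path:
    """Create the same pieces as the mkdir/echo steps above (hypothetical helper)."""
    git_dir = root / ".git"
    (git_dir / "objects").mkdir(parents=True)          # the object "database"
    (git_dir / "refs" / "heads").mkdir(parents=True)   # the collection of refs
    # Same content as the echo command above (echo appends the newline).
    (git_dir / "HEAD").write_text("ref: refs/\n")
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = make_skeleton(Path(tmp))
    # List everything below .git, relative to it.
    layout = sorted(p.relative_to(git_dir).as_posix() for p in git_dir.rglob("*"))

print(layout)
```

Three directories and one small text file: that is the whole skeleton.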

Committing to a Repository

Creating Objects

While creating a git repository manually is an interesting exercise, it is of little value unless we can populate it with files.

We likely will want to fill this repository with text files. For the occasion, you can use fortune to generate something, but if you use what I got, you'll be able to compare your hashes to the ones in the article.

If a listener nods his head when you're explaining your program, wake him up.

That is very wise indeed. We want to add this to the repository. But how?

Well, one way to add it is the following, using the git hash-object subcommand.

$ echo "If a listener nods his head when you're explaining your program, wake him up." | git hash-object --stdin -w
665e95f1674e9466cb429bdfebaf1b8792ef0eec

Okay, two questions now. What is 665e95f1674e9466cb429bdfebaf1b8792ef0eec, and what did hash-object just do?

We can inspect the git repository to investigate. Examining the directory tree structure provides some insight.

$ tree .git
.git/
├── HEAD
├── objects
│   └── 66
│       └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
└── refs
    └── heads

4 directories, 2 files

Here, we see that our object “database” in the objects folder has seen some change. There is a folder, 66, with a file that has an equally vexing name, 5e95f1674e9466cb429bdfebaf1b8792ef0eec.

I wonder what is in this file. Let's try to take a look at it.

$ cat .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec
WKX
  Y$b   `݆!rB-8}ɢ,xSOŋE598}y?

                            %

The file does not contain plain text; it appears to be in a binary format. We can use the file command to determine its type.

$ file .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec
.git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec: zlib compressed data

Apparently, it's some data compressed with zlib. That means it's likely DEFLATE compressed (RFC 1951) in a zlib wrapper (RFC 1950)[2].

There is actually a neat hack we can use to view this data[3].

$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/66/5e95f1674e9466cb429bdfebaf1b8792ef0eec | gzip -dc
blob 78If a listener nods his head when you're explaining your program, wake him up.

gzip: stdin: unexpected end of file

We have located our string. The output begins with blob, which specifies the git object’s type, followed by 78, indicating the blob’s size in bytes[4]. A NUL byte, which the terminal does not display, separates this header from the content.
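The gzip trick works, but the same inspection can be sketched with Python's zlib module. This is a minimal sketch, not how git does it internally: the function name parse_loose_object is made up, and instead of reading the file from .git/objects we rebuild equivalent bytes in memory (the compressed bytes may differ from what git wrote, the decompressed bytes should not).

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Split a zlib-compressed object into (type, size, body). Hypothetical helper."""
    raw = zlib.decompress(compressed)
    # The header and the content are separated by a single NUL byte.
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), body


# Stand-in for the object file: header + NUL + content, zlib-compressed.
content = b"If a listener nods his head when you're explaining your program, wake him up.\n"
stored = zlib.compress(b"blob %d\x00" % len(content) + content)

obj_type, size, body = parse_loose_object(stored)
print(obj_type, size)
print(body.decode(), end="")
```

To try it on the real object, replace stored with the bytes read from the file under .git/objects.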

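Objects in git have 3 primary types: blob (file contents), tree (a directory listing), and commit (a snapshot plus metadata). A fourth type, tag, is used for annotated tags.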

When we ran the hash-object subcommand, we created a blob. As for the name, it is the sha-1 hash (RFC 3174) of the object: a short header (the type, the size, and a NUL byte) followed by the content[5]. This hash is what the command printed, and it is 40 hexadecimal characters long.

The name of the directory in the object database is the first two characters of the hash, that is 66, and the object file's name is the remaining 38 characters.
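To make the naming scheme concrete, here is a small Python sketch that hashes the header plus content and derives the path inside the object database. The function names git_object_id and loose_object_path are invented for this example. If the bytes of the quote match what we piped into hash-object, the first id printed should be the one from above.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """sha-1 over '<type> <size>' + NUL + content (hypothetical helper)."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters name the folder, the other 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


quote = b"If a listener nods his head when you're explaining your program, wake him up.\n"
oid = git_object_id("blob", quote)
print(oid)
print(loose_object_path(oid))

# A well-known reference value: the blob id of "hello world\n".
print(git_object_id("blob", b"hello world\n"))
```

Note that the file name never enters the hash. Two files with identical content share one blob.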

Manually decompressing objects for inspection is inefficient. Instead, we can use the hash 665e95f1674e9466cb429bdfebaf1b8792ef0eec to inspect the newly created object.

Inspecting Objects

We do this with the git cat-file subcommand. To see the type of an object, we use the type flag -t.

$ git cat-file -t 665e95f1674e9466cb429bdfebaf1b8792ef0eec
blob

That means that the type (-t) of the object we made is a blob, as we saw by inspecting it. What if we wanna see the contents of the blob? We use -p for print.

$ git cat-file -p 665e95f1674e9466cb429bdfebaf1b8792ef0eec

If a listener nods his head when you're explaining your program, wake him up.

This procedure is significantly more convenient than the manual decompression hack used earlier.

Also, another useful flag to know is -s, for size.

$ git cat-file -s 665e95f1674e9466cb429bdfebaf1b8792ef0eec
78

This matches the 78 observed earlier, confirming the object size in bytes.

Index

Okay, enough about the blob. It seems like we have added an object – with some text – into the git repository now. We might wonder if this is reflected in the git status of the repo.

$ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Adding an object to the database stores the data, but it does not tell git where that file belongs in your project. This is the purpose of the index (or staging area): to build the tree structure for the next commit. Without being registered in the index, our blob is unreachable: nothing refers to it. If it remains unreferenced, it will eventually be removed by git gc (garbage collection).

To actually commit our object, we must first register it in the index.

$ git update-index --add --cacheinfo 100644 665e95f1674e9466cb429bdfebaf1b8792ef0eec truth.txt

The long string is the hash of the blob object we created, and truth.txt is a fitting filename.

What about 100644? Looking at man git update-index we see that…

--cacheinfo <mode>,<object>,<path>
--cacheinfo <mode> <object> <path>

Directly insert the specified info into the index.
For backward compatibility, you can also give these
three arguments as three separate parameters, but
new users are encouraged to use a single-parameter
form.

Although the documentation suggests the single-parameter form for new users, we have used the separate parameter form here for clarity. The value 100644 is the mode of the file. In git, this mode is a 6-digit octal number that defines the file type and permissions, similar to standard UNIX file modes but restricted to a few specific values.
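Since the mode follows the UNIX convention, Python's stat module can take it apart for us. The sketch below is only an illustration: describe_mode is a made-up helper, and 100755 and 120000 are included as two other modes git is known to use for blobs.

```python
import stat


def describe_mode(mode_text: str):
    """Split an octal mode string into (ls-style string, type bits, permission bits)."""
    mode = int(mode_text, 8)
    return stat.filemode(mode), oct(stat.S_IFMT(mode)), oct(stat.S_IMODE(mode))


for mode_text in ("100644", "100755", "120000"):
    print(mode_text, *describe_mode(mode_text))
```

The leading digits carry the file type (regular file or symbolic link) and the trailing three carry the permissions, just like the first column of ls -l.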

But crucially, for blobs in git, we only have 3 modes available[6]: 100644 (a normal file), 100755 (an executable file), and 120000 (a symbolic link).

Let's take a look at the repo structure again.

$ tree .git
.git/
├── HEAD
├── index
├── objects
│   └── 66
│       └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
└── refs
    └── heads

4 directories, 3 files

Here we notice that there now is an index file. So all we did was add some index file? …well, slow down. Let's do a sanity check with git status.

$ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   truth.txt

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    deleted:    truth.txt

Aha!

So we did something, we staged (added) the truth.txt file… but we also have unstaged changes saying we deleted it?

Well, yes. We have told git that the blob object we created belongs at truth.txt, and it has the actual blob in its object database… but it can't find that file in the working tree. This makes sense, since we just created the blob from stdin and never wrote a file to disk.

Because the file exists in the index but is missing from the actual working directory, git correctly assumes it has been removed.

Hmm, while we are at it, let's take a look at .git/index, just to see what that is about.

$ cat .git/index
DIRCf^gNfB  truth.txtqblr33il%

This confirms it is not a text file. We can verify its type using the file command.

$ file .git/index
.git/index: Git index, version 2, 1 entries

The output identifies it as a Git index, version 2 with 1 entries. This single entry corresponds to the file we just added. To see what this actually looks like on disk, we can inspect the raw binary data.

$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 01 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 c1 b9 71 62  b4 ab c1 c2 6c 72 33 33  |txt...qb....lr33|
92 69 92 99 ac 6c 8e ff                           |.i...l..|

(I removed the offset from the output)

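The first 4 bytes, DIRC (short for “Directory Cache”), are the “magic number”, a constant value used to identify the file format. The next 4 bytes, 00 00 00 02, are the version number, stored as a 32-bit big-endian integer, which is why the 02 is preceded by three zero bytes[7]. The 4 bytes after that, 00 00 00 01, are the number of entries. The rest of the file contains the entry data, including the mode (81 a4 is 100644 in octal), the blob hash, and the filename truth.txt, followed by a 20-byte checksum of everything before it.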
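We can let Python's struct module do the reading for us. The following is a rough sketch that parses the bytes from the hexdump above, assuming the version 2 layout (a 12-byte header, then per entry ten 32-bit fields, a 20-byte hash, and a 16-bit flags field whose low 12 bits hold the name length). The variable names are ours, not git's.

```python
import struct

# The bytes from the hexdump above, offsets and ASCII column removed.
INDEX_HEX = """
44 49 52 43 00 00 00 02  00 00 00 01 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e
74 78 74 00 c1 b9 71 62  b4 ab c1 c2 6c 72 33 33
92 69 92 99 ac 6c 8e ff
"""
data = bytes.fromhex("".join(INDEX_HEX.split()))

# Header: 4-byte signature, then two big-endian 32-bit integers.
signature, version, entry_count = struct.unpack(">4sII", data[:12])

# First entry: ctime, mtime (2 ints each), dev, ino, mode, uid, gid, size,
# then the 20-byte object id and a 16-bit flags field.
fields = struct.unpack(">10I20sH", data[12:74])
mode, object_id, flags = fields[6], fields[10].hex(), fields[11]
name_length = flags & 0x0FFF
name = data[74:74 + name_length].decode()

print(signature, version, entry_count)
print(oct(mode), object_id, name)
```

All the timestamp and stat fields are zero here because we never had a real file for git to stat.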

If we were to add another file, likely patterns would emerge in the binary structure. Investigating how the index looks in a larger project is an excellent way to learn more, so I have included it in the Exercises section below.

Tree Objects

Back on track: we've now made the index thingy, and we need to turn it into a tree object, which we do with the write-tree subcommand.

$ git write-tree
a6325f064bac723691f20c0b1ed2bea82a1728fd

git status does not seem to have changed, but if we check what's in the repo's file structure, we'll see something did change.

$ tree .git
.git
├── HEAD
├── index
├── objects
│   ├── 66
│   │   └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
│   └── a6
│       └── 325f064bac723691f20c0b1ed2bea82a1728fd
└── refs
    └── heads

5 directories, 4 files

We notice the a6325f064bac723691f20c0b1ed2bea82a1728fd sha-1 hash refers to a new object. Interesting, wonder what this is.

To thoroughly understand this object, we will inspect it manually by examining the raw file. This is the “hard way,” but it offers the most insight into the underlying data structure.

$ file .git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd
.git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd: zlib compressed data

Okay, that's zlib compressed data. We know that, so let's use that same hack to figure out what is inside of this object.

$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/a6/325f064bac723691f20c0b1ed2bea82a1728fd | gzip -dc
tree 37100644 truth.txtf^gNfB
gzip: stdin: unexpected end of file

This object is not a blob, but a tree, one of the three primary git object types. The number 37100644 initially appears confusing, but it is actually two separate values that only look concatenated because the NUL byte between them is not printed: the size (37 bytes) and the file mode (100644). Following this are the filename and a binary sequence f^gNfB.

The sequence f^gNfB matches data we previously observed in the hexdump of the .git/index file. Identifying its specific purpose is left as an exercise for the reader.

Let's look at the contents the canonical way.

$ git cat-file -p a6325f064bac723691f20c0b1ed2bea82a1728fd
100644 blob 665e95f1674e9466cb429bdfebaf1b8792ef0eec    truth.txt
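The listing above can be turned back into the raw tree bytes. Here is a minimal Python sketch, assuming each entry is stored as the mode, a space, the name, a NUL byte, and then the object id as 20 raw bytes. The helper names tree_body and git_object_id are invented for the example, and the entries are expected to be sorted by name already.

```python
import hashlib


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex object id) tuples into a tree body."""
    out = b""
    for mode, name, object_id in entries:
        out += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(object_id)
    return out


def git_object_id(obj_type: str, body: bytes) -> str:
    """sha-1 over '<type> <size>' + NUL + body."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


body = tree_body([("100644", "truth.txt", "665e95f1674e9466cb429bdfebaf1b8792ef0eec")])
print(len(body))
print(git_object_id("tree", body))
```

The length comes out as 37, the same size we saw in the header. If the layout assumption holds, the second line should be the id that write-tree printed.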

An Experiment with Tree Objects

An important concept to explore is how a tree object handles multiple files. To demonstrate this, we will create another blob.

$ echo "AMOGUS" | git hash-object --stdin -w
f58617716d903fb842b5606a335ff1406b9a21d3

And add it to our index file.

$ git update-index --add --cacheinfo 100644 f58617716d903fb842b5606a335ff1406b9a21d3 amogus.txt

Now, let's look at the repository. We have 3 objects in our objects database.

$ tree .git
.git
├── HEAD
├── index
├── objects
│   ├── 66
│   │   └── 5e95f1674e9466cb429bdfebaf1b8792ef0eec
│   ├── a6
│   │   └── 325f064bac723691f20c0b1ed2bea82a1728fd
│   └── f5
│       └── 8617716d903fb842b5606a335ff1406b9a21d3
└── refs
    └── heads

6 directories, 5 files

Let's dump the index binary.


$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 02 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 f5 86 17 71  6d 90 3f b8 42 b5 60 6a  |.......qm.?.B.`j|
33 5f f1 40 6b 9a 21 d3  00 0a 61 6d 6f 67 75 73  |3_.@k.!...amogus|
2e 74 78 74 00 00 00 00  00 00 00 00 00 00 00 00  |.txt............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 54 52 45 45  00 00 00 06 00 2d 31 20  |txt.TREE.....-1 |
30 0a 94 c6 2b 91 4b 84  9d 9a 2e 3c 20 e4 3e 93  |0...+.K....< .>.|
1b 69 3f 19 3d bd                                 |.i?.=.|

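Nothing too spectacular here: we see both amogus.txt and truth.txt. We also see TREE near the end of the file. That is not the tree object itself but the index's cached-tree extension, which remembers the tree written by our earlier write-tree. Because we changed the index since then, the cached entry is marked invalid, which is the -1 we see after it.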

However, what happens when we run write-tree?

$ git write-tree
aee76412ed220742aeaf02ca1c50519bcea013e1

Let's dump the index again.

$ hexdump -C .git/index
44 49 52 43 00 00 00 02  00 00 00 02 00 00 00 00  |DIRC............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 f5 86 17 71  6d 90 3f b8 42 b5 60 6a  |.......qm.?.B.`j|
33 5f f1 40 6b 9a 21 d3  00 0a 61 6d 6f 67 75 73  |3_.@k.!...amogus|
2e 74 78 74 00 00 00 00  00 00 00 00 00 00 00 00  |.txt............|
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00 00 00 00 00 00 81 a4  00 00 00 00 00 00 00 00  |................|
00 00 00 00 66 5e 95 f1  67 4e 94 66 cb 42 9b df  |....f^..gN.f.B..|
eb af 1b 87 92 ef 0e ec  00 09 74 72 75 74 68 2e  |..........truth.|
74 78 74 00 54 52 45 45  00 00 00 19 00 32 20 30  |txt.TREE.....2 0|
0a ae e7 64 12 ed 22 07  42 ae af 02 ca 1c 50 51  |...d..".B.....PQ|
9b ce a0 13 e1 07 2b 99  c5 b6 0e 3d 53 33 c9 21  |......+....=S3.!|
dd b1 75 41 41 84 b1 d8  ec                       |..uAA....|

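The data following the TREE identifier has grown: the -1 has been replaced by an entry count of 2, followed by the 20 raw bytes of the hash aee76412ed220742aeaf02ca1c50519bcea013e1 that write-tree just printed. We will now inspect the contents of this newly written tree.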

$ git cat-file -p aee76412ed220742aeaf02ca1c50519bcea013e1
100644 blob f58617716d903fb842b5606a335ff1406b9a21d3    amogus.txt
100644 blob 665e95f1674e9466cb429bdfebaf1b8792ef0eec    truth.txt

Now we seem to have two files inside of the tree.
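Going back to the two index dumps for a moment, the bytes after TREE and its 4-byte length can be decoded with a few lines of Python. This is a sketch under the assumption that a cached-tree entry is a NUL-terminated path, an ASCII entry count, a space, an ASCII subtree count, a newline, and (only when the count is not negative) a 20-byte object id. The function name is made up.

```python
def parse_cache_tree_entry(data: bytes):
    """Decode one cached-tree entry into (path, entries, subtrees, object id)."""
    path, _, rest = data.partition(b"\x00")
    counts, _, rest = rest.partition(b"\n")
    entry_count, subtree_count = (int(n) for n in counts.decode().split(" "))
    # A negative entry count means the cached tree is invalid and has no id.
    object_id = rest[:20].hex() if entry_count >= 0 else None
    return path.decode(), entry_count, subtree_count, object_id


# Bytes copied from the two hexdumps, right after "TREE" and its length field.
before_write_tree = bytes.fromhex("002d3120300a")
after_write_tree = bytes.fromhex("003220300a" "aee76412ed220742aeaf02ca1c50519bcea013e1")

print(parse_cache_tree_entry(before_write_tree))
print(parse_cache_tree_entry(after_write_tree))
```

The empty path stands for the root of the repository. After write-tree, the entry covers 2 files and carries the id of the tree we just inspected.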

Time to Commit

So now, it's time to create a commit. Above, when we ran write-tree, we created the tree object aee76412ed220742aeaf02ca1c50519bcea013e1. This includes both of the blobs we made.

So how do we commit that? It's actually really simple, we just use git commit-tree.

$ git commit-tree aee76412ed220742aeaf02ca1c50519bcea013e1 -m "initial commit"
87a1aa833dccca5ea503e9a7ff81c51fe82c85c6

If we now look at the .git repository, we will see a new object, 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 created. We can verify the type of this new object.

$ git cat-file -t 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
commit

Unsurprisingly, it's a commit object. If we look at what is inside, we see something that may look familiar to people who regularly use git.

$ git cat-file -p 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

initial commit
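A commit is plain text, so we can sketch how one is put together. The Python below is an illustration under a few assumptions: the header lines are joined with newlines, a blank line separates them from the message, the message ends with a newline, and the id is computed like for blobs but with the type commit. The function names are hypothetical, and we have not checked that the printed id matches the one above.

```python
import hashlib


def commit_body(tree: str, author: str, committer: str, message: str, parents=()) -> bytes:
    """Build the text of a commit object (hypothetical helper)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]  # none for a first commit
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


def git_object_id(obj_type: str, body: bytes) -> str:
    """sha-1 over '<type> <size>' + NUL + body."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


who = "Christina Sørensen <christina@cafkafk.com> 1715792150 +0200"
body = commit_body("aee76412ed220742aeaf02ca1c50519bcea013e1", who, who, "initial commit")
commit_id = git_object_id("commit", body)

print(body.decode("utf-8"), end="")
print(commit_id)
```

Because the name, email, and timestamp are part of the hashed bytes, your own commit will get a different id than the one in this article, even with identical files.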

A fun thing to do now is to try and run git log.

$ git log
fatal: your current branch appears to be broken

Seems like our branch is broken, huh? Let's fix that. Let's quickly remind ourselves of the contents of .git/HEAD.

$ cat .git/HEAD
ref: refs/

This doesn't refer to anything. We can easily fix this: let's make a new branch. But what git subcommand will we use this time? None, actually. As it turns out, branches are just files in .git/refs/heads/ that contain the sha-1 hash of some commit. The name of the file becomes the name of the branch.

$ echo 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 > .git/refs/heads/main

Now we just need to switch to that branch. Instead of doing git switch main we can just change the reference in the repository directory .git.

$ echo "ref: refs/heads/main" > .git/HEAD

If you're following along, and you have some PS1 git branch feature, you may have noticed something incredible just after running that command.

First, if we run git status now, we see that there no longer are any changes to be committed.

$ git status
On branch main
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    deleted:    amogus.txt
    deleted:    truth.txt

no changes added to commit (use "git add" and/or "git commit -a")

Also, we can now run git log!

$ git log
commit 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6 (HEAD -> main)
Author: Christina Sørensen <christina@cafkafk.com>
Date:   Wed May 15 18:55:50 2024 +0200

    initial commit
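The Date line is just the number 1715792150 and the offset +0200 from the commit object, rendered for humans. A small Python sketch of that conversion follows. The function name format_git_time is ours, and the format string only approximates git's default date style (for instance, it zero-pads single-digit days).

```python
from datetime import datetime, timedelta, timezone


def format_git_time(timestamp: int, offset: str) -> str:
    """Render seconds since the epoch plus a +HHMM/-HHMM offset."""
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    moment = datetime.fromtimestamp(timestamp, timezone(sign * delta))
    return moment.strftime("%a %b %d %H:%M:%S %Y ") + offset


print(format_git_time(1715792150, "+0200"))
```

This prints the same date and time as the log output above.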

Something else that's cool is that we can run git log --format=raw and see output similar to what we got from git cat-file -p on the commit object's sha-1.

$ git log --format=raw
commit 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

    initial commit

$ git cat-file -p 87a1aa833dccca5ea503e9a7ff81c51fe82c85c6
tree aee76412ed220742aeaf02ca1c50519bcea013e1
author Christina Sørensen <christina@cafkafk.com> 1715792150 +0200
committer Christina Sørensen <christina@cafkafk.com> 1715792150 +0200

initial commit

But what about those files that are deleted? We can solve that with git checkout like this.

$ git checkout HEAD -- amogus.txt truth.txt

Now if we run git status we get.

$ git status
On branch main
nothing to commit, working tree clean

With this, we have successfully created a commit from scratch.
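To tie HEAD, branches, and commits together, here is a rough Python sketch of how the lookup from HEAD to a commit can be followed by hand. It is deliberately naive: resolve_head is a made-up helper that only understands the plain files we created in this article, and it ignores other places git may store refs. It replays both states of our HEAD file in a temporary directory.

```python
import tempfile
from pathlib import Path

COMMIT = "87a1aa833dccca5ea503e9a7ff81c51fe82c85c6"


def resolve_head(git_dir: Path):
    """Return (ref name, commit id) by reading HEAD and the file it points to."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # HEAD holds a commit id directly
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    # A branch is just a file containing a commit id.
    commit = ref_file.read_text().strip() if ref_file.is_file() else None
    return ref, commit


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)

    # The state in which git log complained about a broken branch.
    (git_dir / "HEAD").write_text("ref: refs/\n")
    broken = resolve_head(git_dir)

    # The state after creating the branch file and pointing HEAD at it.
    (git_dir / "refs" / "heads" / "main").write_text(COMMIT + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    fixed = resolve_head(git_dir)

print(broken)
print(fixed)
```

In the first state the chain ends at a directory, so there is no commit to find. In the second, HEAD leads to main, and main leads to our commit.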

Conclusions

Starting from an empty directory, we have successfully initialized a repository, staged files, created a branch, and committed our work, all without using git init, git add, or git commit.

Is this efficient? Certainly not.

However, by bypassing the porcelain commands, we have exposed the elegant simplicity of git’s plumbing. Understanding how objects, trees, and references interact transforms the system from a black box into a logical machine. This deeper understanding will help you internalize the tool, making complex operations intuitive rather than terrifying.

There are still open questions: how does a pull work? How do we manually rebase? That is a topic for another day — or perhaps a therapy session. For now, we will leave them as an exercise for the curious reader.

Exercises

Footnotes


  1. We don't have to put our git repo in .git. We could use the GIT_DIR environment variable, or the --git-dir=<path> flag.

  2. The zlib wrapper (RFC 1950), unlike the gzip wrapper (RFC 1952), doesn't store the file name and other file system information, which is fine, considering that git manages this elsewhere.

  3. From this unix.stackexchange answer. Here, we concatenate the gzip magic number and compression method, and concatenate (the actual reason for cat existing) this with the file. We then pipe it into gzip, which can now understand and decompress it. Still, we didn't finish the file with the 8-byte footer, so gzip gets confused, but that doesn't matter, we get to see the data regardless.

  4. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

  5. If you're interested in finding out how the hash is generated, start here.

  6. Read more here: https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects

  7. The mortal enemy of C has many names: empty bytes, null bytes, nop bytes.

  8. One approach could be to use the hack, with the additional padding at the end, to extract the file. Then, after changing the number, compress it again and remove the prepended and appended gzip magic numbers.