A Journey Into GIT Inner Workings

Things you should know as an engineer!

Sanjit Mohanty
FAUN — Developer Community 🐾

--

As engineers, we often utilise Git as an essential tool for version control, but have you ever wondered what goes on behind the scenes?

In this blog, we’ll dive into Git’s architecture and learn about its fundamental objects and metadata, the building blocks that enable Git’s capabilities. We’ll then take a typical developer’s Git workflow and look at what happens under the hood.

Whether you’re a seasoned developer or just starting your journey, this blog will provide you with a solid foundation to navigate the intricacies of Git’s internals.

Building blocks of Git

In Git, several types of objects make up the internal structure of a repository.

In a simplified scenario, a single commit references a single tree object, which in turn references blobs. In reality, a Git repository usually has many commits, each with its own tree object that references blobs and other sub-trees. Tags, both lightweight and annotated, would be represented as separate boxes that reference commits.

Git repository with multiple commits referencing each other

Let’s understand each of these Git building blocks:

GIT building blocks
  1. Commits: The topmost box represents a commit object, which records a snapshot of the entire project at a specific point in time. A commit contains metadata such as:
  • Commit author
  • Commit timestamp
  • Commit message
  • and a reference to a tree object representing the project’s directory structure at that particular commit.

Each commit is identified by its SHA-1 hash, which is based on its content, including the referenced tree object and parent commit(s).

2. Trees: The next-level box represents a tree object, which captures the project’s directory structure at a specific commit. A tree object contains references to blobs (file contents) and to other tree objects (subdirectories), which are not shown in the simplified diagram above. Each tree is identified by its SHA-1 hash.

3. Blobs (Binary Large Objects): The bottom boxes represent blobs, which are the contents of files in the repository. They are immutable and store the file data as binary objects. Each blob is uniquely identified by its SHA-1 hash.

4. Tags: Tags are used to create human-readable names for specific commits. They serve as references to particular points in the commit history. Tags can be lightweight (a simple reference to a commit, with no object of its own) or annotated (a tag object that carries additional metadata like the tagger, date, and a message).

These objects, along with the various data structures used by Git, enable the version control capabilities and efficient storage of project history.
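To make the idea of content-based identifiers concrete, here is a minimal Python sketch of how an object ID can be derived. It assumes the default SHA-1 object format; the helper names git_object_id and tree_body are made up for this illustration, and the tree serialisation is simplified (it sorts entries by plain name, whereas Git applies a special rule for directories).

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 ID Git would assign to an object of this type and body."""
    # Git hashes a small header ("<type> <size>" plus a NUL byte) followed by the body.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialise tree entries given as (mode, name, hex object ID)."""
    body = b""
    # Simplification: byte-wise sort by name, which is only exact for plain files.
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        # Each entry is "<mode> <name>", a NUL byte, then the 20 raw bytes of the ID.
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return body


blob_id = git_object_id("blob", b"hello world\n")
tree_id = git_object_id("tree", tree_body([("100644", "hello.txt", blob_id)]))
print("blob:", blob_id)  # same ID that `git hash-object` reports for this content
print("tree:", tree_id)
```

Because the ID depends only on the type and the bytes, two files with identical content share one blob, and a tree's ID changes whenever anything beneath it changes.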

GIT Architecture

The internal architecture of Git revolves around three main components —

High level git user workflows
  1. Object Store: Git stores all its data, including files, commits, and metadata, in a content-addressable object store. This store is a simple key-value database where each object is identified by its SHA-1 hash, which is computed based on the content of the object. Objects can be of four types: blobs (file contents), trees (directory structure), commits (snapshots of the project), and tags (references to specific commits). We’ve learnt about each of these in the previous section.
  2. Index aka Staging area: The index, also known as the staging area, acts as a bridge between the working directory and the object store. It holds a snapshot of the content that will go into the next commit. When you change files in the working directory, the index is not updated automatically; you stage the changes with git add, and git commit then records what the index contains.
  3. Reflog: Git maintains reference logs (reflogs) that record each time the tip of a branch, or another reference such as HEAD, is updated in your local repository. Each entry points to the commit the reference pointed at, so you can see where a branch has been over time. The reflog is useful for undoing changes and for recovering commits that are no longer reachable from any branch.
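The object store described above can be sketched in a few lines of Python. This is only an illustration of the loose-object layout (a zlib-compressed file named after the object's SHA-1 hash); it ignores pack files, and the function names write_object and read_object are invented for the example.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Store an object in loose-object form and return its ID (the key)."""
    data = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    oid = hashlib.sha1(data).hexdigest()
    # The first two hex characters name the directory, the remaining 38 the file.
    path = objects_dir / oid[:2] / oid[2:]
    if not path.exists():  # identical content is stored only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(data))
    return oid


def read_object(objects_dir: Path, oid: str) -> tuple[str, bytes]:
    """Look an object up by ID and return (type, body)."""
    data = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = data.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(body), "corrupt object"
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "objects"  # stand-in for .git/objects
    oid = write_object(store, "blob", b"hello world\n")
    print(oid, read_object(store, oid))
```

The key is derived from the value, which is why the store never needs to overwrite an existing object.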

A typical developer GIT workflow

With this context, let’s look behind the scenes of the following typical Git developer workflow:

#1: GIT CLONE

When you run git clone to clone a remote repository, Git performs several internal operations. Here's a high-level overview of what happens behind the scenes:

  1. Requesting and receiving data: Git contacts the remote repository specified in the clone command and requests the necessary data to create a local copy. This typically involves network communication using protocols like HTTP, SSH, or Git’s native protocol.
  2. Creating the local repository: Git creates a new directory on your local machine, which will serve as the root of the cloned repository.
  3. Object transfer: Git transfers all the objects (commits, trees, blobs, and tags) from the remote repository to your local repository. It uses a compressed and efficient transfer protocol to minimise network overhead.
  4. Building the object database: Once the objects are received, Git stores them in its content-addressable object store, which is located in the .git/objects directory of the local repository. Each object is identified by its unique SHA-1 hash, computed based on its content.
  5. Creating branches and references: Git creates a remote-tracking branch (such as origin/main) for every branch on the remote, plus one local branch that tracks the remote’s default branch. It also sets up references to tags. These references are stored under the .git/refs directory or, in packed form, in the .git/packed-refs file.
  6. Checking out files: Git populates your working directory with the latest version of the files from the cloned repository. The working directory contains the actual files you can edit and work with.
  7. Setting up the staging area: Git initialises the index (also known as the staging area) to match the state of the checked-out branch. The index serves as a snapshot of the project’s current state and tracks changes before committing them.

After these steps, your local repository is a fully functional clone of the remote repository, with its complete history, branches, and files. You can now work with the repository locally, make changes, commit them, and interact with the remote repository using Git commands.
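After a clone, many of the references from step 5 typically sit in the .git/packed-refs file rather than as individual files. The Python sketch below parses that text format; the sample content and its object IDs are made up for illustration, and parse_packed_refs is a hypothetical helper, not part of Git.

```python
def parse_packed_refs(text: str) -> dict[str, str]:
    """Map each reference name in packed-refs content to its object ID."""
    refs: dict[str, str] = {}
    for line in text.splitlines():
        # Skip blank lines, the "# pack-refs with: ..." header, and "^" lines,
        # which record the commit that the preceding annotated tag points to.
        if not line or line.startswith(("#", "^")):
            continue
        oid, name = line.split(" ", 1)
        refs[name] = oid
    return refs


# Illustrative content with made-up object IDs.
SAMPLE = """\
# pack-refs with: peeled fully-peeled sorted
2222222222222222222222222222222222222222 refs/remotes/origin/feature
1111111111111111111111111111111111111111 refs/remotes/origin/main
3333333333333333333333333333333333333333 refs/tags/v1.0
^4444444444444444444444444444444444444444
"""

refs = parse_packed_refs(SAMPLE)
remote_branches = sorted(n for n in refs if n.startswith("refs/remotes/origin/"))
print(remote_branches)
```

In a real repository you could pass the text of .git/packed-refs to the same function, keeping in mind that loose files under .git/refs take precedence over packed entries.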

#2: GIT CHECKOUT

When you run git checkout <branch_name>, Git switches to that branch; adding the -b flag creates the branch first and then switches to it. The internal process involves the following:

  1. Git updates the HEAD pointer, which points to the currently checked-out branch.
  2. Git updates the index (staging area) and the working directory to match the state of the newly checked-out branch. This means that the files in your working directory are replaced with the versions from the newly checked-out branch.
  3. Git removes tracked files that are not present in the newly checked-out branch. Untracked files are left alone.

Overall, git checkout internally updates various pointers and the state of the index and working directory to reflect the desired branch. It enables you to navigate the project's history, switch between branches, and restore files to specific states.
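The HEAD update in step 1 is small enough to sketch directly. The Python example below builds a throwaway directory that mimics the relevant part of .git and moves HEAD between two branches. It covers only the pointer update, not the index or working directory, it assumes branches exist as loose files under refs/heads, and the function names are invented for this illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def current_branch(git_dir: Path) -> Optional[str]:
    """Return the checked-out branch name, or None if HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    prefix = "ref: refs/heads/"
    # A symbolic HEAD names a branch; otherwise HEAD holds a raw commit ID.
    return head[len(prefix):] if head.startswith(prefix) else None


def switch_head(git_dir: Path, branch: str) -> None:
    """Point HEAD at an existing branch (the pointer part of a checkout)."""
    if not (git_dir / "refs" / "heads" / branch).is_file():
        raise ValueError(f"no such branch: {branch}")
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{branch}\n")


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)  # stand-in for a repository's .git directory
    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True)
    (heads / "main").write_text("1" * 40 + "\n")     # made-up commit IDs
    (heads / "feature").write_text("2" * 40 + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")

    switch_head(git_dir, "feature")
    print(current_branch(git_dir))  # prints "feature"
```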

#3: GIT ADD

When you execute the command git add in Git, it performs the following actions —

  1. Staging Changes: The git add command is used to stage changes, which means it prepares the modifications in your working directory to be included in the next commit.
  2. Updating the Index: The changes you want to commit are added to the Git index, also known as the staging area. The index serves as a snapshot of the files that will be included in the next commit.
  3. Tracking File Modifications: When you run git add on a modified file, Git stores the file’s current content as a blob in the object store and updates the file’s index entry to point at it. This ensures the subsequent commit includes the content exactly as it was when you staged it.
  4. Adding New Files: If you use git add on a new file that Git hasn't tracked before, Git stores its content as a blob and creates an index entry for it, staging it for the next commit. From then on, Git tracks the file's changes.
  5. Removing Deleted Files: If you delete a file from your working directory and then use git add, Git detects the deletion and stages the file for removal in the next commit. This ensures that the file's deletion is recorded in the repository.
  6. Ignoring Files: Git respects the rules defined in the .gitignore file, which lists patterns of files and directories that should be ignored. When you run git add, Git respects these rules and excludes the ignored files from being staged.
  7. Interacting with Git Status: After running git add, if you use the git status command, you'll see the changes you added in the "Changes to be committed" section. This indicates that the changes are staged and ready to be included in the next commit.

By using git add, you selectively stage changes and prepare them for the next commit. This allows you to carefully control which modifications are included in the commit, enabling a more focused and organised approach to version control.
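The staging behaviour above can be modelled with a toy index. In this Python sketch the index is just a dictionary from path to blob ID, which is a deliberate simplification: the real index is a binary file that also stores file modes and stat data. The names stage and staged_changes are hypothetical.

```python
import hashlib
from typing import Optional


def blob_id(content: bytes) -> str:
    """SHA-1 ID of a blob holding this content."""
    header = f"blob {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def stage(index: dict[str, str], path: str, content: Optional[bytes]) -> None:
    """Stage a path: record its blob ID, or drop the entry if the file was deleted."""
    if content is None:
        index.pop(path, None)
    else:
        index[path] = blob_id(content)


def staged_changes(head: dict[str, str], index: dict[str, str]) -> dict[str, str]:
    """Compare the index with the last commit, like "Changes to be committed"."""
    changes = {}
    for path in sorted(head.keys() | index.keys()):
        if path not in head:
            changes[path] = "new file"
        elif path not in index:
            changes[path] = "deleted"
        elif head[path] != index[path]:
            changes[path] = "modified"
    return changes


head = {"README.md": blob_id(b"v1\n"), "old.txt": blob_id(b"bye\n")}
index = dict(head)  # right after a commit, the index matches the commit

stage(index, "README.md", b"v2\n")      # modified file
stage(index, "new.py", b"print(1)\n")   # previously untracked file
stage(index, "old.txt", None)           # file deleted from the working directory
print(staged_changes(head, index))
```

Anything you change but do not stage never reaches the dictionary, which mirrors how unstaged edits stay out of the next commit.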

#4: GIT COMMIT

When you execute the command git commit, several actions occur in the Git repository. Here's an overview of what happens when you commit changes:

  1. Creating a Commit Object: When you run git commit, Git creates a new commit object that represents the snapshot of the project at that particular point in time. The commit object contains metadata, including the author, committer, timestamp, and commit message.
  2. Capturing the Current State: The commit object references the current state of the project’s directory structure by pointing to a tree object. The tree object represents the state of the project’s files and directories at the time of the commit.
  3. Creating Parent-Child Relationship: Each commit except the initial commit has one or more parent commits. The commit object stores references to the parent commit(s), forming a directed acyclic graph (a simple linked list when there are no merges) that represents the history of the project. This allows Git to track the order and relationship between commits.
  4. Generating a Unique SHA-1 Hash: Git generates a unique SHA-1 hash for the commit object based on its content and metadata. This hash serves as a unique identifier for the commit and ensures data integrity within the repository.
  5. Updating the Branch Reference: If you are currently on a branch, Git updates the branch reference to point to the newly created commit. This moves the branch pointer forward to the latest commit, effectively marking it as the new “tip” of the branch.
  6. Persisting the Objects: Git writes the commit object and any new tree objects to the object database. The blob objects holding the file contents were already written when the files were staged with git add. Objects are stored under their SHA-1 hashes, which makes them immutable and makes any corruption detectable.

By committing changes, you create a new snapshot of the project’s state, preserving the history and allowing for easy tracking of changes. The commit becomes a permanent part of the repository’s object database, forming the foundation for further version control operations, such as branching, merging, and reverting to previous states.
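As a rough illustration of steps 1 to 5, the Python sketch below serialises two commit objects, hashes them, and moves a branch pointer. The author identity, timestamps, and the branches dictionary are assumptions made for the example; both commits reuse the well-known ID of the empty tree so the code stays self-contained.

```python
import hashlib


def commit_body(tree: str, parents: list[str], ident: str, timestamp: int, message: str) -> bytes:
    """Serialise a commit: tree, parents, author, committer, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    who = f"{ident} {timestamp} +0000"
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()


def commit_id(body: bytes) -> str:
    """SHA-1 ID of a commit object with this body."""
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries
IDENT = "Example Dev <dev@example.com>"  # hypothetical identity

branches: dict[str, str] = {}  # toy stand-in for .git/refs/heads

first = commit_id(commit_body(EMPTY_TREE, [], IDENT, 1700000000, "Initial commit"))
branches["main"] = first  # the first commit has no parent

second = commit_id(commit_body(EMPTY_TREE, [first], IDENT, 1700000060, "Second commit"))
branches["main"] = second  # the branch tip moves forward to the new commit

print(first)
print(second)
```

Since the parent's ID is part of the child's content, changing any earlier commit would change the ID of every commit built on top of it.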

#5: GIT PUSH

When you run git push in Git, it transfers your local commits and updates to a remote repository. The internal processes involved in git push are as follows —

  1. Authenticating with the remote repository: The remote server verifies your credentials to ensure you have the necessary permissions to push changes. Authentication is typically handled by the transport, for example SSH keys or HTTPS credentials.
  2. Determining the changes to push: Git examines the commits on the current branch in your local repository and identifies the new commits that are not yet present in the remote repository. It determines the specific changes required to update the remote branch.
  3. Transmitting the data: Git compresses the new commits, along with their associated objects (trees, blobs), into a pack file. This pack file contains a delta representation of the changes, optimising the transfer process. The pack file is then sent to the remote repository over the network.
  4. Updating the remote branch: Upon receiving the pack file, the remote repository unpacks and integrates the new commits into the appropriate branch. The branch pointer is moved to the latest commit, reflecting the updates made from your local repository.
  5. Handling rejected pushes: If other contributors have pushed commits to the remote branch since your last fetch or pull, your push is not a fast-forward and the remote rejects it. You’ll need to fetch the new commits and merge or rebase your work on top of them, resolving any conflicts locally, before pushing again.
  6. Updating remote tracking branches: Once the push is successful, Git updates the remote tracking branch associated with your local branch to reflect the new state of the remote branch. This allows your local repository to stay in sync with the remote repository.

By executing git push, you publish your local commits and changes, making them available to other collaborators working on the same remote repository. It allows for collaborative development, enabling the synchronisation of work across multiple team members.
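Steps 2 and 5 can be sketched as a graph walk. The Python example below models history as a dictionary from commit to parents and uses single letters in place of real SHA-1 IDs. It is a simplified model of the idea rather than Git's actual negotiation protocol, and reachable and plan_push are names invented for the illustration.

```python
from typing import Optional


def reachable(parents: dict[str, list[str]], tip: Optional[str]) -> set[str]:
    """Return every commit reachable from tip by following parent links."""
    seen: set[str] = set()
    stack = [tip] if tip else []
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def plan_push(parents: dict[str, list[str]], local_tip: str, remote_tip: Optional[str]) -> set[str]:
    """Return the commits to send, or raise if the update is not a fast-forward."""
    local = reachable(parents, local_tip)
    # A fast-forward requires the remote tip to be an ancestor of the local tip.
    if remote_tip is not None and remote_tip not in local:
        raise ValueError("rejected: remote has commits you don't have (non-fast-forward)")
    return local - reachable(parents, remote_tip)


# Local history: A <- B <- C, with the remote branch still at A.
history = {"A": [], "B": ["A"], "C": ["B"]}
print(sorted(plan_push(history, "C", "A")))  # prints ['B', 'C']
```

If the remote tip were a commit that is missing from the local history, plan_push would raise, which corresponds to the case where you need to fetch and integrate before pushing again.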

Parting Thoughts

Git is a powerful and versatile version control system that offers numerous advantages for developers and teams working on software projects. As you embark on your journey with Git, continually expanding your knowledge and mastering Git’s capabilities will empower you to take full advantage of this powerful version control system.


Engineering Manager, Broadcom | Views expressed on my blogs are solely mine; not that of present/past employers. Support my work https://ko-fi.com/sanjitmohanty