Running garbage collection manually

When you use Git regularly, you might notice that some commands occasionally trigger garbage collection and pack loose objects into a pack file (part of Git's object storage). You can also trigger garbage collection and packing manually with the git gc command, which is useful if you have a lot of loose objects.

A loose object can be, for example, a blob, a tree, or a commit. As we saw in Chapter 1, Navigating Git, blob, tree, and commit objects are added to Git's database when we add files and create commits. These objects are first stored as loose objects, each as a single file inside the .git/objects folder. Eventually, or on manual request, Git packs the loose objects into pack files, which can reduce disk usage. Loose objects accumulate after many files are added to Git, for example, when starting a new project, or after frequent adds and commits.

Running garbage collection makes sure that loose objects are packed and that objects not referred to by any reference or other object are deleted. The latter is useful when you have deleted some branches or commits and want to make sure the objects they referenced are deleted as well.
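To make the difference between loose and packed objects concrete, here is a minimal sketch that can be run outside the recipe. It assumes a throwaway repository created with mktemp; the file name hello.txt and the identity "Example User" are made up for illustration and are not part of the recipe's repository.

```shell
#!/bin/sh
# Sketch: watch loose objects appear and then get packed by git gc.
# The repository, file name, and identity are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > hello.txt
git add hello.txt
git commit -q -m "Add hello.txt"

# One commit of one file creates a blob, a tree, and a commit object.
# Each is stored as a single file under .git/objects/.
loose_before=$(git count-objects | cut -d' ' -f1)
echo "Loose objects before gc: $loose_before"
find .git/objects -type f ! -path '*/pack/*' ! -path '*/info/*'

# Pack the loose objects manually.
git gc -q

loose_after=$(git count-objects | cut -d' ' -f1)
packs=$(find .git/objects/pack -name '*.pack' | wc -l | tr -d ' ')
echo "Loose objects after gc: $loose_after, pack files: $packs"
```

After git gc, the per-object files are gone and the same objects live in a single pack file under .git/objects/pack.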

Let's see how we can trigger garbage collection and remove some objects from the database.

Getting ready

First, we need a repository to perform the garbage collection on. We'll use the same repository as the previous example:

$ git clone https://github.com/dvaske/hello_world_flow_model.git 
$ cd hello_world_flow_model
$ git checkout develop
$ git reset --hard origin/develop

How to do it...

  1. First, we'll check the repository for loose objects; we can do this with the count-objects command:
    $ git count-objects
    51 objects, 204 kilobytes
    
  2. We'll also check for unreachable objects, that is, objects that can't be reached from any reference (tag, branch, or other object). Unreachable objects are deleted when garbage collection runs. We'll also check the size of the .git directory, using the following commands:
    $ git fsck --unreachable
    Checking object directories: 100% (256/256), done.
    $ du -sh .git
    292K  .git
    
  3. There are no unreachable objects, because we just cloned the repository and haven't worked in it yet. If we delete the origin remote, the remote branches (remotes/origin/*) are deleted as well, and we lose the references to some of the objects in the repository; fsck then reports them as unreachable, and they can be garbage collected:
    $ git remote rm origin
    $ git fsck --unreachable
    Checking object directories: 100% (256/256), done.
    unreachable commit 127c621039928c5d99e4221564091a5bf317dc27
    unreachable commit 472a3dd2fda0c15c9f7998a98f6140c4a3ce4816
    unreachable blob e26174ff5c0a3436454d0833f921943f0fc78070
    unreachable commit f336166c7812337b83f4e62c269deca8ccfa3675
    
  4. We can see that we have some unreachable objects due to the deletion of the remote. Let's try to trigger garbage collection manually:
    $ git gc
    Counting objects: 46, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (44/44), done.
    Writing objects: 100% (46/46), done.
    Total 46 (delta 18), reused 0 (delta 0)
    
  5. If we investigate the repository now, we can see the following:
    $ git count-objects
    5 objects, 20 kilobytes
    $ git fsck --unreachable
    Checking object directories: 100% (256/256), done.
    Checking objects: 100% (46/46), done.
    unreachable commit 127c621039928c5d99e4221564091a5bf317dc27
    unreachable commit 472a3dd2fda0c15c9f7998a98f6140c4a3ce4816
    unreachable blob e26174ff5c0a3436454d0833f921943f0fc78070
    unreachable commit f336166c7812337b83f4e62c269deca8ccfa3675
    $ du -sh .git
    120K  .git
    
  6. The loose object count is smaller; Git packed the objects into a pack file stored in the .git/objects/pack folder. The size of the repository is also smaller, as Git compresses and optimizes the objects in the pack file. However, there are still some unreachable objects left. This is because objects are only deleted if they are older than the age specified in the gc.pruneExpire configuration option, which defaults to two weeks (config value: 2.weeks.ago). We can override the default or configured value by running git gc with the --prune=now option:
    $ git gc --prune=now
    Counting objects: 46, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (26/26), done.
    Writing objects: 100% (46/46), done.
    Total 46 (delta 18), reused 46 (delta 18)
    
  7. Investigating the repository gives the following output:
    $ git count-objects
    0 objects, 0 kilobytes
    $ git fsck --unreachable
    Checking object directories: 100% (256/256), done.
    Checking objects: 100% (46/46), done.
    $ du -sh .git
    100K  .git
    

The unreachable objects have been deleted, there are no loose objects, and the repository size is smaller now that the objects have been deleted.
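The pruning behavior from the steps above can also be tried without cloning anything. The following is a sketch under stated assumptions: it uses a throwaway repository and creates an unreachable blob directly with git hash-object -w, rather than by removing a remote as the recipe does. The file name and identity are hypothetical. It counts unreachable objects with git fsck rather than git count-objects, since the way Git keeps not-yet-expired unreachable objects on disk can differ between Git versions.

```shell
#!/bin/sh
# Sketch: an unreachable object survives a plain git gc
# but is removed by git gc --prune=now.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "tracked" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked.txt"

# Write a blob into the object database without referencing it anywhere.
orphan=$(echo "never committed" | git hash-object -w --stdin)
echo "Orphan blob: $orphan"

# Count the lines that fsck reports as unreachable.
count_unreachable() {
    git fsck --unreachable 2>/dev/null | grep -c '^unreachable' || true
}

# Show the configured expiry, if any; when unset, Git uses its default.
git config --get gc.pruneExpire || echo "gc.pruneExpire is not set"

before=$(count_unreachable)
git gc -q
after_gc=$(count_unreachable)
git gc -q --prune=now
after_prune=$(count_unreachable)

echo "Unreachable: before=$before after_gc=$after_gc after_prune=$after_prune"
```

The blob is only seconds old, so the first git gc keeps it; --prune=now overrides the expiry and removes it.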

How it works...

The git gc command optimizes the repository by compressing file revisions and deleting objects that are no longer referred to. Such objects can be commits on an abandoned (deleted) branch, blobs from invocations of git add, commits discarded or redone with git commit --amend, or leftovers from other commands that can leave objects behind.

Objects are, by default, already compressed with zlib when they are created, and when they are moved into the pack file, Git makes sure to store only the necessary changes. If, for example, you change only a single line in a large file, storing the entire file in the pack file again would waste space. Instead, Git stores the newest version of the file as a whole in the pack file and only the delta for the older version. This is pretty smart, as you are more likely to need the newest version of the file, and Git doesn't have to do delta calculations for it.

This might seem to contradict the information from Chapter 1, Navigating Git, where we learned that Git stores snapshots and not deltas. However, remember how the snapshot is made. Git hashes the content of all the files into blobs and creates tree and commit objects, and the commit object describes the full tree state with the root tree's SHA-1 hash. How the objects are stored inside the pack files has no effect on the computation of the tree state. When you check out an earlier version or commit, Git makes sure the SHA-1 hashes match the branch, commit, or tag you requested.
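To see delta storage in practice, the following sketch commits two versions of a large file that differ in one line, runs git gc, and inspects the resulting pack with git verify-pack -v. The repository, the file name big.txt, and the identity are hypothetical. Which object Git picks as the delta base is decided by its packing heuristics, so the sketch only checks that at least one blob is stored as a delta, and that the old blob's hash is unchanged by packing.

```shell
#!/bin/sh
# Sketch: after git gc, one of two nearly identical blobs is stored as a delta,
# while object hashes stay the same.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# First version: a reasonably large text file.
awk 'BEGIN { for (i = 1; i <= 5000; i++) print "line " i }' > big.txt
git add big.txt
git commit -q -m "Add big.txt"

# Second version: the same file with a single line changed.
sed 's/^line 2500$/line 2500 changed/' big.txt > big.tmp
mv big.tmp big.txt
git add big.txt
git commit -q -m "Change one line"

old_blob=$(git rev-parse HEAD~1:big.txt)

git gc -q

# In verify-pack -v output, deltified objects carry two extra columns
# (delta depth and base object), so they have 7 fields instead of 5.
deltas=$(git verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 == "blob" && NF == 7' | wc -l | tr -d ' ')
echo "Blobs stored as deltas: $deltas"

# Re-hash the old content as read back from the pack.
rehashed=$(git cat-file -p "$old_blob" | git hash-object --no-filters --stdin)
echo "Old blob:  $old_blob"
echo "Re-hashed: $rehashed"
```

The matching hashes illustrate the point made above: the pack format changes how the bytes are stored on disk, not the snapshot that a commit describes.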
