Storing binaries elsewhere

Though binaries can't easily be diffed, nothing prevents them from being stored in a Git repository, and doing so works fine. However, if one or more binaries in a repository are updated frequently, the repository can grow quickly in size, making clones and updates slow because a lot of data needs to be transferred. By using clean and smudge filters for the binaries, it is possible to move them to another location when they are added to Git and fetch them from that location when a specific version of the file is checked out.

Getting ready

We'll use the same repository as in the previous example, but the no_binaries branch:

$ git clone https://github.com/dvaske/attributes_example.git
$ cd attributes_example
$ git checkout no_binaries

How to do it...

First, we need to set up the clean and smudge filters for the files. In this example, we are only going to run the filter on jpg files, so let's start by saying so in .gitattributes:

$ echo '*.jpg filter=binstore' > .gitattributes

Then create the filter configuration:

$ git config filter.binstore.clean "./put-bin"
$ git config filter.binstore.smudge "./get-bin"
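
Before moving on, it can be worth confirming that Git has picked up both the attribute and the filter commands. The following is a minimal sketch that does so in a scratch repository created with mktemp; the scratch directory and the photo.jpg path are assumptions made for illustration only. In the real repository, only the query commands at the end would be needed.

```shell
#!/bin/bash
# Scratch repository so the check does not touch real work (illustrative only).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# Same setup as in the recipe.
echo '*.jpg filter=binstore' > .gitattributes
git config filter.binstore.clean "./put-bin"
git config filter.binstore.smudge "./get-bin"

# Ask Git which filter applies to a path; the file does not need to exist.
attr=$(git check-attr filter -- photo.jpg)
echo "$attr"

# Read back the configured commands.
clean_cmd=$(git config --get filter.binstore.clean)
smudge_cmd=$(git config --get filter.binstore.smudge)
echo "clean:  $clean_cmd"
echo "smudge: $smudge_cmd"
```

If git check-attr reports the filter as unspecified for a .jpg path, the pattern in .gitattributes is the first thing to look at.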

We also need to create the actual filter logic, the put-bin and get-bin files, to handle the binary files on add and checkout. For this example, the implementation is very simple (no error handling, retries, and so on).

The clean filter (which stores the binaries somewhere else) is a simple bash script that writes the binary it receives on stdin to a directory called binstore at the same level as the Git repository. Git's own hash-object command is used to create a SHA-1 ID for the binary. The ID is used as the filename for the binary in the binstore folder and is written as the output of the filter, so it becomes the content of the binary file as stored in Git.

The filter logic for the put-bin file can be created as follows:

#!/bin/bash
dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"
echo "$sha"

The smudge filter fetches the binaries from the binstore storage on the same level as the Git repository. The content of the file stored in Git, the SHA-1 ID, is received on stdin and is used to output the content of the file by that name in the binstore folder.

The filter logic for the get-bin file can be created as follows:

#!/bin/bash
src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(cat "$tmpfile")
cat "$src/$sha"
rm "$tmpfile"

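Create these two files, put them in the root of the Git repository, and make them executable (for example, with chmod +x put-bin get-bin), since Git runs them as ./put-bin and ./get-bin.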

Now, we are ready to add a JPG image to our repository and see that it is stored somewhere else. We can use the hello_world.jpg image from the exif branch. We can create the file here by querying Git. Find the SHA-1 ID of hello_world.jpg at the tip of the exif branch:

$ git ls-tree --abbrev exif | grep hello_world
100644 blob 5aac2df  hello_world.jpg

Create the file by reading the content from Git to a new file:

$ git cat-file -p 5aac2df > hello_world.jpg

Now, we can add the file, commit it, and check the external storage, which is located at ../binstore relative to the current repository, and then look at the commit content.

Add hello_world.jpg using the following command:

$ git add hello_world.jpg

Commit the contents of the staging area, the hello_world.jpg file:

$ git commit -m 'Added binary'
[no_binaries 19e359d] Added binary
 1 file changed, 1 insertion(+)
 create mode 100644 hello_world.jpg

Check the content of the binstore directory:

$ ls -l ../binstore
total 536
-rw-r--r--  1 aske  staff  272509 May  3 23:24 5aac2dff477eebb3da3cb68843b5cc39745d6447

Finally, we can check the content of the commit with the -p option to display the patch of the commit:

$ git log -1 -p
commit 19e359d774c880fa4f37a3f41a874ba632a31c65
Author: Aske Olsson <[email protected]>
Date:   Sat May 3 22:56:46 2014 +0200

    Added binary

diff --git a/hello_world.jpg b/hello_world.jpg
new file mode 100644
index 0000000..19680e5
--- /dev/null
+++ b/hello_world.jpg
@@ -0,0 +1 @@
+5aac2dff477eebb3da3cb68843b5cc39745d6447

hello_world.jpg is a new file whose content, 5aac2dff477eebb3da3cb68843b5cc39745d6447, is identical, as expected, to the name of the file in the binstore directory.
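
The steps so far only exercise the clean filter. To see the smudge filter at work as well, the file can be removed from the working tree and checked out again. The following is a sketch of the whole round trip in a scratch directory; the demo.jpg file, its text content, and the original.jpg copy are stand-ins invented for the illustration, and the two filters are the recipe's scripts with the variables quoted.

```shell
#!/bin/bash
# Scratch layout: $work/repo is the repository, $work/binstore is created by put-bin.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo" || exit 1

# The clean filter from the recipe.
cat > put-bin <<'EOF'
#!/bin/bash
dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"
echo "$sha"
EOF

# The smudge filter from the recipe.
cat > get-bin <<'EOF'
#!/bin/bash
src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(cat "$tmpfile")
cat "$src/$sha"
rm "$tmpfile"
EOF

chmod +x put-bin get-bin
echo '*.jpg filter=binstore' > .gitattributes
git config filter.binstore.clean "./put-bin"
git config filter.binstore.smudge "./get-bin"

# Stand-in for a real image (hypothetical content), plus a copy to compare against.
printf 'not really a jpeg\n' > demo.jpg
cp demo.jpg ../original.jpg
git add demo.jpg
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'Added binary'

# What Git stored: a single line holding the hash.
stored=$(git cat-file -p HEAD:demo.jpg)
echo "stored in Git: $stored"

# Remove the working copy and let the smudge filter restore it.
rm demo.jpg
git checkout -- demo.jpg
cmp demo.jpg ../original.jpg && echo "round trip OK"
```

The same two last commands (rm followed by git checkout) can be tried on hello_world.jpg in the example repository.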

How it works…

Each time a .jpg file is added, the put-bin filter runs. The filter receives the content of the added file on stdin, and it has to output the result of the filter (what needs to go into Git) on stdout. The following is the filter explained in detail:

dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"

The previous two lines create the binstore directory if it doesn't exist. The directory is created at the same level as the Git repository.

tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"

The tmpfile variable is just a path to a temporary file, tmp, located in the root of the repository. The input received on stdin is written to this file.

sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"

The previous lines use Git's hash-object command to generate a hash for the content of the binary file. We use this hash as an identifier when we move the file to the binstore folder, where the SHA-1 functions as the filename of the binary.

echo "$sha"

Finally, we output the hash of the file to stdout, and this will be what Git stores as the content of the file in the Git database.
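
A side effect of using git hash-object is that the filename in binstore is the same ID Git would give the binary as a blob, which is why the 5aac2df prefix from git ls-tree reappears in the binstore listing. As a minimal sketch of what the command computes, assuming the default SHA-1 object format and that sha1sum is available (on macOS, shasum would be the usual substitute), the ID is the SHA-1 of a short header, "blob <size>" followed by a NUL byte, and then the content. The six-byte string used here is only a stand-in for a binary.

```shell
#!/bin/bash
# The six bytes "hello\n" serve as stand-in content.
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Reproduce it by hand: SHA-1 over the header "blob 6" + NUL, then the content.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git hash-object: $git_id"
echo "manual SHA-1:    $manual_id"
```

Both lines should show the same 40-character ID.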

The smudge filter, which populates our working tree with the correct file contents, also receives its input (from Git) on stdin. The filter needs to find the file in the binstore directory and write the content to stdout for Git to pick it up as the smudged file.

src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"

The first three lines define the path to the binstore folder and a temporary file to which the content received from Git is written.

sha=$(cat "$tmpfile")

The hash of the file we need to get is extracted in the previous line.

cat "$src/$sha"
rm "$tmpfile"

Finally, we can output the real contents of the file to stdout and remove the temporary file.
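
Both filters use a fixed temporary file named tmp in the repository root and carry on regardless of failures. A somewhat more defensive variant is sketched below. It is not the recipe's own implementation: the function names put_bin and get_bin, the incoming.XXXXXX template, and the scratch repository at the end are assumptions made for illustration. The temporary file is created with mktemp inside binstore, each step is checked, and a missing binary is reported on stderr with a non-zero status.

```shell
#!/bin/bash
# Hypothetical hardened variants of put-bin and get-bin, written as functions.

put_bin() {
    local dest tmpfile sha
    dest=$(git rev-parse --show-toplevel)/../binstore || return 1
    mkdir -p "$dest" || return 1
    # A unique temporary file inside $dest keeps the final mv on one file system.
    tmpfile=$(mktemp "$dest/incoming.XXXXXX") || return 1
    cat > "$tmpfile" || { rm -f "$tmpfile"; return 1; }
    sha=$(git hash-object --stdin < "$tmpfile") || { rm -f "$tmpfile"; return 1; }
    mv "$tmpfile" "$dest/$sha" || return 1
    echo "$sha"
}

get_bin() {
    local src sha
    src=$(git rev-parse --show-toplevel)/../binstore || return 1
    sha=$(cat)    # command substitution strips the trailing newline
    if [ ! -f "$src/$sha" ]; then
        echo "get-bin: $sha not found in $src" >&2
        return 1
    fi
    cat "$src/$sha"
}

# Try both directions in a scratch repository (illustrative only).
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo" || exit 1
id=$(printf 'some binary data\n' | put_bin)
restored=$(echo "$id" | get_bin)
echo "stored as $id, restored: $restored"

# An ID that is not in binstore makes get_bin fail instead of printing nothing.
echo "0000000000000000000000000000000000000000" | get_bin || echo "lookup failed as expected"
```

To use these as the actual filters, each function body would go into its own script, with exit in place of return.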

There's more…

The previous filters work transparently with Git on add and checkout, but there are some caveats when using Git attributes, especially with filters like the previous ones:

  • Even though the .gitattributes file can be added and distributed inside the repository, the configuration of the filters can't. The configuration of the filters was the first step of the example, which tells Git which command to run for clean and smudge when the filter is used:
    $ git config filter.binstore.clean "./put-bin"
    $ git config filter.binstore.smudge "./get-bin"
    
  • The configuration can be either local to the repository, global for the user, or global for the system, as we saw in Chapter 2, Configuration. However, none of these configurations can be distributed along with the repository, so it is very important that the configuration is set up right after cloning. Otherwise, the risk of adding a file without running it through the filters is too high.
  • In this example, the storage location of the binaries is just a local directory next to the repository. A better approach could be to copy the binaries to a central storage location, for example, with scp or through a web service. This, however, prevents the user from adding and committing when offline, as the binaries cannot be transferred to the central storage. A solution to this could be a pre-push hook that transfers all the binaries to a binary database before a push happens.
  • Finally, there is no error handling in the previous two filters. If one of them fails, it might make sense to abort the add or checkout and warn the user.
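
Two of these points can be sketched together. Since the filter configuration does not travel with the repository, one option is to commit a small setup script that every user runs once after cloning. In addition, Git's filter.<name>.required setting turns a failing or missing filter into an error instead of letting the content pass through unfiltered. The sketch below rests on assumptions: the setup_filters function name, the scratch repository, and the broken.jpg file are hypothetical, and put-bin is deliberately left out to provoke a failure.

```shell
#!/bin/bash
# Hypothetical helper: configure the binstore filter in the current repository.
setup_filters() {
    git config filter.binstore.clean "./put-bin"
    git config filter.binstore.smudge "./get-bin"
    # Treat a failing or missing filter as an error instead of passing content through.
    git config filter.binstore.required true
}

# Demonstration in a scratch repository (illustrative only).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
echo '*.jpg filter=binstore' > .gitattributes
setup_filters

# put-bin does not exist here, so the clean filter cannot run.
printf 'image data\n' > broken.jpg
if git add broken.jpg 2>/dev/null; then
    add_result="added"
else
    add_result="refused"
fi
echo "git add was $add_result"
```

With the required setting in place, the add is aborted and the file is not staged, so a binary cannot slip into the repository unfiltered.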

See also

There are also other ways of handling binaries in a repository that might be worth considering. These usually introduce extra commands to add and retrieve the binaries.
