Though binaries can't easily be diffed, nothing prevents them from being stored in a Git repository. However, if one or more binaries in a repository are updated frequently, the repository can grow quickly in size, making clones and updates slow because a lot of data needs to be transferred. By using the clean and smudge filters for the binaries, it is possible to move them to another location when they are added to Git and fetch them from that location when a specific version of the file is checked out.
We'll use the same repository as in the previous example, but the no_binaries branch:
$ git clone https://github.com/dvaske/attributes_example.git
$ cd attributes_example
$ git checkout no_binaries
First, we need to set up the clean and smudge filters for the files. We are only going to run the filter on .jpg files in this example, so let's assign the filter to them in .gitattributes:
$ echo '*.jpg filter=binstore' > .gitattributes
Create the configuration:
$ git config filter.binstore.clean "./put-bin"
$ git config filter.binstore.smudge "./get-bin"
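Before writing the filters, it can be worth confirming that Git picked up both the attribute and the filter commands. The following is a minimal sketch that runs in a throwaway repository created with mktemp (an assumption made so the check is self-contained); in the example repository, only the git check-attr and git config --get calls are needed.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repository so the check does not touch a real one.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Same setup as in the text.
echo '*.jpg filter=binstore' > .gitattributes
git config filter.binstore.clean "./put-bin"
git config filter.binstore.smudge "./get-bin"

# Ask Git which filter applies to a .jpg path and which commands are configured.
attr=$(git check-attr filter -- hello_world.jpg)
clean_cmd=$(git config --get filter.binstore.clean)
smudge_cmd=$(git config --get filter.binstore.smudge)

echo "$attr"
echo "clean:  $clean_cmd"
echo "smudge: $smudge_cmd"
```

git check-attr reads .gitattributes from the working tree, so the file does not have to be committed for the check to work.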
We also need to create the actual filter logic, the put-bin and get-bin files, to handle the binary files on add and checkout. For this example, the implementation is very simple (no error handling, retries, and so on).
The clean filter (to store the binaries somewhere else) is a simple bash script that stores the binary it receives on stdin in a directory called binstore at the same level as the Git repository. Git's own hash-object function is used to create a SHA-1 ID for the binary. The ID is used as the filename for the binary in the binstore folder and is written as the output of the filter, so it becomes the content of the binary file when stored in Git.
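The ID scheme can be tried in isolation. The sketch below assumes a throwaway repository created with mktemp and hypothetical sample.bin and other.bin files. It only illustrates that git hash-object derives the ID from the content alone: the same bytes give the same ID whether they come from a file or from stdin, and different bytes give a different ID.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repository; sample.bin and other.bin are hypothetical stand-ins.
work=$(mktemp -d)
cd "$work"
git init -q .

printf 'pretend this is image data\n' > sample.bin
printf 'some other image data\n' > other.bin

# Hash the same content from a file and from stdin.
id_from_file=$(git hash-object --no-filters sample.bin)
id_from_stdin=$(git hash-object --stdin < sample.bin)

# Hash different content for comparison.
id_other=$(git hash-object --no-filters other.bin)

echo "file:  $id_from_file"
echo "stdin: $id_from_stdin"
echo "other: $id_other"
```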
The filter logic for the put-bin file can be created as follows:
#!/bin/bash
# Clean filter: store the binary from stdin in ../binstore and print its ID
dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"
echo "$sha"
The smudge filter fetches the binaries from the binstore storage on the same level as the Git repository. The content of the file stored in Git, the SHA-1 ID, is received on stdin and is used to output the content of the file by that name in the binstore folder.
The filter logic for the get-bin file can be created as follows:
#!/bin/bash
# Smudge filter: read the ID from stdin and print the binary stored under it
src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(cat "$tmpfile")
cat "$src/$sha"
rm "$tmpfile"
Create these two files and put them in the root of the Git repository.
Now, we are ready to add a JPG image to our repository and see that it is stored somewhere else. We can use the hello_world.jpg image from the exif branch. We can create the file here by querying Git. Find the SHA-1 ID of hello_world.jpg at the tip of the exif branch:
$ git ls-tree --abbrev exif | grep hello_world
100644 blob 5aac2df hello_world.jpg
Create the file by reading the content from Git to a new file:
$ git cat-file -p 5aac2df > hello_world.jpg
Now, we can add the file, commit it, check the external storage, which is located at ../binstore relative to the current repository, and inspect the commit content:
Add hello_world.jpg using the following command:
$ git add hello_world.jpg
Commit the contents of the staging area, the hello_world.jpg file:
$ git commit -m 'Added binary'
[no_binaries 19e359d] Added binary
 1 file changed, 1 insertion(+)
 create mode 100644 hello_world.jpg
Check the content of the binstore directory:
$ ls -l ../binstore
total 536
-rw-r--r-- 1 aske staff 272509 May 3 23:24 5aac2dff477eebb3da3cb68843b5cc39745d6447
Finally, we can check the content of the commit with the -p option to display the patch of the commit:
$ git log -1 -p
commit 19e359d774c880fa4f37a3f41a874ba632a31c65
Author: Aske Olsson <[email protected]>
Date:   Sat May 3 22:56:46 2014 +0200

    Added binary

diff --git a/hello_world.jpg b/hello_world.jpg
new file mode 100644
index 0000000..19680e5
--- /dev/null
+++ b/hello_world.jpg
@@ -0,0 +1 @@
+5aac2dff477eebb3da3cb68843b5cc39745d6447
hello_world.jpg is a new file whose content, 5aac2dff477eebb3da3cb68843b5cc39745d6447, is identical, as expected, to the name of the file in the binstore directory.
Each time a .jpg file is added, the put-bin filter runs. The filter receives the content of the added file on stdin, and it has to output the result of the filter (what needs to go into Git) on stdout. The following is the filter explained in detail:
dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"
The previous two lines create the binstore directory if it doesn't exist. The directory is created at the same level as the Git repository:
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
The tmpfile variable is just a path to a temporary file, tmp, located in the root of the repository. The input received on stdin is written to this file.
sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"
The previous lines use Git's hashing function to generate a hash for the content of the binary file. We'll use the hash of the file as an identifier when we move it to the binstore folder, where the SHA-1 will function as the filename of the binary.
echo "$sha"
Finally, we output the hash of the file to stdout, and this will be what Git stores as the content of the file in the Git database.
The smudge filter, which populates our working tree with the correct file contents, also receives the content (from Git) on stdin. The filter needs to find the file in the binstore directory and write the content to stdout for Git to pick it up as the smudged file.
src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
The first three lines define the path to the binstore folder and a temporary file to which the content received from Git is written.
sha=$(cat "$tmpfile")
The hash of the file we need to get is extracted in the previous line.
cat "$src/$sha"
rm "$tmpfile"
Finally, we can output the real contents of the file to stdout and remove the temporary file.
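Both filters can be exercised end to end through Git itself. The following is a rough sketch under a few assumptions: it uses a scratch directory from mktemp instead of the example repository, a hypothetical sample.jpg containing plain text rather than a real image, and placeholder author details for the commit. It also marks the two scripts as executable with chmod, which they need to be for Git to run them as ./put-bin and ./get-bin.

```shell
#!/bin/bash
set -euo pipefail

# Scratch layout (assumption): $work/repo is the repository and
# $work/binstore is where the clean filter will place the binaries.
work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q .

# Recreate the two filters from the text and make them executable.
cat > put-bin <<'EOF'
#!/bin/bash
dest=$(git rev-parse --show-toplevel)/../binstore
mkdir -p "$dest"
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(git hash-object --no-filters "$tmpfile")
mv "$tmpfile" "$dest/$sha"
echo "$sha"
EOF
cat > get-bin <<'EOF'
#!/bin/bash
src=$(git rev-parse --show-toplevel)/../binstore
tmpfile=$(git rev-parse --show-toplevel)/tmp
cat > "$tmpfile"
sha=$(cat "$tmpfile")
cat "$src/$sha"
rm "$tmpfile"
EOF
chmod +x put-bin get-bin

echo '*.jpg filter=binstore' > .gitattributes
git config filter.binstore.clean "./put-bin"
git config filter.binstore.smudge "./get-bin"

# Hypothetical stand-in for an image; any bytes work for the round trip.
printf 'not really a JPEG\n' > sample.jpg
expected_id=$(git hash-object --no-filters sample.jpg)

# The clean filter runs on add; author details are placeholders.
git add .gitattributes sample.jpg
git -c user.name='Example' -c user.email='example@example.invalid' \
    commit -q -m 'Added binary'

# What Git stored for the file: the ID, not the binary itself.
stored_in_git=$(git cat-file -p HEAD:sample.jpg)

# The smudge filter runs on checkout and restores the real content.
rm sample.jpg
git checkout -- sample.jpg
restored=$(cat sample.jpg)

echo "stored in Git: $stored_in_git"
echo "restored:      $restored"
```

If the smudge filter could not find the ID in binstore, the restored file would end up empty or incomplete, which is one reason the missing error handling matters in practice.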
The previous filters work transparently with Git on add and checkout, but there are some caveats when using Git attributes, and especially filters like the previous ones, which are:
- Although the .gitattributes file can be added and distributed inside the repository, the configuration of the filters can't. The configuration of the filters was the first step of the example, which tells Git which command to run for clean and smudge when the filter is used:
  $ git config filter.binstore.clean "./put-bin"
  $ git config filter.binstore.smudge "./get-bin"
- The binaries could instead be kept in a central location, accessed for example with scp or through a web service. This, however, prevents the user from adding and committing when offline, as the binaries cannot be stored in the central repository. A solution to this could be a pre-push hook that transfers all the binaries to a binary database before a push happens.
- The example filters have no error handling. A more robust filter would detect failures during add or checkout and warn the user.

There are also other ways of handling binaries in a repository that might be worth considering. These usually introduce extra commands to add and retrieve the binaries. The following are examples of binary handlers: