Chapter 9. Managing Subprojects – Building a Living Framework

In Chapter 5, Collaborative Development with Git, you learned how to manage multiple repositories, while Chapter 6, Advanced Branching Techniques, taught us various development techniques utilizing multiple branches and multiple lines of development in these repositories. Up till now, these multiple repositories were all repositories of a single project. Different projects were developed independently of each other, and their repositories were autonomous.

This chapter will explain and show different ways to connect different subprojects in a single repository of the framework project, from the strong inclusion by embedding the code of one project in the other (subtrees), to the light connection between projects by nesting repositories (submodules). You will learn how to add a subproject to a master project, how to update the superproject state, and how to update a subproject. We will find out how to send our changes upstream, backporting them into the appropriate project and pushing them to the appropriate repository. Different techniques of managing subprojects have different advantages and drawbacks here.

Submodules are sometimes used to manage large assets. This chapter will also present alternative solutions to the problem of handling large binary files and other large assets in Git.

In this chapter, we will cover the following topics:

  • Managing library and framework dependencies
  • Dependency management tools—managing dependencies outside Git
  • Importing code into a superproject as a subtree
  • Using subtree merges; the git-subtree and git-stree tools
  • Nested repositories: a subproject inside a superproject
  • Internals of submodules: gitlinks, .gitmodules, the .git file
  • Use cases for subtrees and submodules, comparison of approaches
  • Alternative third-party solutions and tools/helpers
  • Git and large files

Managing library and framework dependencies

There are various reasons to join an external project to your own project. Because there are different reasons to include a project (let's call it a subproject, or a module) inside another project (let's call it a superproject, or a master project, or a container), there are different types of inclusions geared towards different circumstances. They all have their advantages and disadvantages, and it is important to understand these to be able to choose the correct solution for your problem.

Let's assume that you work on a web application, and that your webapp uses JavaScript (for example, for AJAX, perhaps as a single-page app). To make it easier to develop, you probably use some JavaScript library or a web framework, such as jQuery.

Such a library is a separate project. You would want to be able to pin it to a known working version (to avoid problems where future changes to the library would make it stop working for your project), while also being able to review changes and automatically update it to the new version. Perhaps, you would want to make your own changes to the library, and send the proposed changes to the upstream (of course, you would want for users of your project to be able to use the library with your out-of-tree fixes, even if they are not yet accepted by original developers). Conceivably, you might have customizations and changes that you don't want to publish (send to the upstream), but you might still make them available.

This is all possible in Git. There are two main solutions for including subprojects: importing code into your project with the subtree merge strategy and linking subprojects with submodules.

Both submodules and subtrees aim to reuse the code from another project, which usually has its own repository, putting it somewhere inside your own repository's working directory tree. The goal is usually to benefit from the central maintenance of the reused code across a number of container repositories, without having to resort to clumsy, unreliable manual maintenance (usually by copy-pasting).

Sometimes, it is more complicated. The typical situation in many companies is that they use many in-house produced applications, which depend on a common utility library or on a set of libraries. You would usually want to develop each such application separately, use it together with others, branch and merge, and apply your own changes and customizations, all in a separate Git repository. That said, there are arguments for having a single monolithic repository instead, such as simplified organization, dependencies, cross-project changes, and tooling, if you can get away with it.

But this division, one Git repository for one application, is not without problems. What to do with the common library? Each application uses some specific version of the library and you need to keep track of which one. If the library gets improved, you need to test whether this new version works correctly with your code and doesn't crash your application. But the common library is not usually developed as a standalone project; its development is driven by the needs of the projects that use it. Developers improve it to enhance it with new features needed for their applications. At some point, they would want to send their changes to the library itself to share their changes with other developers, if only to share the burden of maintaining these features (the out-of-tree patches bring maintenance costs to keep them current).

What to do then? This chapter describes a few strategies used to manage subprojects. For each technique, we will detail how to add such subprojects to superprojects, how to keep them up to date, how to create your own changes, and how to publish selected changes upstream.

Note

Note that all the solutions require that all the files of a subproject are contained in a single subdirectory of a superproject. No currently available solution allows you to mix the subproject files with other files or have them occupy more than one directory.

However you manage subprojects, be it subtrees, submodules, third-party tools or dependency management outside Git, you should strive for the module code to remain independent of the particularities of the superproject (or at least, handle such particularities using an external, possibly nonversioned configuration). Using superproject-specific modifications goes against modularization and encapsulation principles, unnecessarily coupling the two projects.

Managing dependencies outside Git

In many cases, the technological context (the development stack used) allows you to use packaging and formal dependency management. If it is possible, it is usually preferable to go this route. It lets you split your codebase better and avoid a number of side effects, complications, and pitfalls that litter the submodule and subtree solution space (with different complications for different techniques). It takes the version control system out of the job of managing modules. It also lets you benefit from versioning schemes, such as semantic versioning (http://semver.org/), for your dependencies.

As a reminder, here's a partial list (in the alphabetical order) of the main languages and development stacks, and their dependency management/packaging systems and registries (see the full comparison at http://www.modulecounts.com/):

  • Clojure has Clojars
  • Go has GoDoc
  • Haskell has Hackage (registry) and cabal (application)
  • Java has Maven Central (Maven and Gradle)
  • JavaScript has npm (for Node.js) and Bower
  • .NET has NuGet
  • Objective-C has CocoaPods
  • Perl has CPAN (Comprehensive Perl Archive Network) and carton
  • PHP has Composer, Packagist, and good old PEAR and PECL
  • Python has PyPI (Python Package Index) and pip
  • Ruby has Bundler and RubyGems
  • Rust has crates.io (registry) and Cargo

Sometimes, these are not enough. You might need to apply some out-of-tree patches (changes) to customize the module (subproject) for your needs. But for some reason, you are unable to publish these changes upstream, to have them accepted. Perhaps the changes are relevant only to your specific project, or the upstream is slow to respond to the proposed changes, or perhaps there are license considerations. Maybe the subproject in question is an in-house module that cannot be made public and which you are required to use for your company projects.

In all these cases, you need a custom package registry (a package repository) to be used in addition to the default one, or you need the subprojects to be managed as private packages, which these systems often allow. If there is no support for private packages, a tool to manage the private registry, such as Pinto or CPAN::Mini for Perl, would also be needed.

Manually importing the code into your project

Let's take a look at one of the possibilities: why don't we simply import the library into some subdirectory in our project? If you need to bring it up to date, you would just copy the new version as a new set of files. In this approach, the subproject code is embedded inside the code of the superproject.

The simplest solution would be to just overwrite the contents of the subproject's directory each time we want to update the superproject to use the new version. If the project you want to import doesn't use Git, or if it doesn't use a version control system at all, or if the repository it uses is not public, this will indeed be the only possible solution.

Tip

Using repositories from a foreign VCS as a remote

If the project you want to import (to embed) uses a version control system other than Git, but there is a good conversion mechanism (for example, with a fast-import stream), you can use remote helpers to set up a foreign VCS repository as a remote repository (via automatic conversion). You can check Chapter 5, Collaborative Development with Git, and Chapter 10, Customizing and Extending Git for more information.

This can be done, for example, with the Mercurial and Bazaar repositories, thanks to the git-remote-hg and git-remote-bzr helpers.

Moving to the new version of the imported library is quite simple (and the mechanism is easy to understand). Remove all the files from the directory, add the files from the new version of the library, for example by extracting them from the archive, then run git add on the directory:

$ rm -rf mylib/
$ git rm -r mylib
$ tar -xzf /tmp/mylib-0.5.tar.gz
$ mv mylib-0.5 mylib
$ git add mylib
$ git commit

This method works quite well in simple cases with the following caveats:

  • In Git, in the history of your project, you have only the versions of the library at the time of imports. On the one hand, this makes your project history clean and easy to understand; on the other hand, you don't have access to the fine-grained history of a subproject. For example, when using git bisect, you would only be able to find that a bug was introduced by upgrading the library, but not the exact commit in the history of the library that introduced the bug in question.
  • If you want to customize the code of the library, fitting it to your project by adding the changes dependent on your application, you would need to reapply those customizations in some way after you import a new version. You could extract your changes with git diff (comparing it to the unchanged version at the time of import) and then use git apply after upgrading the library. Or, you could use a rebase, an interactive rebase, or some patch management interface; see Chapter 8, Keeping History Clean. Git won't do this automatically.
  • Each import of a new version of the library requires running a specific sequence of commands to update the superproject: removing the old version of the files, adding the new ones, and committing the change. It is not as easy as running git pull, though you can use scripts or aliases to help.
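The update sequence from the last caveat lends itself to a small script. What follows is only a sketch under stated assumptions: the import_release function is a hypothetical helper (not a Git command), the archives are assumed to unpack into a directory named mylib-<version>/, and the two fake releases are created on the spot so that the example can be run in a scratch directory.

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# import_release <tarball> <version>   (hypothetical helper, run in the repo root)
# Assumes the archive unpacks into a directory named mylib-<version>/.
import_release() {
    tarball=$1
    version=$2
    if [ -d mylib ]; then
        git rm -r -q mylib    # remove tracked files from index and working area
        rm -rf mylib          # remove any untracked leftovers
    fi
    tar -xzf "$tarball"
    mv "mylib-$version" mylib
    git add mylib
    git commit -q -m "Import mylib $version"
}

# Scratch setup: two fake releases; version 0.5 drops a file that 0.4 had.
work=$(mktemp -d)             # remove this directory when you are done
cd "$work"
mkdir mylib-0.4 mylib-0.5
echo "mylib 0.4" > mylib-0.4/README
echo "obsolete"  > mylib-0.4/old.txt
echo "mylib 0.5" > mylib-0.5/README
tar -czf mylib-0.4.tar.gz mylib-0.4
tar -czf mylib-0.5.tar.gz mylib-0.5
rm -rf mylib-0.4 mylib-0.5

git init -q app
cd app
import_release ../mylib-0.4.tar.gz 0.4
import_release ../mylib-0.5.tar.gz 0.5

# One commit per imported release; the second one also records the deletion.
git log --oneline --name-status
```

Because the old directory is removed before the new one is extracted, files that disappeared upstream also disappear from the superproject. Any local customizations of the library would be lost in the process, as the second caveat explains.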

A Git subtree for embedding the subproject code

In a slightly more advanced solution, you use the subtree merge to join the history of a subproject to the history of a superproject. This is only somewhat more complicated than an ordinary pull (at least, after the subproject is imported), but provides a way to automatically merge changes together.

Depending on your requirements, this method might fit well with your needs. It has the following advantages:

  • You would always have the correct version of the library, never using the wrong library version by accident.
  • The method is simple to explain and understand, using only the standard (and well-known) Git features. As you will see, the most important and most commonly used operations are easy to do and easy to understand, and it is hard to go wrong.
  • The repository of your application is always self-contained; therefore, cloning it (with plain old git clone) will always include everything that's needed. This means that this method is a good fit for the required dependencies.
  • It is easy to apply patches (for example, customizations) to the library inside your repository, even if you don't have the commit rights to the upstream repository.
  • Creating a new branch in your application also creates a new branch for the library; it is the same for switching branches. That's the behavior you expect. This is contrasted with the submodule's behavior (the other technique for managing subprojects).
  • If you are using the subtree merge strategy (described briefly in Chapter 7, Merging Changes Together), for example with git pull -s subtree, then getting a new library version will be as easy as updating all the other parts of your project.

Unfortunately however, this technique is not without its disadvantages. For many people and for many projects, these disadvantages do not matter. The simplicity of the subtree-based method usually prevails over its faults.

Here are the problems with the subtree approach:

  • Each application using the library duplicates its files. There is no easy and safe way to share its objects among different projects and different repositories, though see the following tip about the possibility of sharing the Git object database.
  • Each application using the library has its files checked out in the working area, though you can change it with the help of the sparse checkout (described later in the chapter).
  • If your application introduces changes to its copy of the library, it is not that easy to publish these changes and send them upstream. Third-party tools such as git subtree or git stree can help here. They have specialized subcommands to extract the subproject's changes.
  • Because of the lack of separation between the subproject files and the superproject files, it is quite easy to mix the changes to the library and the changes to the application in one commit. In such cases, you might need to rewrite the history (or the copy of a history), as described in Chapter 8, Keeping History Clean.

The first two issues mean that subtrees are not a good fit to manage the subprojects that are optional dependencies (needed only for some extra features) or optional components (such as themes, extensions, or plugins), especially those that are installed by a mere presence in the appropriate place in the filesystem hierarchy.

Tip

Sharing objects between forks (copies) with alternates

You can mitigate the duplication of objects in the repository with alternates or, in other words, with git clone --reference. However, you would then need to take greater care with garbage collection. The problematic parts are those parts of the history that are referenced in the borrower repository (that is, the one with alternates set up), but are not referenced in the lender (reference) repository. The description and explanation of the alternates mechanism will be presented in Chapter 11, Git Administration.
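To see what borrowing objects looks like on disk, here is a minimal sketch that can be run in a scratch directory. The repository names upstream, lender, and borrower are made up for this illustration, and a file:// URL is used for the second clone only so that Git goes through its regular transport instead of copying the object files locally.

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)             # remove this directory when you are done
cd "$work"

# upstream: stands in for the shared project repository (hypothetical content).
git init -q upstream
echo "shared code" > upstream/lib.txt
git -C upstream add lib.txt
git -C upstream commit -q -m "Initial commit"

# lender: an ordinary clone that already holds all the objects.
git clone -q upstream lender

# borrower: a second clone that is told to borrow objects from the lender.
git clone -q --reference "$work/lender" "file://$work/upstream" borrower

# The borrowing is recorded as a path to the lender's object directory.
cat borrower/.git/objects/info/alternates
git -C borrower count-objects -v
```

The borrower keeps working only as long as the lender keeps the borrowed objects, which is why garbage collection in the lender needs care.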

There are different technical ways to handle and manage the subtree-imported subprojects. You can use classic Git commands, just using the appropriate options while affecting the subproject, such as --strategy=subtree (or the subtree option to the default recursive merge strategy, --strategy-option=subtree=<path>) for merge, cherry-pick, and related operations. This manual approach works everywhere, is actually quite simple in most cases, and offers the best degree of control over operations. It requires, however, a good understanding of the underlying concepts.

In modern Git (since version 1.7.11), the git subtree command is available among the installed binaries. It comes from the contrib/ area and is not fully integrated (for example, with respect to its documentation). This script is well tested and robust, but some of its notions are rather peculiar or confusing, and the command does not support the whole range of possible subtree operations. Additionally, this tool supports only the import with history workflow (which will be defined later), which some say clutters the history graph.

There are also other third-party scripts that help with subtrees; among them is git-stree.

Creating a remote for a subproject

Usually, while importing a subproject, you would want to be able to update the embedded files easily. You would want to continue interacting with the subproject. For this, you would add that subproject (for example, the common library) as a remote reference in your own (super) project and fetch it:

$ git remote add mylib_repo https://git.example.com/mylib.git
$ git fetch mylib_repo
warning: no common commits
remote: Counting objects: 12, done.
remote: Total 12 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (12/12), done.
From https://git.example.com/mylib.git
* [new branch]      master     -> mylib_repo/master

You can then examine the mylib_repo/master remote-tracking branch, which can be done either by checking it out into the detached HEAD with git checkout mylib_repo/master, or by creating a local branch out of it and checking this local branch out with git checkout -b mylib_branch mylib_repo/master. Alternatively, you can just list its files with git ls-tree -r --abbrev mylib_repo/master. You will see then that the subproject has a different project root from your superproject. Additionally, as seen from the warning: no common commits message, this remote-tracking branch contains a completely different history coming from a separate project.

Adding a subproject as a subtree

If you are not using specialized tools like git subtree but a manual approach, the next step will be a bit complicated and will require you to use some advanced Git concepts and techniques. Fortunately, it needs to be done only once.

First, if you want to import the subproject history, you would need to create a merge commit that will import the subproject in question. You need to have the files of the subproject in the given directory in a superproject. Unfortunately, at least, with the current version of Git as of writing this chapter, using the -Xsubtree=mylib/ merge strategy option would not work as expected. We would have to do it in two steps: prepare the parents and then prepare the contents.

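The first step would then be to prepare a merge commit using the ours merge strategy, but without creating it (writing it to the repository). This strategy joins histories, but takes the current version of the files from the current branch. Note that Git 2.9 and later refuses to merge histories that share no common commit unless you pass --allow-unrelated-histories; leave this option out with older versions, which do not know it:

$ git merge --no-commit --strategy=ours --allow-unrelated-histories mylib_repo/master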
Automatic merge went well; stopped before committing as requested

If you want to have a simple history, similar to the one we get from just copying files, you can skip this step.

We now need to update our index (the staging area for the commits) with the contents of the master branch from the library repository, and update our working directory with it. All this needs to happen in the proper subfolder too. This can be done with the low-level (plumbing) git read-tree command:

$ git read-tree --prefix=mylib/ -u mylib_repo/master
$ git status
On branch master
All conflicts fixed but you are still merging.
  (use "git commit" to conclude merge)

Changes to be committed:

        new file:   mylib/README
        [...]

We have used the -u option, so the working directory is updated along with the index.

Note

It is important to not forget the trailing slash in the argument of the --prefix option. Checked out files are literally prefixed with it.

This set of steps is described in the HOWTO section of the Git documentation, namely in How to use the subtree merge strategy (https://www.kernel.org/pub/software/scm/git/docs/howto/using-merge-subtree.html).
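The two steps can be tried end to end in a scratch directory. The following is a sketch rather than a recipe: the library repository is a local stand-in with made-up content (so the remote points to a local path instead of https://git.example.com/mylib.git), and the --allow-unrelated-histories option is included on the assumption that Git 2.9 or later is used.

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)             # remove this directory when you are done
cd "$work"

# Stand-in for the library's upstream repository (hypothetical content).
git init -q mylib
git -C mylib symbolic-ref HEAD refs/heads/master
echo "mylib documentation" > mylib/README
git -C mylib add README
git -C mylib commit -q -m "mylib: initial commit"

# The superproject that will embed the library.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
echo "application code" > main.txt
git add main.txt
git commit -q -m "app: initial commit"

git remote add mylib_repo "$work/mylib"
git fetch -q mylib_repo

# Step 1: prepare the parents. Nothing is committed yet and our files are kept.
# (Assumes Git 2.9+; drop --allow-unrelated-histories on older versions.)
git merge -q --no-commit --strategy=ours --allow-unrelated-histories mylib_repo/master

# Step 2: prepare the contents, placing the library tree under mylib/
# in both the index and the working directory.
git read-tree --prefix=mylib/ -u mylib_repo/master

git commit -q -m "Merge mylib as a subtree under mylib/"
git log --oneline --graph
```

The resulting merge commit has the library's history as its second parent, while the library's files live only under mylib/.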

It is much easier to use tools such as git subtree:

$ git subtree add --prefix=mylib mylib_repo master
git fetch mylib_repo master
Added dir 'mylib'

The git subtree command would fetch the subtree's remote when necessary; there's no need for the manual fetch that you had to perform in the manual solution.

If you examine the history, for example, with git log --oneline --graph --decorate, you will see that this command merged the library's history with the history of the application (of the superproject). If you don't want this, tough luck. The --squash option that git subtree offers on its add, pull, and merge subcommands won't help here. One of the peculiarities of this tool is that this option doesn't create a squash merge, but simply merges the squashed subproject's history (as if it were squashed with an interactive rebase). See Fig 2 later in the chapter.

If you want a subtree without its history attached to the superproject history, consider using git-stree. It has the additional advantage that it remembers the subtree settings and that it would create a remote if necessary:

$ git stree add mylib_repo -P mylib \
  https://git.example.com/mylib.git master
warning: no common commits
[master 5e28a71] [STree] Added stree 'mylib_repo' in mylib
 5 files changed, 32 insertions(+)
 create mode 100644 mylib/README
[...]

  STree 'mylib_repo' configured, 1st injection committed.

The information about the subtree's prefix (subdirectory), the branch, and so on is stored in the local configuration in the stree.<name> group. This is in contrast to the behavior of git subtree, where you need to provide the prefix argument on each command.

Cloning and updating superprojects with subtrees

All right! Now that we have our project with a library embedded as a subtree, what do we need to do to get it? Because the concept behind subtrees is to have just one repository, the container, you can simply clone this repository.

To get an up-to-date repository, you just need a regular pull; this would bring both the superproject (the container) and the subproject (the library) up to date. This works regardless of the approach taken, the tool used, and the manner in which the subtree was added. It is a great advantage of the subtrees approach.

Getting updates from subprojects with a subtree merge

Let's see what happens if there are some new changes in the subproject since we imported it. It is easy to bring the version embedded in the superproject up to date:

$ git pull --strategy subtree mylib_repo master
From https://git.example.com/mylib.git
 * branch            master     -> FETCH_HEAD
Merge made by the 'subtree' strategy.

You could have fetched and then merged instead, which allows for greater control. Or, you could have rebased instead of merging, if you prefer; that works too.

Note

Don't forget to select the merge strategy with -s subtree while pulling a subproject. Merging could work even without it, because Git does rename detection and would usually be able to discover that the files were moved from the root directory (in the subproject) to a subdirectory (in the superproject we are merging into). The problematic case is when there are conflicting files inside and outside of the subproject. Potential candidates are Makefiles and other standard filenames.

If there are some problems with Git detecting the correct directory to merge into, or if you need advanced features of an ordinary recursive merge strategy (which is the default), you can instead use -Xsubtree=<path/to/subproject>, the subtree option of the recursive merge strategy.

You may need to adjust other parts of the application code to work properly with the updated code of the library.
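As an illustration of the fetch-then-merge variant with an explicit path, here is a sketch that builds a scratch superproject with an embedded library and then brings in a later upstream change with -Xsubtree=mylib. The repositories and their contents are made up for the example, and the initial import assumes Git 2.9 or later (hence --allow-unrelated-histories).

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)             # remove this directory when you are done
cd "$work"

# Stand-in for the library's upstream repository (hypothetical content).
git init -q mylib
git -C mylib symbolic-ref HEAD refs/heads/master
echo "mylib documentation" > mylib/README
git -C mylib add README
git -C mylib commit -q -m "mylib: initial commit"

# Superproject with the library imported, with history, under mylib/.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
echo "application code" > main.txt
git add main.txt
git commit -q -m "app: initial commit"
git remote add mylib_repo "$work/mylib"
git fetch -q mylib_repo
git merge -q --no-commit --strategy=ours --allow-unrelated-histories mylib_repo/master
git read-tree --prefix=mylib/ -u mylib_repo/master
git commit -q -m "Merge mylib as a subtree under mylib/"

# Meanwhile, the library gains a new commit upstream.
echo "new feature notes" >> "$work/mylib/README"
git -C "$work/mylib" commit -q -am "mylib: document new feature"

# Fetch first, then merge, telling Git explicitly where the subproject lives.
git fetch -q mylib_repo
git merge -q -Xsubtree=mylib -m "Merge mylib updates" mylib_repo/master

git log --oneline --graph
```

Naming the directory explicitly removes the guesswork about where the upstream files should land, so the change to README arrives in mylib/README and nothing is written to the root of the superproject.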

Note that, with this solution, you have a subproject history attached to your application history, as you can see in Fig 1:

Fig 1: History of a superproject with a subtree-merged subproject

If you don't want to have the history of a subproject entangled in the history of a master project, and prefer a simple-looking history (as shown in Fig 2), you can use the --squash option of the git merge (or git pull) command to squash it.

$ git merge -s subtree --squash mylib_repo/master
Squash commit -- not updating HEAD
Automatic merge went well; stopped before committing as requested
$ git commit -m "Updated the library"

In this case, the history would record only the fact that the version of the subproject had changed, which has its advantages and disadvantages. You get a simpler history, but also one that has lost the details of the subproject's development.

With the git subtree or git stree tools, it is enough to use their pull subcommand; they supply the subtree merge strategy themselves. However, currently git subtree pull requires you to respecify --prefix and the entire subtree settings.

Fig 2: Different types of subtree merges: (a) subtree merge: git pull -s subtree and git subtree pull, (b) subtree merge of squashed commits: git subtree pull --squash, (c) squashed subtree merge: git pull -s subtree --squash and git stree. Note that the dotted line in (c) denotes how commits C2 and C4 were made; it does not denote a parent commit.

Note that the git subtree command always merges, even with the --squash option; it simply squashes the subproject commits before merging (like the squash instruction in the interactive rebase). In turn, git stree pull always squashes the merge (like git merge --squash), which keeps the superproject history and subproject history separated without polluting the graph of the history. All this can be seen in Fig 2.

Showing changes between a subtree and its upstream

To find out the differences between the subproject and the current version in the working directory, you need a nontypical selector syntax for git diff. This is because all the files in the subproject (for example, in the mylib_repo/master remote-tracking branch) are in the root directory, while they are in the mylib/ directory in the superproject (for example, in master). We need to select the subdirectory to be compared with master, putting it after the revision identifier and the colon (skipping it would mean that it would be compared with the root directory of the superproject).

The command looks as follows:

$ git diff master:mylib mylib_repo/master

Similarly, to check after the subtree merge whether the commit we just created (HEAD) has the same contents in the mylib/ directory as the merged in commit, that is, HEAD^2, we can use:

$ git diff HEAD:mylib HEAD^2
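The <revision>:<path> selector can be tried out in a scratch repository. This sketch uses the simple-history import (contents only, without the merge of histories) to keep the setup short; the repositories and their contents are made up for the example.

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)             # remove this directory when you are done
cd "$work"

# Stand-in for the library's upstream repository (hypothetical content).
git init -q mylib
git -C mylib symbolic-ref HEAD refs/heads/master
echo "mylib documentation" > mylib/README
git -C mylib add README
git -C mylib commit -q -m "mylib: initial commit"

# Superproject with the library contents imported under mylib/.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
echo "application code" > main.txt
git add main.txt
git commit -q -m "app: initial commit"
git remote add mylib_repo "$work/mylib"
git fetch -q mylib_repo
git read-tree --prefix=mylib/ -u mylib_repo/master
git commit -q -m "Import mylib"

# Right after the import, the subdirectory and the upstream tree are the same.
git diff --quiet master:mylib mylib_repo/master && echo "mylib/ matches upstream"

# A local customization then shows up as a difference against upstream.
echo "local tweak" >> mylib/README
git commit -q -am "Customize mylib"
git diff --stat master:mylib mylib_repo/master
```

Keep in mind the direction of the comparison: with master:mylib given first, a line added locally is reported as removed on the way to the upstream version.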

Sending changes to the upstream of a subtree

In some cases, the subtree code of a subproject can only be used or tested inside the container code; most themes and plugins have such constraints. In this situation, you'll be forced to evolve your subtree code straight inside the master project code base, before you finally backport it to the subproject upstream.

These changes often require adjustments in the rest of the superproject code; though it is recommended to make two separate commits (one for the subtree code change and one for the rest), it is not strictly necessary. You can tell Git to extract only the subproject changes. The problem is with the commit messages of the split changes, as Git is not able to automatically extract relevant parts of the changeset description.

Another common occurrence, which is best avoided but is sometimes necessary, is the need to customize the subproject's code in a container-specific way (configure it specifically for a master project), usually without pushing these changes back upstream. You should carefully distinguish between the two situations, keeping each use case's changes (backportable and nonbackportable) in their own commits.

There are different ways to deal with this issue. You can avoid the problem of extracting changes to be sent upstream by requiring that all the subtree changes have to be done in a separate module-only repository. If it is possible, we can even require that all the subproject changes have to be sent upstream first, and we can get the changes into the container only through upstream acceptance.

If you need to be able to extract the subtree changes, then one possible solution is to utilize git filter-branch --subdirectory-filter (or --index-filter with the appropriate script). Another simple solution is to just use git subtree push. Both methods, however, backport every commit that touches the subtree in question.

If you want to send upstream only a selection of the subproject changes that made it into the master project repository, then the solution is a bit more complicated. One possibility is to create a local branch meant specifically for backporting out of the subproject remote-tracking branch. Forking it from said subtree-tracking branch means that it has the subtree as the root and that it would include only the subproject files.

This branch intended for backporting changes to the subproject would need to have the appropriate branch in the remote of the subproject upstream repository as its upstream branch. With such setup, we would then be able to git cherry-pick --strategy=subtree the commits we're interested in sending to the subproject's upstream onto it. Then, we can simply git push this branch into the subproject's repository.

Note

It is prudent to specify --strategy=subtree even if cherry-pick would work without it, to make sure that the files outside the subproject's directory (outside subtree) will get quietly ignored. This can be used to extract the subtree changes from the mixed commit; without this option, Git will refuse to complete the cherry-pick.
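The backporting branch technique can be sketched in a scratch setup as follows. Everything here is made up for illustration: the repositories, the backport-mylib and proposed-fix branch names, and the mixed commit that touches both the library and the application. The upstream stand-in is an ordinary (non-bare) repository, so the example pushes to a new branch there rather than to its checked-out master.

```shell
set -e
# Throwaway identity so that committing works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)             # remove this directory when you are done
cd "$work"

# Stand-in for the library's upstream repository (hypothetical content).
git init -q mylib
git -C mylib symbolic-ref HEAD refs/heads/master
echo "mylib documentation" > mylib/README
git -C mylib add README
git -C mylib commit -q -m "mylib: initial commit"

# Superproject with the library imported, with history, under mylib/.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
echo "application code" > main.txt
git add main.txt
git commit -q -m "app: initial commit"
git remote add mylib_repo "$work/mylib"
git fetch -q mylib_repo
git merge -q --no-commit --strategy=ours --allow-unrelated-histories mylib_repo/master
git read-tree --prefix=mylib/ -u mylib_repo/master
git commit -q -m "Merge mylib as a subtree under mylib/"

# A mixed commit: it changes the library and the application together.
echo "support for debug symbols" >> mylib/README
echo "uses debug symbols" >> main.txt
git commit -q -am "Support for creating debug symbols"
fix=$(git rev-parse HEAD)

# Backporting branch forked from the subproject's remote-tracking branch;
# its root directory is the root of the library.
git checkout -q -b backport-mylib mylib_repo/master

# Pick the commit; only the part under mylib/ is carried over.
git cherry-pick --strategy=subtree "$fix"

# Publish the result (to a new branch, as the stand-in upstream is non-bare).
git push -q mylib_repo backport-mylib:refs/heads/proposed-fix
git show --stat HEAD
```

The commit message is carried over unchanged, so it may still mention the application-side tweak; you may want to reword it (for example, with git commit --amend) before publishing.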

This requires many more steps than an ordinary git push. Fortunately, you need to face this problem only while sending the changes made in the superproject repository back to the subproject. As you have seen, fetching changes from the subproject into the superproject is much, much simpler.

Well, using git-stree would make this trivial: you just need to list the commits to be backported:

$ git stree push mylib_repo master~3 master~1
  5e28a71 [To backport] Support for creating debug symbols
  5b0aa4b [To backport] Timestamping (requires application tweaks)
  STree 'mylib_repo' successfully backported local changes to its remote

In fact, this tool uses internally the same technique, creating and using a backport-specific local branch for the subproject.

The Git submodules solution: repository inside repository

The subtrees method of importing the code (and possibly also history) of a subproject into the superproject has its disadvantages. In many cases, the subproject and the container are two different projects: your application depends on the library, but it is obvious that they are separate entities. Joining the histories of the two doesn't look like the best solution.

Additionally, the embedded code and imported history of a subproject are always there. Therefore, the subtrees technique is not a good fit for optional dependencies and components (such as plugins or themes). It also doesn't allow you to have different access control for the subproject's history, with the possible exception of restricting write access to the subproject (actually to the subdirectory of a subproject), by using Git repository management solutions such as gitolite (you can find more in Chapter 11, Git Administration).

The submodule solution is to keep the subproject code and history in its own repository and to embed this repository inside the working area of a superproject, but not to add its files as superproject files.

Gitlinks, .git files, and the git submodule command

Git includes the command named git submodule, which is intended to work with submodules. Unfortunately, using this tool is not easy. To utilize it correctly, you need to understand at least some of the details of its operation. It is a combination of two distinct features: the so-called gitlinks and the git submodule tool itself.

Both in the subtree solution and the submodule solution, subprojects need to be contained in their own folder inside the working directory of the superproject. But while with subtrees the code of the subproject belongs to superproject repository, it is not the case for submodules. With submodules, each subproject has instead its own repository somewhere inside the working directory of its container repository. The code of the submodule belongs to its repository, and the superproject itself simply stores meta-information required to get appropriate revision of the subproject files.

In practice, in modern Git, submodules use a simple .git file with a single gitdir: line containing a relative path to the actual repository folder. The submodule repository is actually located inside the superproject's .git/modules folder (and has core.worktree set up appropriately). This is done mostly to handle the case when the superproject has branches that don't have the submodule at all. It avoids having to scrap the submodule's repository when switching to a superproject revision without it.

Note

You can think of the .git file with gitdir: line as a symbolic reference equivalent for the .git directories, an OS-independent symbolic link replacement. The path to the repository doesn't need to be a relative path.

$ ls -aloF plugins/demo/
total 10
drwxr-xr-x 1 user  0 Jul 13 01:26 ./
drwxr-xr-x 1 user  0 Jul 13 01:26 ../
-rw-r--r-- 1 user 32 Jul 13 01:26 .git
-rw-r--r-- 1 user  9 Jul 13 01:26 README
[…]
$ cat plugins/demo/.git
gitdir: ../../.git/modules/plugins/demo

Be that as it may, the containing superproject and the submodule truly act as, and in fact are, independent repositories: they have their own history, their own staging area, and their own current branch. You should, therefore, take care while typing commands, minding if you're inside the submodule or outside it, because the context and impact of your commands differ drastically!

The main idea behind submodules is that the superproject commit remembers the exact revision of the subproject; this reference uses the SHA-1 identifier of the subproject commit. Instead of using a manifest-like file as some dependency management tools do, the submodules solution stores this information in a tree object using the so-called gitlinks. A gitlink is a reference from a tree object (in the superproject repository) to a commit object (usually, in the submodule repository); see Fig 3.
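To see these pieces on disk, here is a minimal sketch that builds a throwaway superproject with one submodule and inspects it. The repository names (demo-plugin, super), the temporary directory, the demo identity, and the g helper are assumptions made for this illustration only; the protocol.file.allow override is needed just because the "remote" here is a local path, which recent Git versions refuse for submodules by default.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

# Stand-in for the plugin's upstream repository
git init -q demo-plugin
echo "Demo plugin" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Initial plugin commit"

# Superproject embedding the plugin as a submodule
git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git commit -q -m "Add demo plugin as a submodule"

# Inside the submodule, .git is a plain file pointing at the real repository
gitfile=$(cat plugins/demo/.git)
echo "$gitfile"

# Each repository has its own HEAD and its own history
super_head=$(git rev-parse HEAD)
sub_head=$(git -C plugins/demo rev-parse HEAD)
echo "superproject HEAD: $super_head"
echo "submodule HEAD:    $sub_head"
```

Running the same git rev-parse HEAD inside and outside plugins/demo gives two unrelated commits, which is a quick way to check which repository a command is really addressing.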

Fig 3: The history of a superproject with a subproject linked as a submodule. The faint shade of the submodule files on the left-hand side denotes that they are present as files in the working directory of the superproject, but are not in the superproject repository themselves.

Recall that, following the description of the types of objects in the repository database from Chapter 8, Keeping History Clean, each commit object (representing a revision of a project) points exactly to one tree object with the snapshot of the repository contents. Each tree object references blobs and trees, representing file contents and directory contents, respectively. The tree object referenced by the commit object uniquely identifies the set of files contents, file names, and file permissions contained in a revision associated with the commit object.

Let's remember that the commit objects themselves are connected with each other, creating the Directed Acyclic Graph (DAG) of revisions. Each commit object references zero or more parent commits, which together describe the history of a project.

Each type of reference mentioned earlier takes part in the reachability check. If the object pointed to is missing, it means that the repository is corrupt.

It is not so for gitlinks. Entries in the tree object pointing to commits refer to objects in another, separate repository, namely the subproject (submodule) repository. The fact that an unreachable submodule commit is not an error is what allows us to include submodules optionally: with no submodule repository present, the commit referenced by the gitlink is simply not there.

The results of running git ls-tree --abbrev HEAD on a project with all the types of objects is as follows:

040000 tree 573f464    docs
100755 blob f27adc2    executable.sh
100644 blob 1083735    README.txt
040000 tree ef9bcb4    subdirectory
160000 commit 5b0aa4b   submodule
120000 blob 3295d66    symlink

Compare it with the contents of the working area (with ls -l -o -F):

drwxr-xr-x   5 user    12288 06-28 17:18 docs/
-rwxr-xr-x   1 user    36983 02-20 20:11 executable.sh*
-rw-r--r--   1 user     2628 2015-01-03  README.txt
drwxr-xr-x   3 user     4096 06-28 17:19 subdirectory/
drwxr-xr-x  48 user    36864 06-28 17:19 submodule/
lrwxrwxrwx   1 user       32 06-28 17:18 symlink -> docs/toc.html
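The following sketch makes the special status of a gitlink visible: the tree entry has mode 160000 and type commit, and the commit it names can be found in the submodule's object database but not in the superproject's. As before, the repository names, the temporary directory, the demo identity, and the g helper are assumptions for the sake of a self-contained example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

# Stand-in for the plugin's upstream repository
git init -q demo-plugin
echo "Demo plugin" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Initial plugin commit"

# Superproject embedding the plugin as a submodule
git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git commit -q -m "Add demo plugin as a submodule"

# The gitlink as stored in the superproject's tree
entry=$(git ls-tree HEAD plugins/demo)
echo "$entry"
mode=${entry%% *}
sha=$(git rev-parse HEAD:plugins/demo)

# The referenced commit lives in the submodule's object database...
type_in_sub=$(git -C plugins/demo cat-file -t "$sha")
# ...and is absent from the superproject's own object database
if git cat-file -e "$sha" 2>/dev/null; then in_super=yes; else in_super=no; fi
echo "mode=$mode type=$type_in_sub present-in-superproject=$in_super"
```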

Adding a subproject as a submodule

With subtrees, the first step was usually to add a subproject repository as a remote, which meant that objects from the subproject repository were fetched into the superproject object database.

With submodules, the subproject repository is kept separate. You could manage cloning the subproject repository manually from inside the superproject worktree and then add the gitlink also by hand with git add <submodule directory> (without a trailing slash).

Note

Important note!

Normally, the commands git add subdir and git add subdir/ (the latter with a forward slash, which following the POSIX standard denotes a subdirectory) are equivalent. This is not true if you want to create a gitlink! If subdir is the top directory of an embedded Git repository of a subproject, the former would create a gitlink reference, while the latter, in the form of git add subdir/, would add all the files in subdir individually, which is probably not what you expect.

A simpler and better solution is to use the git submodule command, which was created to help manage the filesystem contents, the metadata, and the configuration of your submodules, as well as inspect their status and update them. To add the given repository as a submodule at a specific directory in the superproject, use the add subcommand of the git submodule:

$ git submodule add https://git.example.com/demo-plugin.git \
  plugins/demo
Cloning into 'plugins/demo'...
done.

Note

Note:

While using paths instead of URLs for remotes, you need to remember that the relative paths for remotes are interpreted relative to our main remote, not to the root directory of our repository.

This command stores the information about the submodule, for example the URL of the repository, in the .gitmodules file. It creates this file if it does not exist:

[submodule "plugins/demo"]
        path = plugins/demo
        url = https://git.example.com/demo-plugin.git

Note that a submodule gets a name equal to its path. You can set the name explicitly with the --name option (or by editing the configuration); git mv on a submodule directory will change the submodule path but keep the same name.

Tip

Reuse of authentication while fetching submodules

While storing the URL of a remote repository, it is often acceptable and useful to store the username with the subproject information (for example, storing the username in a URL, like <username>@<host>:mylib.git).

However, remembering the username as a part of the URL is undesirable in .gitmodules, because this file must be visible to other developers (who often use different usernames for authentication). Fortunately, the commands that descend into submodules can reuse the authentication from cloning (or fetching) a superproject.

The add subcommand also runs an equivalent of git submodule init for you, assuming that if you have added a submodule, you are interested in it. This adds some submodule-specific settings to the local configuration of the master project:

[submodule "plugins/demo"]
        url = https://git.example.com/demo-plugin.git

Why the duplication? Why store the same information in .gitmodules and in .git/config? Well, because while the .gitmodules file is meant for all developers, we can fit our local configuration to specific local circumstances. The other reason for using two different files is that while the presence of the submodule information in .gitmodules means only that the subproject is available, having it also in .git/config implies that we are interested in a given submodule (and that we want it to be present).

You can create and edit the .gitmodules file by hand or with git config -f .gitmodules. This is useful if, for example, you have added a submodule by hand by cloning it, but want to use git submodule from now on.
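As a small sketch of the second option, the commands below write and read a .gitmodules file with git config -f. They run in a scratch directory (an assumption for this example) and reuse the plugin path and URL shown earlier.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Write the two settings that describe a submodule
git config -f .gitmodules submodule.plugins/demo.path plugins/demo
git config -f .gitmodules submodule.plugins/demo.url \
    https://git.example.com/demo-plugin.git

# The result is an ordinary Git-style configuration file
cat .gitmodules

# Individual values can be read back the same way
url=$(git config -f .gitmodules --get submodule.plugins/demo.url)
path=$(git config -f .gitmodules --get submodule.plugins/demo.path)
echo "$path -> $url"
```

Using git config rather than a text editor keeps the quoting and indentation of the file consistent with what git submodule add itself writes.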

This file is usually committed to the superproject repository (similar to .gitignore and .gitattributes files), where it serves as the list of possible subprojects.

Note

All the other subcommands require this file to be present; for example, if we ran git submodule update before adding it, we would get:

$ git submodule update
No submodule mapping found in .gitmodules for path 'plugins/demo'

That's why git submodule add stages both the .gitmodules file and the submodule itself:

$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   .gitmodules
        new file:   plugins/demo

Note that the whole submodule, which is a directory, looks to git status like a new file. By default, most Git commands are limited to the active container repository only, and do not descend into the nested repositories of the submodules. As we will see, this is configurable.

Cloning superprojects with submodules

One important issue is that, by default, if you clone the superproject repository, you do not get any submodules. All the submodules will be missing from the working directory of the clone; only their base directories are there. This behavior is the basis of the optionality of submodules.

We need then to tell Git that we are interested in a given submodule. This is done by calling the git submodule init command. What this command does is it copies the submodule settings from the .gitmodules file into the superproject's repository configuration, namely, .git/config, registering the submodule:

$ git submodule init plugins/demo
Submodule 'plugins/demo' (https://git.example.com/demo-plugin.git) registered for path 'plugins/demo'

The init subcommand adds the following two lines to the .git/config file:

[submodule "plugins/demo"]
        url = https://git.example.com/demo-plugin.git

This separate local configuration for the submodules you are interested in allows you also to configure your local submodules to point to a different location URL (perhaps, a per-company reference clone of a subproject's repository) than the one that is present in .gitmodules file.

This mechanism also makes it possible to provide the new URL if the repository of a subproject moved. That's why the local configuration overrides the one that is recorded in .gitmodules; otherwise you would not be able to fetch from the current URL after switching to a version from before the URL change. On the other hand, if the repository moved, and the .gitmodules file was updated accordingly, we can re-extract the new URL from .gitmodules into the local configuration with git submodule sync.
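The interplay between the two files can be sketched as follows. The mirror location is purely hypothetical (nothing is fetched from it), and the repository names, temporary directory, demo identity, and g helper are again assumptions of this example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

git init -q demo-plugin
echo "Demo plugin" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Initial plugin commit"

git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git commit -q -m "Add demo plugin as a submodule"

# Point the local configuration at a (hypothetical) company mirror
git config submodule.plugins/demo.url "$work/mirror/demo-plugin"
overridden=$(git config --get submodule.plugins/demo.url)

# The published URL in .gitmodules is not affected by the local override
published=$(git config -f .gitmodules --get submodule.plugins/demo.url)

# sync copies the URL from .gitmodules back into .git/config
git submodule --quiet sync
synced=$(git config --get submodule.plugins/demo.url)
echo "override: $overridden"
echo "after sync: $synced"
```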

We have told Git that we are interested in the given submodule. However, we have still not fetched the submodule commits from its remote and neither have we checked it out and have its files present in the working directory of the superproject. We can do this with git submodule update.

Note

In practice, while dealing with repositories that use submodules, we usually group the two commands (init and update) into one with git submodule update --init, at least if we don't need to customize the URL.

If you are interested in all the submodules, you can use git clone --recursive to automatically initialize and update each submodule right after cloning.
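The whole sequence can be tried out with the following sketch, which clones a superproject and only then populates its submodule. The names demo-plugin, super, and clone, the temporary directory, the demo identity, and the g helper are assumptions made to keep the example self-contained.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

git init -q demo-plugin
echo "Demo plugin" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Initial plugin commit"

git init -q super
(cd super &&
 g submodule --quiet add "$work/demo-plugin" plugins/demo &&
 git commit -q -m "Add demo plugin as a submodule")

# A plain clone of the superproject does not bring the submodule along
git clone -q "$work/super" "$work/clone"
cd "$work/clone"
files_before=$(ls -A plugins/demo | wc -l)   # the directory exists but is empty
status_before=$(git submodule status)        # a leading '-' marks "not initialized"
echo "$status_before"

# Register the submodule in .git/config and check out the recorded revision
g submodule --quiet update --init
git submodule status
ls plugins/demo
```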

To temporarily remove a submodule, retaining the possibility of restoring it later, you can mark it as not interesting with git submodule deinit. This just affects .git/config. To permanently remove a submodule, you need to first deinit it and then remove it from .gitmodules and from the working area (with git rm).

Updating submodules after superproject changes

To update the submodule so that the working directory contents reflect the state of a submodule in the current version of superproject, you need to perform git submodule update. This command updates the files of the subproject or, if necessary, clones the initial submodule repository:

$ rm -rf plugins/demo   # clean start for this example
$ git submodule update
Submodule path 'plugins/demo': checked out '5e28a713d8e87…'

The git submodule update command goes to the repository referenced by .git/config, fetches the commit whose ID is recorded for the submodule in the index (git ls-files --stage -- plugins/demo), and checks out this version into the directory given by .gitmodules. You can, of course, specify the submodule you want to update, giving the path to the submodule as a parameter.

Because we are here checking out the revision given by the gitlink, and not by a branch, git submodule update detaches the subproject's HEAD (see Fig 3). This command rewinds the subproject straight to the version recorded in the superproject.

There are a few more things that you need to know:

  • If you are changing the current revision of a superproject in any way, either by changing a branch, by importing a branch with git pull, or by rewinding the history with git reset, you need to run git submodule update to bring the submodules to the matching content. This is not done automatically, because it could potentially lose your work in a submodule.
  • Conversely, if you switch to another branch, or otherwise change the current revision in a superproject, and do not run git submodule update, Git will consider that you changed your submodule directory deliberately to point to a new commit (while it is really an old commit that you used before, but forgot to update). If, in this situation, you run git commit -a, then you will accidentally change the gitlink, leading to an incorrect version of a submodule being stored in the superproject history.
  • You can upgrade the gitlink reference simply by fetching (or switching to) the version of a submodule you want to have by using ordinary Git commands inside the subproject, and then committing this version in the superproject. You don't need to use the git submodule command here.
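The points above can be reproduced with a small sketch. It pins a plugin at two different revisions using only ordinary Git commands, then moves the superproject back by one commit and shows that the submodule stays behind until git submodule update is run. The repository names, the v1/v2 commits, the temporary directory, the demo identity, and the g helper are all assumptions of this example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

# Plugin upstream with two revisions
git init -q demo-plugin
echo "v1" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Plugin v1"
echo "v2" > demo-plugin/README
git -C demo-plugin commit -q -a -m "Plugin v2"
v1=$(git -C demo-plugin rev-parse HEAD~1)
v2=$(git -C demo-plugin rev-parse HEAD)

# Superproject: first pin v1, then upgrade the gitlink to v2
git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git -C plugins/demo checkout -q "$v1"
git add plugins/demo
git commit -q -m "Pin plugin v1"
git -C plugins/demo checkout -q "$v2"
git add plugins/demo
git commit -q -m "Upgrade plugin to v2"

# Move the superproject back by one commit: it now expects v1...
git checkout -q HEAD~1
expected=$(git rev-parse HEAD:plugins/demo)
# ...but the submodule working directory is still at v2
stale=$(git -C plugins/demo rev-parse HEAD)
git status --short

# Bring the submodule in line with the gitlink
g submodule --quiet update
fixed=$(git -C plugins/demo rev-parse HEAD)
```

Between the git checkout and the git submodule update, a careless git commit -a would have recorded v2 again, which is exactly the accident described in the second point.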

You can have Git automatically fetch the initialized submodules while pulling the updates from the master project's remote repository. This behavior can be configured using fetch.recurseSubmodules (or submodule.<name>.fetchRecurseSubmodules). The default value for this configuration is on-demand (to fetch if the gitlink changes, and the submodule commit it points to is missing). You can set it to yes or no to turn recursively fetching submodules on or off unconditionally. The corresponding command-line option is --recurse-submodules.

It is however critical to remember that even though Git can automatically fetch submodules, it does not auto-update. Your local clone of the submodule repository is up to date with the submodule's remote, but the submodule's working directory is stuck to its former contents. If you don't explicitly update the submodule's working directory, the next commit in the container repository will regress the submodule. Currently, there are no configuration settings or command-line options that can autoupdate all the auto-fetched submodules on pull. Well, there were no such options at the time of this writing, but hopefully the management of submodules in Git will improve.

Note that instead of checking out the gitlinked revision on detached HEAD, we can merge the commit recorded in the superproject into the current branch in the submodule with --merge, or rebase the current branch on top of the gitlink with --rebase, just like with git pull. The submodule repository branch used defaults to master, but the branch name may be overridden by setting the submodule.<name>.branch option in either .gitmodules or .git/config, the latter taking precedence.

As you can see, using gitlinks and the git submodule command is quite complicated. Fundamentally, the concept of gitlink might fit well to the relationship between subprojects and your superproject, but using this information correctly is harder than you think. On the other hand, it gives great flexibility and power.

Examining changes in a submodule

By default, the status, logs, and diff output is based solely on the state of the active repository, and does not descend into submodules. This is often problematic; you would need to remember to run git submodule summary. It is easy to miss a regression if you are limited to this view: you can see that the submodule has changed, but you can't see how.

You can, however, set up Git to make it use a submodule-aware status with the status.submoduleSummary configuration variable. If it is set to a nonzero number, this number will provide the --summary-limit restriction; a value of true or -1 will mean an unlimited number.

After setting this configuration, you would get something like the following:

$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   .gitmodules
        new file:   plugins/demo

Submodule changes to be committed:

* plugins/demo 0000000...5e28a71 (3):
  > Fix repository name in a README file

The status extends the always present information that the submodule changed (new file: plugins/demo), adding the information that the submodule present at plugins/demo got three new commits, and showing the summary for the last one (Fix repository name in a README file). The right pointing angle bracket > preceding the summary line means that the commit was added, that is, present in the working area but not (yet) in the superproject commit.

Note

Actually, this added part is just the git submodule summary output.

For the submodule in question, a series of commits in the submodule between the submodule version in the given superproject's commit and the submodule version in the index or the working tree (the former shown by using --cached) are listed. There is also git submodule status for short information about each module.
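A minimal sketch of enabling the submodule-aware status in a scratch superproject follows. The repository names, the temporary directory, the demo identity, and the g helper are assumptions of this example; LC_ALL=C is set only to keep the messages in English.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

git init -q demo-plugin
echo "Demo plugin" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Initial plugin commit"

git init -q super && cd super
echo "Superproject" > README
git add README
git commit -q -m "Initial superproject commit"

# Stage a new submodule, but do not commit it yet
g submodule --quiet add "$work/demo-plugin" plugins/demo

# Ask git status to append the submodule summary
git config status.submoduleSummary true
status_out=$(LC_ALL=C git status)
printf '%s\n' "$status_out"

# The appended part can also be requested directly
LC_ALL=C git submodule summary --cached
```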

The git diff command's default output also doesn't tell much about the change in the submodule, just that it is different:

$ git diff HEAD -- plugins/demo
diff --git a/plugins/demo b/plugins/demo
new file mode 160000
index 0000000..5e28a71
--- /dev/null
+++ b/plugins/demo
@@ -0,0 +1 @@
+Subproject commit 5e28a713d8e875f2cf1060c2580886dec3e5b04c

Fortunately, there is the --submodule=log command-line option (that you can enable by default with the diff.submodule configuration setting) that lets us see something more useful:

$ git diff HEAD --submodule=log -- plugins/demo
Submodule plugins/demo 0000000...5e28a71 (new submodule)

Instead of log, we can use the short format, which shows just the names of the commits at the beginning and the end of the range. This is the format used when the --submodule option is omitted altogether; a bare git diff --submodule is equivalent to --submodule=log.
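To compare the two formats on a submodule that moves from one revision to another, consider the following sketch. The repository names, the v1/v2 commits, the temporary directory, the demo identity, and the g helper are assumptions made for this example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow cloning a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

# Plugin upstream with two revisions
git init -q demo-plugin
echo "v1" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Plugin v1"
echo "v2" > demo-plugin/README
git -C demo-plugin commit -q -a -m "Plugin v2"
v1=$(git -C demo-plugin rev-parse HEAD~1)
v2=$(git -C demo-plugin rev-parse HEAD)

# Superproject history: pin v1, then upgrade to v2
git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git -C plugins/demo checkout -q "$v1"
git add plugins/demo
git commit -q -m "Pin plugin v1"
git -C plugins/demo checkout -q "$v2"
git add plugins/demo
git commit -q -m "Upgrade plugin to v2"

# short: only the old and the new commit names
short_diff=$(git diff --submodule=short HEAD~1 HEAD -- plugins/demo)
printf '%s\n' "$short_diff"

# log: the commits that were added to (or removed from) the submodule
log_diff=$(git diff --submodule=log HEAD~1 HEAD -- plugins/demo)
printf '%s\n' "$log_diff"
```

To make the log format permanent for a repository, you can set it once with git config diff.submodule log.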

Getting updates from the upstream of the submodule

To remind you, the submodule commits are referenced in gitlinks using the SHA-1 identifier, which always resolves to the same revision; it is not a volatile (inconstant) reference such as a branch name. Because of this, a submodule in a superproject does not automatically upgrade (which could possibly break the application). But sometimes you may want to update it.

Let's assume that the subproject repository got new revisions published and we want, for our superproject, to update to the new version of a submodule.

To achieve this, we need to update the local repository of a submodule, move the version we want to the working directory of the superproject, and finally commit the submodule change in the superproject.

We can do this manually, starting by changing the current directory to be inside the working directory of the submodule. Then, inside the submodule, we perform git fetch to get the data to the local clone of the repository (in .git/modules/ in the superproject). After verifying what we have with git log, we can then update the working directory. If there are no local changes, you can simply check out the desired revision. Finally, you need to create a commit in the superproject.

In addition to the finer-grained control, this approach has the added benefit of working regardless of your current state (whether you are on an active branch or on a detached HEAD).

Another way to go about this would be, working from the container repository, to explicitly upgrade the submodule to its tracked remote branch with git submodule update --remote. Similarly to the ordinary update command, you can choose to merge or rebase instead of checking out a branch; you can configure the default way of updating with the submodule.<name>.update configuration variable, and the default upstream branch with submodule.<name>.branch.

Note

In short, submodule update --remote --merge will merge upstream's subproject changes into the submodule, while submodule update --merge will merge the superproject gitlink changes into the submodule.

The git submodule update --remote command would fetch new changes from the submodule remote site automatically, unless told not to with --no-fetch.
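Here is a hedged sketch of the second approach: upstream publishes a new revision, and the superproject picks it up with git submodule update --remote and then records the new gitlink. The repository names, the temporary directory, the demo identity, and the g helper are assumptions; the branch name is read from the stand-in upstream rather than hard-coded, and is stored in .gitmodules as described above.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow fetching a submodule from a local path
g() { git -c protocol.file.allow=always "$@"; }

git init -q demo-plugin
echo "v1" > demo-plugin/README
git -C demo-plugin add README
git -C demo-plugin commit -q -m "Plugin v1"

git init -q super && cd super
g submodule --quiet add "$work/demo-plugin" plugins/demo
git commit -q -m "Add demo plugin as a submodule"

# Upstream publishes a new revision
echo "v2" >> "$work/demo-plugin/README"
git -C "$work/demo-plugin" commit -q -a -m "Plugin v2"
tip=$(git -C "$work/demo-plugin" rev-parse HEAD)
branch=$(git -C "$work/demo-plugin" symbolic-ref --short HEAD)

# Record which upstream branch the submodule should follow
git config -f .gitmodules submodule.plugins/demo.branch "$branch"

# Fetch in the submodule and check out the tip of that branch
g submodule --quiet update --remote plugins/demo
updated=$(git -C plugins/demo rev-parse HEAD)

# The superproject still has to commit the new gitlink
git add .gitmodules plugins/demo
git commit -q -m "Upgrade demo plugin to upstream tip"
pinned=$(git rev-parse HEAD:plugins/demo)
```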

Sending submodule changes upstream

One of the major dangers in making changes live directly in a submodule (and not via its standalone repository) is forgetting to push the submodule. A good practice for submodules is to commit changes to the submodule first, push the module changes, and only then get back to the container project, commit it, and push the container changes.

If you only push to the superproject repository, forgetting about the submodule push, then other developers would get an error while trying to get the updates. Though Git does not complain while fetching the superproject, you would see the problem in the git submodule summary output (and in the git status output, if properly configured) and while trying to update the working area:

$ git submodule summary
* plugins/demo 12e3a52...0e90143:
  Warn: plugins/demo doesn't contain commit 12e3a529698c519b2fab790…
$ git submodule update
fatal: reference is not a tree: 12e3a529698c519b2fab790…
Unable to checkout '12e3a529698c519b2fab790…' in submodule path 'plugins/demo'

You can plainly see how important it is to remember to push the submodule. You can ask Git to automatically push the submodules while pushing the superproject, if it is necessary, with git push --recurse-submodules=on-demand (the other option, check, only verifies that the submodule commits have been pushed, and aborts the push if they have not). With Git 2.7.0 or later you can also use the push.recurseSubmodules configuration option.
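The safety net provided by the check mode can be sketched as follows. Two bare repositories stand in for the public remotes of the plugin and of the superproject; their names, the temporary directory, the demo identity, and the g helper are assumptions of this example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Throwaway identity so that commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
# Hypothetical helper: allow submodule transfers over local paths
g() { git -c protocol.file.allow=always "$@"; }

# Bare repositories standing in for the two public remotes
git init -q --bare plugin.git
git init -q --bare super.git

# Seed the plugin remote with one commit
git clone -q plugin.git seed 2>/dev/null
echo "Demo plugin" > seed/README
git -C seed add README
git -C seed commit -q -m "Initial plugin commit"
git -C seed push -q origin HEAD

# Superproject clone with the plugin as a submodule
git clone -q super.git super 2>/dev/null
cd super
g submodule --quiet add "$work/plugin.git" plugins/demo
git commit -q -m "Add demo plugin"

# Commit inside the submodule and record it in the superproject,
# but "forget" to push the submodule
echo "local fix" >> plugins/demo/README
git -C plugins/demo commit -q -a -m "Local fix"
git add plugins/demo
git commit -q -m "Use fixed plugin"

# check refuses to push a superproject that references unpushed submodule commits
if g push -q --recurse-submodules=check origin HEAD 2>/dev/null
then first=pushed; else first=refused; fi

# Push the submodule first, then retry
git -C plugins/demo push -q origin HEAD
if g push -q --recurse-submodules=check origin HEAD 2>/dev/null
then second=pushed; else second=refused; fi
echo "first attempt: $first, second attempt: $second"
```

With on-demand in place of check, Git would attempt the submodule push itself instead of refusing.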

Transforming a subfolder into a subtree or submodule

The first issue that comes to mind while thinking of the use cases of subprojects in Git is about having source code of the base project be ready for such division. Submodules and subtrees are always expressed as subdirectories of the superproject (the master project). You can't mix files from different subsystems in one directory.

Experience shows that most systems use such a directory hierarchy, even in monolithic repositories, which is a good beginning for modularization efforts. Therefore, transforming a subfolder into a real submodule/subtree is fairly easy and can be done in the following sequence of steps:

  1. Move the subdirectory in question outside the working area of a superproject to have it beside the top directory of superproject. If it is important to keep the history of a subproject, consider using git filter-branch --subdirectory-filter or its equivalent, perhaps together with tools such as reposurgeon to clean up the history. See Chapter 8, Keeping History Clean for more details.
  2. Rename the directory with the subproject repository to better express the essence of the extracted component. For example, a subdirectory originally named refresh could be renamed to refresh-client-app-plugin.
  3. Create the public repository (upstream) for the subproject, as a first class project (for example, create a new project on GitHub to keep extracted code, either under the same organization as a superproject, or under a specialized organization for application plugins).
  4. Initialize the now self-sufficient, standalone plugin as a Git repository with git init. If in step 1 you have extracted the history of the subdirectory into some branch, then push this branch into the just created repository. Set up the public repository created in step 3 as a default remote repository and push the initial commit (or the whole history) to the just created URL to store the subproject code.
  5. In the superproject, re-add the subproject you have just extracted, this time as a proper submodule or subtree, whichever solution is a better fit and whichever method you prefer to use. Use the URL of the just created public repository for the subproject.
  6. Commit the changes in the superproject and push them to its public repository, in the case of submodules including the newly created (or the just modified) .gitmodules file.

The recommended practice for the transformation of a subdirectory into a standalone submodule is to use a read-only URL for cloning (adding back) a submodule. This means that you can use either the git:// protocol (warning: in this case the server is unauthenticated) or https:// without a username. The goal of this recommendation is to enforce separation by moving the work on a submodule code to a standalone separate subproject repository. In order to ensure that the submodule commits are available to all other developers, every change should go through the public repository for a subproject.

If this recommendation (best practice) is met with a categorical refusal, in practice you could work on the subproject source code directly inside the superproject, though it is more error prone. You would need to remember to commit and push in the submodule first, doing it from inside of the nested submodule subdirectory; otherwise other developers would not be able to get the changes. This combined approach might be simpler to use, but it loses the true separation between implementing and consuming changes, which is better assumed while using submodules.

Subtrees versus submodules

In general, subtrees are easier to use and less tricky. Many people go with submodules, because of the better built-in tooling (they have their own Git command, namely git submodule), detailed documentation, and similarity to the Subversion externals, making them feel falsely familiar. Adding a submodule is very simple (just run git submodule add), especially compared to adding a subtree without the help of third-party tools such as git subtree or git stree.

The major difference between subtrees and submodules is that, with subtrees, there's only one repository, which means just one lifecycle. Submodules and similar solutions use nested repositories, each with its own lifeline.

Though submodules are easy to set up and fairly flexible, they are also fraught with peril, and you need to practice vigilance while working with them. The fact that the submodules are opt-in also means that the changes touching the submodules demand a manual update by every collaborator. Subtrees are always there, so getting the superproject's changes means getting the subproject's too.

Commands such as status, diff, and log display precious little information about submodules, unless properly configured to cross the repository boundary; it is easy to miss a change. With subtrees, status works normally, while diff and log need some care, because the subproject commits have a different root directory. The latter applies only if you chose to include the subproject history (that is, if you did not squash the subtree merges). Otherwise, the problem is only with the remote-tracking branches in the subproject's repository, if any.

Because the lifecycles of different repositories are separate, updating a submodule inside its containing project requires two commits and two pushes. Updating a subtree-merged subproject is very simple: only one commit and one push. On the other hand, publishing the subproject changes upstream is much easier with submodules, while it requires changeset extraction with subtrees (here tools such as git subtree help a lot).

The next major issue, and a source of problems, is that the submodule has two sources of the current revision: the gitlink in the superproject and the branches in the submodule's clone of the repository. This means that git submodule update works a bit like a sideways push into a nonbare repository (see Chapter 6, Advanced Branching Techniques). Submodule heads are therefore generally detached, so any local update requires various preparatory actions to avoid creating a lost commit. There is no such issue with subtrees. All the revision changing commands work as usual with subtrees, bringing the subproject directory to the correct version without the requirement of any additional action. Getting changes from the subproject repository is just a subtree merge away. The only difference from an ordinary pull is the -s subtree option.

Still, sometimes submodules are the right choice. Compared to subtrees, they allow for a subproject (a module) to be not fetched, which is helpful when your code base is massive. Submodules are also useful when the heavy modularization is not natively handled, or not well natively handled, by the development stack's ecosystem.

Submodules might also themselves be superprojects for other submodules, creating a hierarchy of subprojects. Using nested submodules is made easier thanks to git submodule status, update, foreach, and sync subcommands all supporting the --recursive switch.

Use cases for subtrees

With subtrees, there is only one repository, no nested repositories, just like a regular codebase. This means that there is just one lifecycle. One of the key benefits of subtrees is being able to mix container-specific customizations with general purpose fixes and enhancements.

Projects can be organized and grouped together in whatever way you find to be most logically consistent. Using a single repository also reduces the overhead from managing dependencies.

The basic example of using subtrees is managing the customized version of a library, a required dependency. It is easy to get a development environment set up to run builds and tests. A monorepo also makes it viable to have one universal version number for all the projects. Atomic cross-subproject commits are possible; therefore, a repository can always be in a consistent state.

You can also use subtrees for embedding related projects, such as a GUI or a web interface, inside a superproject. In fact, many use cases for submodules can also apply to the subtrees solution, with an exception of the cases where there is a need for a subproject to be optional, or to have different access permissions than a master project. In those cases you need to use submodules.

Use cases for submodules

The strongest argument for the use of submodules is the issue of modularization. Here, the main area of use for submodules is handling plugins and extensions. Some programming ecosystems, such as ANSI C and C++, and also Objectve-C, lack good and standard support for managing version-locked multimodule projects. In this case, a plugin-like code can be included in the application (superproject) using submodules, without sacrificing the ability to easily update to the latest version of a plugin from its repository. The traditional solution of putting instructions about how to copy plugins in the README, disconnects it from the historical metadata.

This scheme can also be extended to noncompiled code, such as Emacs Lisp settings, configuration in dotfiles (including frameworks such as oh-my-zsh), and themes (also for web applications). In these situations, what is usually needed to use a component is the physical presence of the module code at conventional locations inside the master project tree, which are mandated by the technology or framework being used. For instance, themes and plugins for WordPress, Magento, and so on are often de facto installed this way. In many cases, you need to be in a superproject to test these optional components.

Yet another particular use case for submodules is dividing a complex application based on access control and visibility restrictions. For example, the project might use cryptographic code with license restrictions, limiting access to it to a small subset of developers. With this code in a submodule with restricted access to its repository, other developers would simply be unable to clone this submodule. In this solution, the common build system needs to be able to skip the cryptographic component if it is not available. On the other hand, the dedicated build server can be configured in such a way that the client gets the application built with crypto enabled.
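A build script could detect whether the restricted submodule is available roughly as follows. This is a minimal sketch under stated assumptions: the crypto/ path, the function names, and the --with-crypto and --without-crypto flags are hypothetical and stand in for whatever your build system expects. The check relies on the fact that a checked-out submodule has a .git entry in its working directory.

```shell
#!/bin/sh
# Sketch only: the path, function names, and flags below are hypothetical.

# Location of the optional, access-restricted submodule.
CRYPTO_DIR=${CRYPTO_DIR:-crypto}

# Try to fetch the optional submodule. Failure (for example, no access to
# its repository) is not fatal; credential prompts are disabled so that an
# unattended build does not hang waiting for a password.
fetch_optional() {
    GIT_TERMINAL_PROMPT=0 git submodule update --init -- "$1" \
        >/dev/null 2>&1 || true
}

# A populated submodule has a .git file (or directory) at its top level.
submodule_present() {
    [ -e "$1/.git" ]
}

# Print the build flag matching the availability of the component.
configure_flags() {
    if submodule_present "$1"; then
        echo "--with-crypto"
    else
        echo "--without-crypto"
    fi
}

fetch_optional "$CRYPTO_DIR"
echo "configure flags: $(configure_flags "$CRYPTO_DIR")"
```

Developers without access get a build with the component skipped, while a build server with the right credentials picks it up automatically.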

A similar visibility restriction, but in reverse, is making the source code of examples available long before a book is published. This allows for better code thanks to the social input. The main repository for the book itself can be closed (private), but having the examples/ directory contain a submodule intended for the sample source code allows you to make this subrepository public. While generating the book in the PDF and EPUB (and perhaps also MOBI) formats, the build process can then embed these examples (or fragments of them), as if they were in an ordinary subdirectory.

Third-party subproject management solutions

If you don't find a good fit in either git subtree or git submodule, you can try to use one of the many third-party projects to manage dependencies, subprojects, or collections of repositories. One such tool is the externals (or ext) project by Miles Georgi. You can find it at http://nopugs.com/ext-tutorial. This project is VCS-agnostic, and can be used to manage any combination of version control systems used by subprojects and superprojects.

Another is the repo tool (https://android.googlesource.com/tools/repo/) used by the Android Open Source Project to unify the many Git repositories for across-network operations. You can find many other such tools.

Note

When choosing between native support and one of the many tools to manage many repositories together, you should check whether the tool in question uses a subtree-like or a submodule-like approach, to see whether it would be a good fit for your project.