While receiving a Git push, GitLab can reject pushes using the
pre-receive Git hook. Git has a special "object quarantine"
mechanism that allows it to eagerly delete rejected Git objects.
In this document we will explain how Git object quarantine works, and how GitLab is able to see quarantined objects.
Git object quarantine was introduced in Git 2.11.0 via https://gitlab.com/gitlab-org/git/-/commit/25ab004c53cdcfea485e5bf437aeaa74df47196d. To understand what it does we need to know how Git receives pushes on the server.
On a Git server, a push goes into git receive-pack. This process does the following things:
- receive the Git objects pushed by the client and write them to disk
- receive the ref update commands from the client and keep them in memory
- check connectivity (no missing objects)
- run pre-receive and feed it the intended ref update commands
- if pre-receive rejects the push, clean up and stop
- apply ref update commands one by one. For each command, run the update hook which can reject the ref update.
- after all ref updates have been applied run the post-receive hook
- report success to the client and end the session
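The hook ordering in the list above can be sketched as a small model. This is an illustration only, not Git's implementation: the function `receive_pack`, the `RefUpdate` tuple, and the callback signatures are hypothetical, and object transfer and connectivity checks are left out.

```python
from typing import Callable, List, Tuple

# One ref update command: (old object ID, new object ID, ref name).
RefUpdate = Tuple[str, str, str]


def receive_pack(
    commands: List[RefUpdate],
    pre_receive: Callable[[List[RefUpdate]], bool],
    update: Callable[[RefUpdate], bool],
    post_receive: Callable[[List[RefUpdate]], None],
) -> List[Tuple[str, bool]]:
    """Hypothetical model of the hook ordering only; no objects are handled."""
    # pre-receive sees all commands at once and can reject the whole push.
    if not pre_receive(commands):
        return [(ref, False) for _, _, ref in commands]

    results = []
    applied = []
    for command in commands:
        # update runs once per ref and can reject just that ref.
        accepted = update(command)
        results.append((command[2], accepted))
        if accepted:
            applied.append(command)

    # post-receive runs after the ref updates; in this model it is handed
    # only the updates that were applied, and it cannot reject anything.
    post_receive(applied)
    return results
```

The point of the model is the asymmetry: pre-receive is all-or-nothing, while update decides per ref.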
Object quarantine exists for the sake of the cleanup that happens when
pre-receive rejects the push (step 5 above). It changes the timing of
the cleanup. Without object quarantine, objects that were part of a
rejected push would sit around until git gc judged them to be both
unused and "old". How long that takes depends on how often git gc (or
git prune) runs, and on the configuration of when objects count as
"old". Because of object quarantine, rejected objects can be deleted
immediately: Git can just rm -rf the quarantine directory and
they're gone.
The Git implementation of this mechanism rests on two things.
The objects in a Git repository can be stored across multiple
directories: 1 main directory, usually the objects directory inside
the repository, and 0 or more alternate directories. Together these
act like a search path: when looking for an object Git first checks
the main directory, then each alternate, until it finds the object.
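As a rough illustration of that search-path behavior, the sketch below looks up a loose object across a main directory and its alternates. It is a simplification: the function names are hypothetical, and it only considers loose objects (stored under a two-character fan-out directory), ignoring packfiles.

```python
import os
from typing import Iterable, Optional


def loose_object_path(object_dir: str, oid: str) -> str:
    # Loose objects live at <object_dir>/<first 2 hex chars>/<remaining chars>.
    return os.path.join(object_dir, oid[:2], oid[2:])


def find_loose_object(
    oid: str, main_dir: str, alternates: Iterable[str]
) -> Optional[str]:
    """Return the path of the first matching loose object file, or None."""
    # The main directory is checked first, then each alternate in order.
    for object_dir in [main_dir, *alternates]:
        candidate = loose_object_path(object_dir, oid)
        if os.path.isfile(candidate):
            return candidate
    return None
```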
Git can inject custom config into subprocesses via environment
variables. In the case of Git object directories, these are
GIT_OBJECT_DIRECTORY (the main object directory) and
GIT_ALTERNATE_OBJECT_DIRECTORIES (a search path of :-separated
alternate object directories).
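To make the role of the two variables concrete, here is a minimal sketch that turns an environment mapping into an ordered list of object directories. The function name and the `default_main` parameter are hypothetical, and the sketch ignores details such as quoted path entries and alternates configured inside the repository itself.

```python
from typing import List, Mapping


def object_search_path(env: Mapping[str, str], default_main: str) -> List[str]:
    """Build the ordered list of object directories described by `env`."""
    main_dir = env.get("GIT_OBJECT_DIRECTORY", default_main)
    raw_alternates = env.get("GIT_ALTERNATE_OBJECT_DIRECTORIES", "")
    # Alternates are ":"-separated; drop empty entries such as a trailing ":".
    alternates = [path for path in raw_alternates.split(":") if path]
    return [main_dir, *alternates]
```

For example, with GIT_OBJECT_DIRECTORY pointing at a quarantine directory and GIT_ALTERNATE_OBJECT_DIRECTORIES containing the regular object directory, the result lists the quarantine directory first and the regular one second.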
- git receive-pack receives a push
- git receive-pack creates a quarantine directory objects/incoming-$RANDOM
- git receive-pack configures the unpack process to write objects into the quarantine directory
- git receive-pack unpacks the objects into the quarantine directory
- git receive-pack runs the pre-receive hook with special GIT_OBJECT_DIRECTORY and GIT_ALTERNATE_OBJECT_DIRECTORIES environment variables that add the quarantine directory to the search path
- If the pre-receive hook rejects the push, git receive-pack removes the quarantine directory and its contents. The push is aborted.
- If the pre-receive hook passes, git receive-pack merges the quarantine directory into the main object directory.
- git receive-pack enters the ref update transaction
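The directory lifecycle in these steps can be modeled in a few lines. This is a toy sketch under stated assumptions, not how Git does it: `receive_with_quarantine` is a hypothetical function, objects are plain files with flat names, and the pre-receive hook is a callback that receives the quarantine path.

```python
import os
import shutil
import tempfile
from typing import Callable, Dict


def receive_with_quarantine(
    main_dir: str,
    pushed_objects: Dict[str, bytes],
    pre_receive: Callable[[str], bool],
) -> bool:
    """Hypothetical model: stage pushed files in a quarantine directory first."""
    # Stand-in for objects/incoming-$RANDOM: a uniquely named subdirectory.
    quarantine = tempfile.mkdtemp(prefix="incoming-", dir=main_dir)
    for name, data in pushed_objects.items():
        with open(os.path.join(quarantine, name), "wb") as handle:
            handle.write(data)

    if not pre_receive(quarantine):
        # Rejected: one recursive delete removes every pushed object.
        shutil.rmtree(quarantine)
        return False

    # Accepted: move the objects into the main directory, then clean up.
    for name in os.listdir(quarantine):
        os.replace(os.path.join(quarantine, name), os.path.join(main_dir, name))
    os.rmdir(quarantine)
    return True
```

Because nothing is written to the main directory until the hook passes, a rejection never leaves stray objects behind for a later garbage collection run.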
Note that by the time the update hook runs, the quarantine directory
has already been merged into the main object directory so it no longer
matters. The same goes for the post-receive hook which runs even
later.
Because pre-receive has the special quarantine configuration data in
environment variables, any git process spawned by pre-receive will
inherit the quarantine config and will be able to see the objects that
are being pushed.
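A minimal sketch of what that inheritance looks like from inside a hook written in Python follows. The helper names are hypothetical. The sketch assumes the hook receives one line per ref update on standard input in the form "old new ref", and it uses git cat-file -e, which exits with status zero when the object exists.

```python
import subprocess
from typing import Iterable, List, Tuple


def parse_ref_updates(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Parse pre-receive stdin lines of the form '<old> <new> <ref>'."""
    updates = []
    for line in lines:
        line = line.strip()
        if line:
            old, new, ref = line.split(" ", 2)
            updates.append((old, new, ref))
    return updates


def object_exists(oid: str) -> bool:
    # No env= argument: the child inherits this process's environment,
    # including the quarantine variables set by git receive-pack.
    result = subprocess.run(["git", "cat-file", "-e", oid], check=False)
    return result.returncode == 0
```

A hook script could call parse_ref_updates(sys.stdin) and then object_exists(new) for each update. The lookup would find quarantined objects only because the environment is passed through unchanged.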
GitLab uses Git hooks, among other things, to implement features that
can reject Git pushes. For example, you can mark a branch as
"protected" in the GitLab web UI, and then certain types of users can
no longer push to that branch. That feature is implemented via the Git
pre-receive hook.
As mentioned above, Git object quarantine normally works more or less
automatically because git commands spawned by the pre-receive hook
inherit the special environment variables that contain the path to the
quarantine directory. In the case of GitLab's hooks we have a problem,
however, because the GitLab hooks are "dumb". All the GitLab hooks do
is take the inputs of the hook executable (the list of ref update
commands) and send them to the GitLab Rails internal API via a POST
request. The application logic that decides whether the push is
allowed resides in Rails. The hook just waits and reports back the
result of the POST API request to GitLab.
During the POST, the internal GitLab API makes Gitaly calls back into the repo to examine the objects being pushed. For example, if force pushes are not allowed, GitLab will call the IsAncestor RPC. That RPC then needs to look at a commit that is in the process of being pushed. Because that commit is in quarantine, a Git process without the quarantine configuration cannot find it, and the RPC fails.
To overcome this problem, the GitLab pre-receive hook reads the
object directory configuration from its environment and passes this
information along with the HTTP API call.
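One way to picture this step is a hook that copies the two object directory variables into the request body. The sketch below is illustrative only: `build_hook_payload` and the JSON field names "changes" and "env" are assumptions made for this example, not the actual internal API format.

```python
import json
from typing import Dict, Mapping

QUARANTINE_VARS = ("GIT_OBJECT_DIRECTORY", "GIT_ALTERNATE_OBJECT_DIRECTORIES")


def build_hook_payload(changes: str, environ: Mapping[str, str]) -> str:
    """Bundle the ref updates and quarantine variables into a JSON body."""
    # Field names ("changes", "env") are hypothetical, chosen for this sketch.
    quarantine_env: Dict[str, str] = {
        name: environ[name] for name in QUARANTINE_VARS if name in environ
    }
    return json.dumps({"changes": changes, "env": quarantine_env}, sort_keys=True)
```

Only the two object directory variables are forwarded; the rest of the hook's environment stays local to the hook process.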
On the Rails side, we then put the object directory information in
the "request store" (i.e., request-scoped thread-local storage). And
then during that Rails request, when Rails makes Gitaly requests on
this repo, we send back the quarantine information in the Gitaly
Repository struct.
And finally, inside Gitaly, when we spawn a Git process, we re-create
the environment variables that were present on the pre-receive hook,
so that we can see the quarantined objects. We do the same when we
instantiate a Gitlab::Git::Repository in gitaly-ruby.
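The idea of re-creating the variables before spawning Git can be sketched as follows. The `Repository` dataclass and its field names are stand-ins assumed for this example, and `git_env` is a hypothetical helper; this is not Gitaly's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass
class Repository:
    # Hypothetical stand-in for the repository struct; field names are assumed.
    path: str
    git_object_directory: str = ""
    git_alternate_object_directories: List[str] = field(default_factory=list)


def git_env(repo: Repository, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return the environment for a Git child process operating on `repo`."""
    env = dict(base_env)
    # Only set the variables when quarantine information was supplied.
    if repo.git_object_directory:
        env["GIT_OBJECT_DIRECTORY"] = repo.git_object_directory
    if repo.git_alternate_object_directories:
        env["GIT_ALTERNATE_OBJECT_DIRECTORIES"] = ":".join(
            repo.git_alternate_object_directories
        )
    return env
```

The resulting mapping could be passed as the env argument of subprocess.run when starting a Git command, so that the command sees the same object directories the hook saw.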
During the Gitaly migration we had to handle a complication with the
object quarantine information: Git uses absolute paths for this. These
paths get generated wherever git receive-pack runs, i.e., on the
Gitaly server. During the migration, the repositories were also
accessible via NFS at the Rails side, but at a different path. That
meant that the absolute paths supplied by Git would be invalid part of
the time.
To work around this, the GitLab pre-receive hook converts the
absolute paths from Git into relative paths, relative to the
repository directory. These relative paths then get passed around
inside GitLab. At the time Gitaly recreates the object directory
variables, it converts the paths back from relative to absolute.
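The round trip between absolute and relative paths can be illustrated with the standard library. The function names and the example mount points are hypothetical; the sketch only shows why a relative path survives a change of repository location.

```python
import os


def to_relative(absolute_path: str, repo_path: str) -> str:
    # Hook side: express the object directory relative to the repository.
    return os.path.relpath(absolute_path, start=repo_path)


def to_absolute(relative_path: str, repo_path: str) -> str:
    # Gitaly side: resolve against wherever the repository lives locally.
    return os.path.normpath(os.path.join(repo_path, relative_path))
```

For instance, a quarantine directory recorded relative to a repository mounted under one prefix can be resolved again on a host where the same repository lives under a different prefix, because only the part below the repository directory is carried along.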