Is there a way to hard-link all the duplicate objects in a folder containing multiple Git repositories?
Explanation:
I am hosting a Git server on my company server (a Linux machine). The idea is to have a main canonical repository to which not every user has push access; instead, every user forks the canonical repository (clones it into the user's home directory on the server, which actually creates hard-links).
/canonical/Repo
/Dev1/Repo (objects hard-linked to /canonical/Repo when initially cloned)
/Dev2/Repo (objects hard-linked to /canonical/Repo when initially cloned)
This all works fine. The problem arises when:
Dev1: Pushes a huge commit to his fork on the server (/Dev1/Repo).
Dev2: Fetches that onto his local system, makes his own changes, and pushes them to his own fork on the server (/Dev2/Repo).
(Now the same 'huge' file resides in both the developer's forks on the server. It does not create a hard-link automatically.)
This is eating up my server space like crazy!
How can I create hard-links between the objects that are duplicated across the two forks (or the canonical repository, for that matter), so that server space is saved and each developer, when cloning from his/her fork to his/her local machine, still gets all the data?
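To get a feel for how much is duplicated, one can list the object IDs stored in each fork and intersect the lists. The following is a minimal sketch, not something from the question: the repository names `canonical`, `dev1.git` and `dev2.git` and the helper `list_objects` are hypothetical, and `--no-hardlinks` is used only to force real copies in the throwaway setup.

```shell
#!/bin/sh
# Sketch: count objects that are stored in both of two forks.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway "canonical" repository with a single commit (hypothetical content).
git init -q canonical
echo "some content" >canonical/file.txt
git -C canonical add file.txt
git -C canonical -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

# Two bare "forks"; --no-hardlinks forces real copies instead of hard-links.
git clone -q --bare --no-hardlinks canonical dev1.git
git clone -q --bare --no-hardlinks canonical dev2.git

# Print every object ID known to a repository, sorted for comm(1).
list_objects() {
    git -C "$1" cat-file --batch-all-objects --batch-check='%(objectname)' | sort
}

list_objects dev1.git >dev1.ids
list_objects dev2.git >dev2.ids

# comm -12 keeps only the lines common to both sorted inputs.
dup_count=$(comm -12 dev1.ids dev2.ids | wc -l | tr -d ' ')
echo "objects stored in both forks: $dup_count"
```

Note that `--batch-all-objects` also reports objects reachable through alternates, so this count is only meaningful before any alternates file is in place.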
I have decided to do this:
shared-objects-database.git/
foo.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)
bar.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)
baz.git/
    objects/info/alternates (will have ../../shared-objects-database.git/objects)
All the forks will have an entry in their objects/info/alternates file that gives a relative path to the objects' database repository.
It is important to make the object database a repository, because that lets us store the objects and refs of different users' forks, even when those forks share the same repository name.
Steps:
git init --bare shared-objects-database.git
I run the following lines of code either every time there is a push to any fork (via a post-receive hook) or from a cron job:
for r in list-of-forks
do
    # to be safe, I add the "fat" objects to alternates every time
    (
        cd "$r" &&
        git push ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
        echo ../../shared-objects-database.git/objects >objects/info/alternates
    )
done
Then in the next "git gc" all the objects in forks that already exist in alternate will be deleted.
Running git repack -adl is also an option!
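The steps above can be tried end to end against throwaway repositories. This is a sketch under stated assumptions rather than a tested production script: the `seed` repository and the fork names are hypothetical, and packing the shared store before repacking the forks is an extra step added here, on the assumption that loose objects in a fork are only dropped when a pack already contains them.

```shell
#!/bin/sh
# Sketch: push two forks into a shared object store, then repack the forks.
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical seed repository standing in for the canonical one.
git init -q seed
echo "some content" >seed/file.txt
git -C seed add file.txt
git -C seed -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

git init -q --bare shared-objects-database.git
git clone -q --bare --no-hardlinks seed foo.git
git clone -q --bare --no-hardlinks seed bar.git

for r in foo.git bar.git
do
    (
        cd "$r" &&
        git push -q ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
        echo ../../shared-objects-database.git/objects >objects/info/alternates
    )
done

# Assumption of this sketch (not a step from the post): pack the shared
# store first so the forks can treat their own copies as redundant.
git -C shared-objects-database.git repack -q -a -d

for r in foo.git bar.git
do
    # -a: repack everything, -d: drop redundant packs and loose objects,
    # -l: leave out objects that are available through alternates.
    git -C "$r" repack -q -a -d -l
    # Show what is still stored locally in the fork.
    git -C "$r" count-objects -v
done

alternates_path=$(cat foo.git/objects/info/alternates)
```

Comparing the `count-objects -v` output before and after the repack is a simple way to check whether the fork actually gave up its local copies.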
This way we save space so that two users pushing the same data on their respective forks on the server will share the objects.
We need to set the gc.pruneExpire variable to never in the shared-objects-database repository. Just to be safe!
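As a small illustration, the setting can be applied with `git config` inside the shared repository. The scratch directory is only there to keep the sketch self-contained; on a real server the path would be wherever the shared store lives.

```shell
#!/bin/sh
# Sketch: disable expiry-based pruning in the shared object store.
set -e
work=$(mktemp -d)
shared="$work/shared-objects-database.git"

git init -q --bare "$shared"

# With "never", "git gc" in this repository does not prune loose unreachable objects.
git -C "$shared" config gc.pruneExpire never

# Read the value back to confirm it was stored.
prune_expire=$(git -C "$shared" config --get gc.pruneExpire)
echo "gc.pruneExpire = $prune_expire"
```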
To occasionally prune objects, add all forks as remotes to the shared repository, fetch, and prune! Git will do the rest!
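One possible reading of that pruning step is sketched below. It is an interpretation, not the author's exact procedure: the fork names and the `seed` repository are hypothetical, the fetch refspec is chosen to mirror the `refs/remotes/<fork>/*` layout used by the push loop, and `--prune=now` is an explicit one-off override of the `never` setting, so it should only be run when every fork's refs have just been fetched.

```shell
#!/bin/sh
# Sketch: refresh the shared store from all forks, then prune explicitly.
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical seed repository and two forks, for illustration only.
git init -q seed
echo "some content" >seed/file.txt
git -C seed add file.txt
git -C seed -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"
git clone -q --bare --no-hardlinks seed foo.git
git clone -q --bare --no-hardlinks seed bar.git

git init -q --bare shared-objects-database.git
cd shared-objects-database.git
git config gc.pruneExpire never

for r in foo.git bar.git
do
    git remote add "$r" "../$r"
    # Mirror every ref of the fork under refs/remotes/<fork>/.
    git config "remote.$r.fetch" "+refs/*:refs/remotes/$r/*"
done

# --prune drops tracking refs whose source ref was deleted in the fork.
git fetch -q --all --prune

# Explicit prune; objects no longer reachable from any fetched ref are removed.
git gc -q --prune=now

ref_count=$(git for-each-ref refs/remotes/ | wc -l | tr -d ' ')
echo "refs tracked in the shared store: $ref_count"
```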
(I finally found a solution that works for me! Not tested in production! :p Thanks to this post.)
"Now the same 'huge' file resides in both the developer's forks on the server. It does not create a hard-link automatically"
Actually, with Git 2.20, that issue might disappear, because of delta islands, a new way of doing delta computation so that an object that exists in one fork is not made into a delta against another object that does not appear in the same forked repository.
See commit fe0ac2f, commit 108f530, commit f64ba53 (16 Aug 2018) by Christian Couder (chriscool).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
See commit 9eb0986, commit 16d75fa, commit 28b8a73, commit c8d521f (16 Aug 2018) by Jeff King (peff).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit f3504ea, 17 Sep 2018)
Add delta-islands.{c,h}
Hosting providers that allow users to "fork" existing repositories want those forks to share as much disk space as possible.
Alternates are an existing solution to keep all the objects from all the forks into a unique central repository, but this can have some drawbacks.
Especially when packing the central repository, deltas will be created between objects from different forks. This can make cloning or fetching a fork much slower and much more CPU intensive as Git might have to compute new deltas for many objects to avoid sending objects from a different fork.
Because the inefficiency primarily arises when an object is deltified against another object that does not exist in the same fork, we partition objects into sets that appear in the same fork, and define "delta islands".
When finding delta base, we do not allow an object outside the same island to be considered as its base. So "delta islands" is a way to store objects from different forks in the same repository and packfile without having deltas between objects from different forks.
This patch implements the delta islands mechanism in "delta-islands.{c,h}", but does not yet make use of it. A few new fields are added in 'struct object_entry' in "pack-objects.h" though.
See Documentation/git-pack-objects.txt, section "Delta Islands":
DELTA ISLANDS
When possible, pack-objects tries to reuse existing on-disk deltas to avoid having to search for new ones on the fly. This is an important optimization for serving fetches, because it means the server can avoid inflating most objects at all and just send the bytes directly from disk.
This optimization can't work when an object is stored as a delta against a base which the receiver does not have (and which we are not already sending). In that case the server "breaks" the delta and has to find a new one, which has a high CPU cost. Therefore it's important for performance that the set of objects in on-disk delta relationships match what a client would fetch.
In a normal repository, this tends to work automatically.
The objects are mostly reachable from the branches and tags, and that's what clients fetch. Any deltas we find on the server are likely to be between objects the client has or will have.
But in some repository setups, you may have several related but separate groups of ref tips, with clients tending to fetch those groups independently.
For example, imagine that you are hosting several "forks" of a repository in a single shared object store, and letting clients view them as separate repositories through GIT_NAMESPACE or separate repositories using the alternates mechanism.
A naive repack may find that the optimal delta for an object is against a base that is only found in another fork.
But when a client fetches, they will not have the base object, and we'll have to find a new delta on the fly.
A similar situation may exist if you have many refs outside of refs/heads/ and refs/tags/ that point to related objects (e.g., refs/pull or refs/changes used by some hosting providers). By default, clients fetch only heads and tags, and deltas against objects found only in those other groups cannot be sent as-is.
Delta islands solve this problem by allowing you to group your refs into distinct "islands".
Pack-objects computes which objects are reachable from which islands, and refuses to make a delta from an object A against a base which is not present in all of A's islands. This results in slightly larger packs (because we miss some delta opportunities), but guarantees that a fetch of one island will not have to recompute deltas on the fly due to crossing island boundaries.
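For the shared-store layout discussed earlier, islands could be configured roughly as follows. This is a sketch, assuming Git 2.20 or later: `pack.island` and `repack.useDeltaIslands` are the documented configuration keys, while the regular expression is an assumption chosen to match refs stored as `refs/remotes/<fork>/...`, so that each fork name (the capture group) becomes its own island.

```shell
#!/bin/sh
# Sketch: one delta island per fork in a shared object store.
set -e
work=$(mktemp -d)
cd "$work"

git init -q --bare shared-objects-database.git
cd shared-objects-database.git

# Refs whose capture group yields the same value land in the same island.
# The pattern is an assumption matching refs/remotes/<fork>/... layouts.
git config pack.island 'refs/remotes/([^/]+)/'

# Make "git repack" behave as if --delta-islands had been passed.
git config repack.useDeltaIslands true

# In a populated store, this is where the island rules take effect.
git repack -q -a -d

island_pattern=$(git config --get pack.island)
echo "pack.island = $island_pattern"
```

The trade-off described above still applies: packs may come out slightly larger, in exchange for not having to recompute deltas when a single fork is fetched.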
A side effect though: some commands were more verbose. Git 2.23 (Q3 2019) fixes this.
See commit bdbdf42 (20 Jun 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a4c8352, 09 Jul 2019)
delta-islands: respect progress flag
The delta island code always prints "Marked %d islands", even if progress has been suppressed with --no-progress or by sending stderr to a non-tty.
Let's pass a progress boolean to load_delta_islands().
We already do the same thing for the progress meter in resolve_tree_islands().
Source: https://stackoverflow.com/questions/25843553/deduplicate-git-forks-on-a-server