As far as I know all distributed revision control systems require you to clone the whole repository. For this reason is it not wise to put huge amounts of content into one s
As of version 2.0, it is not possible to make a so-called "narrow clone" with Mercurial, that is, a clone where you only retrieve data for a specific sub-directory. We call it a "shallow clone" when you only retrieve part of the history, say, the last 100 revisions.
As you say, there is nothing in the common DAG-based history model that excludes this feature and we have been working on it. Peter Arrenbrecht, a Mercurial contributor, has implemented two different approaches for narrow clones, but neither approach has been merged yet.
Btw, you can of course split an existing Mercurial repository into pieces where each smaller repository only has the history for a specific sub-directory of the original repository. The convert extension is the tool for this. Each of the smaller repositories will be unrelated to the bigger repository, though — the tricky part is to make the splitting seamless so that the changesets keep their identities.
As of Git 2.17 (Q2 2018, 10 years later), it will be possible to do what Mercurial planned to implement: a "narrow clone", that is, a clone where you only retrieve data for a specific sub-directory.
This is also called "partial clone".
That differs from the current
See commit 3aa6694, commit aa57b87, commit 35a7ae9, commit 1e1e39b, commit acb0c57, commit bc2d0c3, commit 640d8b7, commit 10ac85c (08 Dec 2017) by Jeff Hostetler (jeffhostetler).
See commit a1c6d7c, commit c0c578b, commit 548719f, commit a174334, commit 0b6069f (08 Dec 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 6bed209, 13 Feb 2018)
Here are the tests for a partial clone:
git clone --no-checkout --filter=blob:none "file://$(pwd)/srv.bare" pc1
There are other commits involved in that implementation of a narrow/partial clone.
In particular, commit 8b4c010:
sha1_file: support lazily fetching missing objects
Teach sha1_file to fetch objects from the remote configured in extensions.partialclone whenever an object is requested but missing.
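To see that lazy fetching in action, here is a minimal sketch, assuming a reasonably recent Git and a throwaway temporary directory. The repository names (src, srv.bare, pc1) and the file name are made up for illustration, and the two uploadpack.* settings are there so that the local "server" repository accepts filters and requests for individual objects.

```shell
# Hypothetical sandbox: everything lives under a temporary directory.
work=$(mktemp -d)

# Build a tiny "server" repository holding a single blob.
git init -q "$work/src"
echo "hello" > "$work/src/file.txt"
git -C "$work/src" add file.txt
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add file.txt"
git clone -q --bare "$work/src" "$work/srv.bare"

# Server side: allow filters and fetching objects by ID.
git -C "$work/srv.bare" config uploadpack.allowFilter true
git -C "$work/srv.bare" config uploadpack.allowAnySHA1InWant true

# Partial clone without blobs (same command as in the test above).
git clone -q --no-checkout --filter=blob:none "file://$work/srv.bare" "$work/pc1"

# Missing (promised) objects are listed with a leading "?".
missing_before=$(git -C "$work/pc1" rev-list --objects --missing=print HEAD | grep -c '^?')

# Reading the blob triggers an on-demand fetch from the promisor remote.
content=$(git -C "$work/pc1" cat-file -p HEAD:file.txt)

missing_after=$(git -C "$work/pc1" rev-list --objects --missing=print HEAD | grep -c '^?')
echo "missing before: $missing_before, after: $missing_after, content: $content"
```

The blob is absent right after the clone and only arrives once a command actually needs it, which is the behaviour the commit message describes.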
Warning regarding Git 2.17/2.18: the recently added, experimental "partial clone" feature kicked in when it shouldn't, namely when no partial-clone filter is defined even though extensions.partialclone is set.
See commit cac1137 (11 Jun 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 92e1bbc, 28 Jun 2018)
upload-pack: disable object filtering when disabled by config
When upload-pack gained partial clone support (v2.17.0-rc0~132^2~12, 2017-12-08), it was guarded by the uploadpack.allowFilter config item to allow server operators to control when they start supporting it.
That config item didn't go far enough, though: it controls whether the 'filter' capability is advertised, but if a (custom) client ignores the capability advertisement and passes a filter specification anyway, the server would handle that despite allowFilter being false.
This is particularly significant if a security bug is discovered in this new experimental partial clone code.
Installations without uploadpack.allowFilter ought not to be affected since they don't intend to support partial clone, but they would be swept up into being vulnerable.
This is enhanced with Git 2.20 (Q4 2018), since "git fetch $repo $object" in a partial clone did not correctly fetch the asked-for object that is referenced by an object in a promisor packfile, which has been fixed.
See commit 35f9e3e, commit 4937291 (21 Sep 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a1e9dff, 19 Oct 2018)
fetch: in partial clone, check presence of targets
When fetching an object that is known as a promisor object to the local repository, the connectivity check in quickfetch() in builtin/fetch.c succeeds, causing object transfer to be bypassed.
However, this should not happen if that object is merely promised and not actually present.
Because this happens, when a user invokes "git fetch origin <sha-1>" on the command-line, the <sha-1> object may not actually be fetched even though the command returns an exit code of 0. This is a similar issue (but with a different cause) to the one fixed by a0c9016 ("upload-pack: send refs' objects despite "filter"", 2018-07-09, Git v2.19.0-rc0).
Therefore, update quickfetch() to also directly check for the presence of all objects to be fetched.
You can list objects of a partial clone, excluding "promisor" objects, with git rev-list --exclude-promisor-objects
(For internal use only.) Prefilter object traversal at promisor boundary.
This is used with partial clone.
This is stronger than --missing=allow-promisor because it limits the traversal, rather than just silencing errors about missing objects.
But make sure to use Git 2.21 (Q1 2019) to avoid a segfault.
See commit 4cf6786 (05 Dec 2018) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit c333fe7, 14 Jan 2019)
"git rev-list --exclude-promisor-objects" had to take an object that does not exist locally (and is lazily available) from the command line without barfing, but the code dereferenced NULL.
list-objects.c: don't segfault for missing cmdline objects
When a command is invoked with both --exclude-promisor-objects, --objects-edge-aggressive, and a missing object on the command line, the rev_info.cmdline array could get a NULL pointer for the value of an 'item' field.
Prevent dereferencing of a NULL pointer in that situation.
Note that Git 2.21 (Q1 2019) fixes a bug:
See commit bbcde41 (03 Dec 2018) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 6e5be1f, 14 Jan 2019)
exclude-promisor-objects: declare when option is allowed
The --exclude-promisor-objects option causes some funny behavior in at least two commands: log and blame.
It causes a BUG crash:
    $ git log --exclude-promisor-objects
    BUG: revision.c:2143: exclude_promisor_objects can only be used when fetch_if_missing is 0
    Aborted [134]
Fix this such that the option is treated like any other unknown option.
The commands that must support it are limited, so declare in those commands that the flag is supported.
In particular:
    pack-objects
    prune
    rev-list
The commands were found by searching for logic which parses --exclude-promisor-objects outside of revision.c.
Extra logic outside of revision.c is needed because fetch_if_missing must be turned on before revision.c sees the option or it will BUG-crash. The above list is supported by the fact that no other command is introspectively invoked by another command passing --exclude-promisor-object.
Git 2.22 (Q2 2019) optimizes narrow clone:
While running "git diff" in a lazy clone, we can know upfront which missing blobs we will need, instead of waiting for the on-demand machinery to discover them one by one.
Aim to achieve better performance by batching the request for these promised blobs.
See commit 7fbbcb2 (05 Apr 2019), and commit 0f4a4fb (29 Mar 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 32dc15d, 25 Apr 2019)
diff: batch fetching of missing blobs
When running a command like "git show" or "git diff" in a partial clone, batch all missing blobs to be fetched as one request.
This is similar to c0c578b ("unpack-trees: batch fetching of missing blobs", 2017-12-08, Git v2.17.0-rc0), but for another command.
Git 2.23 (Q3 2019) future-proofs that batched fetching of missing blobs.
See commit 31f5256 (28 May 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5d5c46b, 17 Jun 2019)
sha1-file: split OBJECT_INFO_FOR_PREFETCH
The OBJECT_INFO_FOR_PREFETCH bitflag was added to sha1-file.c in 0f4a4fb (sha1-file: support OBJECT_INFO_FOR_PREFETCH, 2019-03-29, Git v2.22.0-rc0) and is used to prevent the fetch_objects() method when enabled.
However, there is a problem with the current use.
The definition of OBJECT_INFO_FOR_PREFETCH is given by adding 32 to OBJECT_INFO_QUICK.
This is clearly stated above the definition (in a comment) that this is so OBJECT_INFO_FOR_PREFETCH implies OBJECT_INFO_QUICK.
The problem is that using "flag & OBJECT_INFO_FOR_PREFETCH" means that OBJECT_INFO_QUICK also implies OBJECT_INFO_FOR_PREFETCH.
Split out the single bit from OBJECT_INFO_FOR_PREFETCH into a new OBJECT_INFO_SKIP_FETCH_OBJECT as the single bit and keep OBJECT_INFO_FOR_PREFETCH as the union of two flags.
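The flag problem is easy to reproduce with plain shell arithmetic. This is only an illustrative sketch: the value 32 comes from the commit message, while 8 for OBJECT_INFO_QUICK is an assumption made for the example, and the variable names old_check, new_check and prefetch_check are invented here.

```shell
# 32 is taken from the commit message; 8 for OBJECT_INFO_QUICK is an assumption.
OBJECT_INFO_QUICK=8
OLD_FOR_PREFETCH=$((OBJECT_INFO_QUICK + 32))   # two bits set at once

# A caller that only asks for QUICK.
flags=$OBJECT_INFO_QUICK

# Old check: any shared bit gives a non-zero result, so QUICK alone matches.
if [ $((flags & OLD_FOR_PREFETCH)) -ne 0 ]; then
    old_check=skip-fetch
else
    old_check=fetch
fi

# New layout: a dedicated single bit, with FOR_PREFETCH as the union of both.
OBJECT_INFO_SKIP_FETCH_OBJECT=32
OBJECT_INFO_FOR_PREFETCH=$((OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK))

# New check: only the single bit is tested, so QUICK alone no longer matches.
if [ $((flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) -ne 0 ]; then
    new_check=skip-fetch
else
    new_check=fetch
fi

# A caller that passes FOR_PREFETCH still skips the fetch.
flags=$OBJECT_INFO_FOR_PREFETCH
if [ $((flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) -ne 0 ]; then
    prefetch_check=skip-fetch
else
    prefetch_check=fetch
fi

echo "old: $old_check, new: $new_check, prefetch: $prefetch_check"
```

Testing a multi-bit mask with a bare "&" asks "is any of these bits set?", which is why the fix isolates the one bit that actually carries the "do not fetch" meaning.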
And "git fetch" into a lazy clone forgot to fetch base objects that are necessary to complete delta in a thin packfile, which has been corrected.
See commit 810e193, commit 5718c53 (11 Jun 2019), and commit 8a30a1e, commit 385d1bf (14 May 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8867aa8, 21 Jun 2019)
index-pack: prefetch missing REF_DELTA bases
When fetching, the client sends "have" commit IDs indicating that the server does not need to send any object referenced by those commits, reducing network I/O.
When the client is a partial clone, the client still sends "have"s in this way, even if it does not have every object referenced by a commit it sent as "have".
If a server omits such an object, it is fine: the client could lazily fetch that object before this fetch, and it can still do so after.
The issue is when the server sends a thin pack containing an object that is a REF_DELTA against such a missing object: index-pack fails to fix the thin pack.
When support for lazily fetching missing objects was added in 8b4c010 ("sha1_file: support lazily fetching missing objects", 2017-12-08, Git v2.17.0-rc0), support in index-pack was turned off in the belief that it accesses the repo only to do hash collision checks.
However, this is not true: it also needs to access the repo to resolve REF_DELTA bases.
Support for lazy fetching should still generally be turned off in index-pack because it is used as part of the lazy fetching process itself (if not, infinite loops may occur), but we do need to fetch the REF_DELTA bases.
(When fetching REF_DELTA bases, it is unlikely that those are REF_DELTA themselves, because we do not send "have" when making such fetches.)
To resolve this, prefetch all missing REF_DELTA bases before attempting to resolve them.
This both ensures that all bases are attempted to be fetched, and ensures that we make only one request per index-pack invocation, and not one request per missing object.
Git 2.24 (Q4 2019) fixes on-demand object fetching in lazy clone, which incorrectly tried to fetch commits from submodule projects, while still working in the superproject.
See commit a63694f (20 Aug 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d8b1ce7, 09 Sep 2019)
diff: skip GITLINK when lazy fetching missing objs
In 7fbbcb2 ("diff: batch fetching of missing blobs", 2019-04-08, Git v2.22.0-rc0), diff was taught to batch the fetching of missing objects when operating on a partial clone, but was not taught to refrain from fetching GITLINKs.
Teach diff to check if an object is a GITLINK before including it in the set to be fetched.
Git 2.24 (Q4 2019) also introduces the notion of promisor remote repository.
See commit 4ca9474, commit 60b7a92, commit db27dca, commit 75de085, commit 7e154ba, commit 9a4c507, commit 5e46139, commit fa3d1b6, commit b14ed5a, commit faf2abf, commit 9cfebc1, commit 9e27bea, commit 48de315, commit 2e86067, commit c59c7c8 (25 Jun 2019) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit b9ac6c5, 18 Sep 2019)
The partial-clone documentation defines a promisor repo as:
A remote that can later provide the missing objects is called a promisor remote, as it promises to send the objects when requested.
Initially Git supported only one promisor remote, the origin remote from which the user cloned and that was configured in the "extensions.partialClone" config option.
Later support for more than one promisor remote has been implemented.
Many promisor remotes can be configured and used.
This allows for example a user to have multiple geographically-close cache servers for fetching missing blobs while continuing to do filtered git-fetch commands from the central server.
Remotes that are considered "promisor" remotes are those specified by the following configuration variables:
extensions.partialClone = <name>
remote.<name>.promisor = true
remote.<name>.partialCloneFilter = ...
Only one promisor remote can be configured using the extensions.partialClone config variable. This promisor remote will be the last one tried when fetching objects.
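As a rough sketch of what such a setup could look like, the commands below mark two remotes as promisor remotes in a scratch repository. The remote name "cache" and both example.com URLs are placeholders invented for this example; nothing is fetched, the commands only write configuration.

```shell
# Scratch repository; the remotes below are placeholders and are never contacted.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" remote add origin https://example.com/central.git
git -C "$repo" remote add cache https://cache.example.com/central.git

# Mark both remotes as promisor remotes, each with its own default filter.
git -C "$repo" config remote.origin.promisor true
git -C "$repo" config remote.origin.partialCloneFilter blob:none
git -C "$repo" config remote.cache.promisor true
git -C "$repo" config remote.cache.partialCloneFilter blob:none

# List every remote currently flagged as a promisor.
promisors=$(git -C "$repo" config --get-regexp '^remote\..*\.promisor$')
promisor_count=$(printf '%s\n' "$promisors" | grep -c 'true$')
echo "$promisors"
```

In a real partial clone the first promisor remote is normally set up by "git clone --filter=..." itself; adding a second one by hand, as here, is the "cache server" scenario the documentation mentions.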
Git 2.24 (Q4 2019) also improves the notion of filters in a partial clone.
See commit 90d21f9, commit 5a133e8, commit 489fc9e, commit c269495, commit cf9ceb5, commit f56f764, commit e987df5, commit 842b005, commit 7a7c7f4, commit 9430147 (27 Jun 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 627b826, 18 Sep 2019)
It allows for:
- combining filters such that only objects accepted by all filters are shown.
The motivation for this is to allow getting directory listings without also fetching blobs. This can be done by combining blob:none with tree:<depth>.
There are massive repositories that have larger-than-expected trees - even if you include only a single commit.
A combined filter supports any number of subfilters, and is written in the following form:
combine:<filter 1>+<filter 2>+<filter 3>
- combining of multiple filters by simply repeating the --filter flag.
Before, the user had to combine them in a single flag somewhat awkwardly (e.g. --filter=combine:FOO+BAR), including URL-encoding the individual filters.
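A combined filter can be tried locally with "git rev-list", without any server involved. The following is a small sketch assuming Git 2.24 or later; the repository layout (a.txt and dir/b.txt) is invented for the example, and the depth 2 is an arbitrary choice.

```shell
# Scratch repository with one top-level file and one nested file.
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/dir"
echo "top" > "$repo/a.txt"
echo "nested" > "$repo/dir/b.txt"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "two files"

# Every object reachable from HEAD: the commit, its trees and its blobs.
all=$(git -C "$repo" rev-list --objects HEAD | grep -c .)

# Same traversal, but only objects accepted by BOTH sub-filters:
# no blobs at all, and trees only down to a limited depth.
listing=$(git -C "$repo" rev-list --objects \
    --filter=combine:blob:none+tree:2 HEAD | grep -c .)

echo "all objects: $all, with combined filter: $listing"
```

The same "combine:..." string can be passed to "git clone --filter=..." against a server that allows filtering, which is how a directory listing can be obtained without downloading file contents.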
With Git 2.27 (Q2 2020), "git diff" in a partial clone learned to avoid lazy loading blob objects in more cases when they are not needed.
See commit 95acf11, commit c14b6f8, commit 1c37e86 (07 Apr 2020), and commit db7ed74 (02 Apr 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8f5dc5a, 28 Apr 2020)
diff: restrict when prefetching occurs
Helped-by: Jeff King
Signed-off-by: Jonathan Tan
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08, Git v2.22.0-rc0 -- merge listed in batch #7) optimized "diff" by prefetching blobs in a partial clone, but there are some cases wherein blobs do not need to be prefetched.
In these cases, any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(), diffcore_break(), and diffcore_rename() to prefetch blobs upon the first read of a missing object.
This covers (1), (2), and (3): to cover the rest, teach diffcore_std() to prefetch if the output type is one that includes blob data (and hence blob data will be required later anyway), or if it knows that (4) will be run.
Note the lazy fetching done internally to make missing objects available in a partial clone incorrectly made permanent damage to the partial clone filter in the repository, which has been corrected with Git 2.29 (Q4 2020).
See commit 23547c4 (28 Sep 2020), and commit 625e7f1 (21 Sep 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit e68f0a4, 05 Oct 2020)
fetch: do not override partial clone filter
Signed-off-by: Jonathan Tan
When a fetch with the --filter argument is made, the configured default filter is set even if one already exists. This change was made in 5e46139376 ("builtin/fetch: remove unique promisor remote limitation", 2019-06-25, Git v2.24.0-rc0 -- merge listed in batch #3) - in particular, changing from:
- If this is the FIRST partial-fetch request, we enable partial
- on this repo and remember the given filter-spec as the default
- for subsequent fetches to this remote.
to:
- If this is a partial-fetch request, we enable partial on
- this repo if not already enabled and remember the given
- filter-spec as the default for subsequent fetches to this
- remote.
(The given filter-spec is "remembered" even if there is already an existing one.)
This is problematic whenever a lazy fetch is made, because lazy fetches are made using "git fetch --filter=blob:none", but this will also happen if the user invokes "git fetch --filter=<filter>" manually. Therefore, restore the behavior prior to 5e46139376, which writes a filter-spec only if the current fetch request is the first partial-fetch one (for that remote).
In a distributed RCS, the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in a repository?
At least in Git, the DAG representing the repository history applies to the whole repository, not just a single file. Each commit object points to a "tree" object which represents the entire state of the repository at that time.
Git 1.7 supports "sparse checkouts", which allow you to restrict the size of your working copy. The entire repository data is still cloned, however.
There's a subtree module for git, allowing you to split off a portion of a repository into a new repo and then merge changes to/from the original and the subtree. Here's its readme on github: http://github.com/apenwarr/git-subtree/blob/master/git-subtree.txt
I hope one of these RCS's will add narrow clone capability. My understanding is that the architecture of GIT (changes/moves tracked across the whole repo) makes this very difficult.
Bazaar prides itself on supporting many different types of workflows. Lack of narrow clone capability prohibits an SVN/CVS like workflow in bzr/hg/git, so I'm hoping they'll be motivated to find some way to do this.
New features shouldn't come at the expense of basic functionality, like the ability to fetch a single file/directory from the repo. The "distributed" feature of modern rcs's is "cool," but in my opinion discourages good development practices (frequent merges / continuous integration). These new RCS's all seem to lack very basic functionality. Even SVN without real branching/tagging support seemed like a step backwards from CVS imo.
From git help clone:
--depth <depth>
Create a shallow clone with a history truncated to the specified number of revisions. A shallow repository has a number of limitations (you
cannot clone or fetch from it, nor push from nor into it), but is adequate if you are only interested in the recent history of a large project
with a long history, and would want to send in fixes as patches.
Does that provide something like what you're looking for?
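For a quick feel of what --depth does, here is a small self-contained sketch using a throwaway local repository; the directory names and the three demo commits are made up for illustration. Note that the help text quoted above is from an older Git, and several of the listed limitations were relaxed in later releases.

```shell
# Throwaway source repository with three revisions.
work=$(mktemp -d)
git init -q "$work/src"
for i in 1 2 3; do
    echo "$i" > "$work/src/file.txt"
    git -C "$work/src" add file.txt
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "revision $i"
done

# Shallow clone: history truncated to the most recent revision.
# (A file:// URL is used because --depth is ignored for plain local paths.)
git clone -q --depth 1 "file://$work/src" "$work/shallow"

full_count=$(git -C "$work/src" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)
echo "full: $full_count, shallow: $shallow_count, shallow repo: $is_shallow"
```

This limits history (a "shallow clone" in the terminology used earlier), not the set of directories, so it is not a narrow clone.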