Question
As far as I know, all distributed revision control systems require you to clone the whole repository. For this reason it is not wise to put huge amounts of content into one single repository (thanks for this answer). I know that this is not a bug but a feature, but I wonder whether this is a requirement for all distributed revision control systems.
In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in the repository? Maybe I am missing something, but the following use-cases are hard to do:
- clone only a part of a repository
- merge two repositories (preserving their histories!)
- copy some files with their history from one repository to another
If I reuse parts of other people's code from multiple projects I cannot preserve their full history. At least in git I can think of a (rather complex) workaround:
- clone a full repository
- delete all content that I am not interested in
- rewrite the history to delete everything that is not in the master
- merge the remaining repository into an existing repository
I don't know if this is also possible with Mercurial or Bazaar but at least it is not easy at all. So is there any distributed rcs that supports partial checkout/clone by design? It should support one simple command to get a single file with its history from one repository and merge it into another. This way you would not need to think about how to structure your content into repositories and submodules but you could happily split and merge repositories as needed (the extreme would be one repository for each single file).
Answer 1:
As of version 2.0, it is not possible to make a so-called "narrow clone" with Mercurial, that is, a clone where you only retrieve data for a specific sub-directory. We call it a "shallow clone" when you only retrieve part of the history, say, the last 100 revisions.
As you say, there is nothing in the common DAG-based history model that excludes this feature and we have been working on it. Peter Arrenbrecht, a Mercurial contributor, has implemented two different approaches for narrow clones, but neither approach has been merged yet.
Btw, you can of course split an existing Mercurial repository into pieces where each smaller repository only has the history for a specific sub-directory of the original repository. The convert extension is the tool for this. Each of the smaller repositories will be unrelated to the bigger repository, though — the tricky part is to make the splitting seamless so that the changesets keep their identities.
Answer 2:
There's a subtree module for git, allowing you to split off a portion of a repository into a new repo and then merge changes to/from the original and the subtree. Here's its readme on github: http://github.com/apenwarr/git-subtree/blob/master/git-subtree.txt
Answer 3:
In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in the repository?
At least in Git, the DAG representing the repository history applies to the whole repository, not just a single file. Each commit object points to a "tree" object which represents the entire state of the repository at that time.
Git 1.7 supports "sparse checkouts", which allow you to restrict the size of your working copy. The entire repository data is still cloned, however.
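A minimal sketch of such a sparse checkout, assuming the classic core.sparseCheckout mechanism and a throwaway repository. The directory names (docs/, src/) and the demo identity are made up for illustration:

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Build a tiny repository with two top-level directories (hypothetical layout)
git init -q demo
cd demo
mkdir docs src
echo "manual" > docs/readme.txt
echo "code" > src/main.c
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Enable sparse checkout, list the paths to keep, then re-read the tree
git config core.sparseCheckout true
mkdir -p .git/info
echo "docs/" > .git/info/sparse-checkout
git read-tree -mu HEAD

# Files present in the working copy vs. files still tracked in the repository
sparse_files=$(find . -path ./.git -prune -o -type f -print | sort)
tracked_count=$(git ls-files | wc -l | tr -d ' ')
echo "working copy: $sparse_files"
echo "tracked files: $tracked_count"
```

Only docs/ remains in the working copy, while the index and object database still hold every file, which is the limitation described above.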
Answer 4:
As of Git 2.17 (Q2 2018, 10 years later), it will be possible to do what Mercurial planned to implement: a "narrow clone", that is, a clone where you only retrieve data for a specific sub-directory.
This is also called "partial clone".
That differs from the current
- shallow clone
- copy of what you need from the cloned repo in another working folder.
See commit 3aa6694, commit aa57b87, commit 35a7ae9, commit 1e1e39b, commit acb0c57, commit bc2d0c3, commit 640d8b7, commit 10ac85c (08 Dec 2017) by Jeff Hostetler (jeffhostetler).
See commit a1c6d7c, commit c0c578b, commit 548719f, commit a174334, commit 0b6069f (08 Dec 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 6bed209, 13 Feb 2018)
Here are the tests for a partial clone:
git clone --no-checkout --filter=blob:none "file://$(pwd)/srv.bare" pc1
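To try that command outside Git's test suite, something like the following sketch should work with Git 2.17 or later. The repository contents, the demo identity and the temporary directory are assumptions; only the clone line comes from the test above:

```shell
set -e
work=$(mktemp -d)
cd "$work"

# A source repository with one blob (hypothetical content)
git init -q src
echo "payload" > src/big.bin
git -C src add big.bin
git -C src -c user.name=demo -c user.email=demo@example.com commit -q -m "add blob"

# Bare "server" copy; filtering must be allowed on the serving side
git clone -q --bare src srv.bare
git -C srv.bare config uploadpack.allowFilter true

# Partial clone: commits and trees are transferred, blobs are omitted
git clone -q --no-checkout --filter=blob:none "file://$work/srv.bare" pc1

# --missing=print lists objects that are promised but not present, prefixed with '?'
missing=$(git -C pc1 rev-list --objects --missing=print HEAD | grep -c '^?')
echo "missing objects: $missing"
```

The --no-checkout flag matters here: a checkout would need the blob and so would trigger a lazy fetch for it.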
There are other commits involved in that implementation of a narrow/partial clone.
In particular, commit 8b4c010:
sha1_file: support lazily fetching missing objects
Teach sha1_file to fetch objects from the remote configured in extensions.partialclone whenever an object is requested but missing.
Warning regarding Git 2.17/2.18: the recently added experimental "partial clone" feature kicked in when it shouldn't, namely when there is no partial-clone filter defined even if extensions.partialclone is set.
See commit cac1137 (11 Jun 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 92e1bbc, 28 Jun 2018)
upload-pack: disable object filtering when disabled by config
When upload-pack gained partial clone support (v2.17.0-rc0~132^2~12, 2017-12-08), it was guarded by the uploadpack.allowFilter config item to allow server operators to control when they start supporting it.
That config item didn't go far enough, though: it controls whether the 'filter' capability is advertised, but if a (custom) client ignores the capability advertisement and passes a filter specification anyway, the server would handle that despite allowFilter being false.
This is particularly significant if a security bug is discovered in this new experimental partial clone code.
Installations without uploadpack.allowFilter ought not to be affected since they don't intend to support partial clone, but they would be swept up into being vulnerable.
This is enhanced with Git 2.20 (Q4 2018), since "git fetch $repo $object" in a partial clone did not correctly fetch the asked-for object that is referenced by an object in promisor packfile, which has been fixed.
See commit 35f9e3e, commit 4937291 (21 Sep 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a1e9dff, 19 Oct 2018)
fetch: in partial clone, check presence of targets
When fetching an object that is known as a promisor object to the local repository, the connectivity check in quickfetch() in builtin/fetch.c succeeds, causing object transfer to be bypassed.
However, this should not happen if that object is merely promised and not actually present.
Because this happens, when a user invokes "git fetch origin <sha-1>" on the command-line, the <sha-1> object may not actually be fetched even though the command returns an exit code of 0. This is a similar issue (but with a different cause) to the one fixed by a0c9016 ("upload-pack: send refs' objects despite "filter"", 2018-07-09, Git v2.19.0-rc0).
Therefore, update quickfetch() to also directly check for the presence of all objects to be fetched.
You can list objects of a partial clone, excluding "promisor" objects, with git rev-list --exclude-promisor-objects:
(For internal use only.) Prefilter object traversal at promisor boundary.
This is used with partial clone.
This is stronger than --missing=allow-promisor because it limits the traversal, rather than just silencing errors about missing objects.
But make sure to use Git 2.21 (Q1 2019) to avoid segfault.
See commit 4cf6786 (05 Dec 2018) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit c333fe7, 14 Jan 2019)
"git rev-list --exclude-promisor-objects" had to take an object that does not exist locally (and is lazily available) from the command line without barfing, but the code dereferenced NULL.
list-objects.c: don't segfault for missing cmdline objects
When a command is invoked with both --exclude-promisor-objects, --objects-edge-aggressive, and a missing object on the command line, the rev_info.cmdline array could get a NULL pointer for the value of an 'item' field.
Prevent dereferencing of a NULL pointer in that situation.
Note that Git 2.21 (Q1 2019) fixes a bug:
See commit bbcde41 (03 Dec 2018) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 6e5be1f, 14 Jan 2019)
exclude-promisor-objects: declare when option is allowed
The --exclude-promisor-objects option causes some funny behavior in at least two commands: log and blame.
It causes a BUG crash:
$ git log --exclude-promisor-objects
BUG: revision.c:2143: exclude_promisor_objects can only be used when fetch_if_missing is 0
Aborted [134]
Fix this such that the option is treated like any other unknown option.
The commands that must support it are limited, so declare in those commands that the flag is supported.
In particular:
- pack-objects
- prune
- rev-list
The commands were found by searching for logic which parses --exclude-promisor-objects outside of revision.c.
Extra logic outside of revision.c is needed because fetch_if_missing must be turned on before revision.c sees the option or it will BUG-crash. The above list is supported by the fact that no other command is introspectively invoked by another command passing --exclude-promisor-object.
Git 2.22 (Q2 2019) optimizes narrow clone:
While running "git diff" in a lazy clone, we can upfront know which missing blobs we will need, instead of waiting for the on-demand machinery to discover them one by one.
Aim to achieve better performance by batching the request for these promised blobs.
See commit 7fbbcb2 (05 Apr 2019), and commit 0f4a4fb (29 Mar 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 32dc15d, 25 Apr 2019)
diff: batch fetching of missing blobs
When running a command like "git show" or "git diff" in a partial clone, batch all missing blobs to be fetched as one request.
This is similar to c0c578b ("unpack-trees: batch fetching of missing blobs", 2017-12-08, Git v2.17.0-rc0), but for another command.
Git 2.23 (Q3 2019) will futureproof that batch missing blob part.
See commit 31f5256 (28 May 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5d5c46b, 17 Jun 2019)
sha1-file: split OBJECT_INFO_FOR_PREFETCH
The OBJECT_INFO_FOR_PREFETCH bitflag was added to sha1-file.c in 0f4a4fb (sha1-file: support OBJECT_INFO_FOR_PREFETCH, 2019-03-29, Git v2.22.0-rc0) and is used to prevent the fetch_objects() method when enabled.
However, there is a problem with the current use.
The definition of OBJECT_INFO_FOR_PREFETCH is given by adding 32 to OBJECT_INFO_QUICK.
This is clearly stated above the definition (in a comment) that this is so OBJECT_INFO_FOR_PREFETCH implies OBJECT_INFO_QUICK.
The problem is that using "flag & OBJECT_INFO_FOR_PREFETCH" means that OBJECT_INFO_QUICK also implies OBJECT_INFO_FOR_PREFETCH.
Split out the single bit from OBJECT_INFO_FOR_PREFETCH into a new OBJECT_INFO_SKIP_FETCH_OBJECT as the single bit and keep OBJECT_INFO_FOR_PREFETCH as the union of two flags.
And "git fetch" into a lazy clone forgot to fetch base objects that are necessary to complete delta in a thin packfile, which has been corrected.
See commit 810e193, commit 5718c53 (11 Jun 2019), and commit 8a30a1e, commit 385d1bf (14 May 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8867aa8, 21 Jun 2019)
index-pack: prefetch missing REF_DELTA bases
When fetching, the client sends "have" commit IDs indicating that the server does not need to send any object referenced by those commits, reducing network I/O.
When the client is a partial clone, the client still sends "have"s in this way, even if it does not have every object referenced by a commit it sent as "have".
If a server omits such an object, it is fine: the client could lazily fetch that object before this fetch, and it can still do so after.
The issue is when the server sends a thin pack containing an object that is a REF_DELTA against such a missing object: index-pack fails to fix the thin pack.
When support for lazily fetching missing objects was added in 8b4c010 ("sha1_file: support lazily fetching missing objects", 2017-12-08, Git v2.17.0-rc0), support in index-pack was turned off in the belief that it accesses the repo only to do hash collision checks.
However, this is not true: it also needs to access the repo to resolve REF_DELTA bases.
Support for lazy fetching should still generally be turned off in index-pack because it is used as part of the lazy fetching process itself (if not, infinite loops may occur), but we do need to fetch the REF_DELTA bases.
(When fetching REF_DELTA bases, it is unlikely that those are REF_DELTA themselves, because we do not send "have" when making such fetches.)
To resolve this, prefetch all missing REF_DELTA bases before attempting to resolve them.
This both ensures that all bases are attempted to be fetched, and ensures that we make only one request per index-pack invocation, and not one request per missing object.
Git 2.24 (Q4 2019) fixes on-demand object fetching in lazy clone, which incorrectly tried to fetch commits from submodule projects, while still working in the superproject.
See commit a63694f (20 Aug 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d8b1ce7, 09 Sep 2019)
diff: skip GITLINK when lazy fetching missing objs
In 7fbbcb2 ("diff: batch fetching of missing blobs", 2019-04-08, Git v2.22.0-rc0), diff was taught to batch the fetching of missing objects when operating on a partial clone, but was not taught to refrain from fetching GITLINKs.
Teach diff to check if an object is a GITLINK before including it in the set to be fetched.
Git 2.24 (Q4 2019) also introduces the notion of promisor remote repository.
See commit 4ca9474, commit 60b7a92, commit db27dca, commit 75de085, commit 7e154ba, commit 9a4c507, commit 5e46139, commit fa3d1b6, commit b14ed5a, commit faf2abf, commit 9cfebc1, commit 9e27bea, commit 48de315, commit 2e86067, commit c59c7c8 (25 Jun 2019) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit b9ac6c5, 18 Sep 2019)
The partial-clone documentation defines a promisor repo as:
A remote that can later provide the missing objects is called a promisor remote, as it promises to send the objects when requested.
Initially Git supported only one promisor remote, the origin remote from which the user cloned and that was configured in the "extensions.partialClone" config option.
Later support for more than one promisor remote has been implemented.
Many promisor remotes can be configured and used.
This allows for example a user to have multiple geographically-close cache servers for fetching missing blobs while continuing to do filtered git-fetch commands from the central server.
Remotes that are considered "promisor" remotes are those specified by the following configuration variables:
extensions.partialClone = <name>
remote.<name>.promisor = true
remote.<name>.partialCloneFilter = ...
Only one promisor remote can be configured using the extensions.partialClone config variable. This promisor remote will be the last one tried when fetching objects.
Git 2.24 (Q4 2019) also improves the notion of filters in a partial clone.
See commit 90d21f9, commit 5a133e8, commit 489fc9e, commit c269495, commit cf9ceb5, commit f56f764, commit e987df5, commit 842b005, commit 7a7c7f4, commit 9430147 (27 Jun 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 627b826, 18 Sep 2019)
It allows for:
- combining filters such that only objects accepted by all filters are shown.
The motivation for this is to allow getting directory listings without also fetching blobs. This can be done by combining blob:none with tree:<depth>.
There are massive repositories that have larger-than-expected trees - even if you include only a single commit.
A combined filter supports any number of subfilters, and is written in the following form:
combine:<filter 1>+<filter 2>+<filter 3>
- combining of multiple filters by simply repeating the --filter flag.
Before, the user had to combine them in a single flag somewhat awkwardly (e.g. --filter=combine:FOO+BAR), including URL-encoding the individual filters.
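As a rough sketch of the repeated --filter form, assuming Git 2.24 or later and a throwaway local repository (the file names and the demo identity are invented for the example):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Source repository with a top-level file and a nested directory (hypothetical layout)
git init -q src
mkdir src/sub
echo "top" > src/top.txt
echo "nested" > src/sub/nested.txt
git -C src add .
git -C src -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Bare copy acting as the server; it must allow filtering
git clone -q --bare src srv.bare
git -C srv.bare config uploadpack.allowFilter true

# Repeating --filter combines the filters: no blobs, and trees limited by depth
git clone -q --no-checkout --filter=blob:none --filter=tree:1 \
    "file://$work/srv.bare" combined

# Count objects that were promised but not transferred
missing=$(git -C combined rev-list --objects --missing=print HEAD | grep -c '^?')
echo "objects not fetched: $missing"
```

The same request could presumably be spelled as a single --filter=combine:blob:none+tree:1 argument, following the form quoted above.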
Answer 5:
In bazaar you can split and join parts of a repository.
The split-command allows you to split a repository into multiple repositories. The join-command allows you to merge repositories. Both keep the history.
However, this isn't as handy as the SVN model, where you can checkout/commit for a sub-tree.
There's a planned feature called Nested-Trees for bazaar, which maybe would allow partial checkouts.
Answer 6:
I hope one of these RCS's will add narrow clone capability. My understanding is that the architecture of GIT (changes/moves tracked across the whole repo) makes this very difficult.
Bazaar prides itself on supporting many different types of workflows. Lack of narrow clone capability prohibits an SVN/CVS like workflow in bzr/hg/git, so I'm hoping they'll be motivated to find some way to do this.
New features shouldn't come at the expense of basic functionality, like the ability to fetch a single file/directory from the repo. The "distributed" feature of modern rcs's is "cool," but in my opinion discourages good development practices (frequent merges / continuous integration). These new RCS's all seem to lack very basic functionality. Even SVN without real branching/tagging support seemed like a step backwards from CVS imo.
Answer 7:
From git help clone:
--depth <depth>
Create a shallow clone with a history truncated to the specified number of revisions. A shallow repository has a number of limitations (you
cannot clone or fetch from it, nor push from nor into it), but is adequate if you are only interested in the recent history of a large project
with a long history, and would want to send in fixes as patches.
Does that provide something like what you're looking for?
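For illustration, a small sketch of what --depth does, using a throwaway repository with three commits (the names and the demo identity are assumptions):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Source repository with three revisions of one file
git init -q origin-repo
for i in 1 2 3; do
  echo "rev $i" > origin-repo/file.txt
  git -C origin-repo add file.txt
  git -C origin-repo -c user.name=demo -c user.email=demo@example.com commit -q -m "rev $i"
done

# Shallow clone: history truncated to the most recent commit.
# A file:// URL is used because --depth is ignored for plain local-path clones.
git clone -q --depth 1 "file://$work/origin-repo" shallow

full_count=$(git -C origin-repo rev-list --count HEAD)
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "full history: $full_count commits, shallow clone: $shallow_count commit"
```

Note that this truncates history in time only; every file of the retained revision is still fetched, so it is not a narrow clone in the sense of the question.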
Source: https://stackoverflow.com/questions/3098029/is-there-any-distributed-revision-control-system-that-supports-partial-checkout