Question
Is there a better way of getting a raw list of SHA1s for ALL objects in a repository than doing ls .git/objects/??/* and cat .git/objects/pack/*.idx | git show-index?
I know about git rev-list --all, but that only lists commit objects that are referenced by .git/refs, and I'm looking for everything, including unreferenced objects that are created by git-hash-object, git-mktree, etc.
Answer 1:
Edit: Aristotle posted an even better answer, which should be marked as correct.
Edit: the script contained a syntax error, a missing backslash at the end of the grep -v line.
Mark's answer worked for me, after a few modifications:
- Used --git-dir instead of --show-cdup to support bare repos
- Avoided an error when there are no packs
- Used perl because OS X Mountain Lion's BSD-style sed doesn't support -r
#!/bin/sh
set -e
cd "$(git rev-parse --git-dir)"
# Find all the objects that are in packs:
find objects/pack -name 'pack-*.idx' | while read -r p ; do
    git show-index < "$p" | cut -f 2 -d ' '
done
# And now find all loose objects (paths of the form objects/xx/<38 hex digits>):
find objects/ \
    | grep -E '[0-9a-f]{38}' \
    | grep -v /pack/ \
    | perl -pe 's:^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}):$1$2:'
Answer 2:
Try
git rev-list --objects --all
Edit: Josh made a good point:
git rev-list --objects -g --no-walk --all
lists objects reachable from the reflogs.
To see all objects in unreachable commits as well:
git rev-list --objects --no-walk \
$(git fsck --unreachable |
grep '^unreachable commit' |
cut -d' ' -f3)
Putting it all together, to really get all objects in the output format of rev-list --objects, you need something like
{
git rev-list --objects --all
git rev-list --objects -g --no-walk --all
git rev-list --objects --no-walk \
$(git fsck --unreachable |
grep '^unreachable commit' |
cut -d' ' -f3)
} | sort | uniq
To sort the output in a slightly more useful way (by path for trees/blobs, commits first), use an additional | sort -k2, which will group all different blobs (revisions) for identical paths.
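If only the raw hashes are wanted, as in the question, the path column that rev-list --objects prints can be dropped before de-duplicating. A minimal sketch (the function name hashes_only is made up for illustration):

```shell
#!/bin/sh
# hashes_only: hypothetical helper, not part of Git.
# Reads "hash [path]" lines on stdin, keeps only the first field,
# and prints each distinct hash once, sorted.
hashes_only() {
    cut -d' ' -f1 | sort -u
}

# Example use inside a repository:
#   git rev-list --objects --all | hashes_only
```

The same filter can be placed at the end of the combined { ...; } pipeline instead of sort | uniq.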
Answer 3:
I don't know when this option was added, but you can run
git cat-file --batch-check --batch-all-objects
This gives you, according to the man page,
all objects in the repository and any alternate object stores (not just reachable objects)
(emphasis mine).
By default this yields the object type and its size together with each hash, but you can easily remove this information, e.g. with
git cat-file --batch-check --batch-all-objects | cut -d' ' -f1
or by giving a custom format to --batch-check.
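For instance, a format of %(objectname) prints just the hash. The sketch below wraps that in a small function; the name list_all_objects and its optional directory argument are assumptions made for illustration, and it needs a Git new enough to have --batch-all-objects.

```shell
#!/bin/sh
# list_all_objects: hypothetical helper, not part of Git.
# Prints the hash of every object in the repository at $1 (default: the
# current directory), reachable or not, one per line.
list_all_objects() {
    git -C "${1:-.}" cat-file --batch-check='%(objectname)' --batch-all-objects
}
```

An object written with git hash-object -w and never referenced still shows up in this listing, which is the case the question asks about.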
Answer 4:
This is a more correct, simpler, and faster rendition of the script from the answers by Mark and by willkill.
- It uses rev-parse --git-path to find the objects directory even in a more complex Git repository setup (e.g. in a multi-worktree situation or whatnot).
- It avoids all unnecessary use of find, grep, perl, sed.
- It works gracefully even if you have no loose objects or no packs (or neither… if you're inclined to run this on a fresh repository).
It does, however, require a Bash from this millennium 😊 (2.02 or newer, specifically, for the extglob bit).
Share and enjoy.
#!/bin/bash
set -e
shopt -s nullglob extglob
cd "$(git rev-parse --git-path objects)"
# packed objects: the second field of each show-index line is the object hash
for p in pack/pack-*([0-9a-f]).idx ; do
    git show-index < "$p" | cut -f 2 -d ' '
done
# loose objects: drop the slash from xx/yyyy... to recover the full hash
for o in [0-9a-f][0-9a-f]/*([0-9a-f]) ; do
    echo "${o/\/}"
done
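The loose-object loop works because a loose object's hash is its two-character directory name followed by its file name. As a sketch of that mapping, here is the same transformation written with POSIX parameter expansion rather than the Bash-only ${o/\/} form (the function name is made up for illustration):

```shell
#!/bin/sh
# loose_path_to_hash: hypothetical helper, not part of Git.
# Turns a path relative to the objects directory, e.g. "ce/0136...",
# into the full object hash by joining the parts around the first slash.
loose_path_to_hash() {
    printf '%s%s\n' "${1%%/*}" "${1#*/}"
}

# Prints ce013625030ba8dba906f756967f9e9ca394464a
loose_path_to_hash ce/013625030ba8dba906f756967f9e9ca394464a
```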
Answer 5:
The git cat-file --batch-check --batch-all-objects command, suggested in Erki Der Loony's answer, can be made faster with the new Git 2.19 (Q3 2018) option --unordered.
The API to iterate over all objects learned to optionally list objects in the order they appear in packfiles, which helps locality of access if the caller accesses these objects while as objects are enumerated.
See commit 0889aae, commit 79ed0a5, commit 54d2f0d, commit ced9fff (14 Aug 2018), and commit 0750bb5, commit b1adb38, commit aa2f5ef, commit 736eb88, commit 8b36155, commit a7ff6f5, commit 202e7f1 (10 Aug 2018) by Jeff King (peff). (Merged by Junio C Hamano -- gitster -- in commit 0c54cda, 20 Aug 2018)
cat-file: support "unordered" output for --batch-all-objects
If you're going to access the contents of every object in a packfile, it's generally much more efficient to do so in pack order, rather than in hash order. That increases the locality of access within the packfile, which in turn is friendlier to the delta base cache, since the packfile puts related deltas next to each other. By contrast, hash order is effectively random, since the sha1 has no discernible relationship to the content.
This patch introduces an "--unordered" option to cat-file which iterates over packs in pack-order under the hood. You can see the results when dumping all of the file content:
$ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
6883195596
real 0m44.491s
user 0m42.902s
sys 0m5.230s
$ time ./git cat-file --unordered \
    --batch-all-objects --buffer --batch | wc -c
6883195596
real 0m6.075s
user 0m4.774s
sys 0m3.548s
Same output, different order, way faster. The same speed-up applies even if you end up accessing the object content in a different process, like:
git cat-file --batch-all-objects --buffer --batch-check | grep blob | git cat-file --batch='%(objectname) %(rest)' | wc -c
Adding "--unordered" to the first command drops the runtime in git.git from 24s to 3.5s.
Side note: there are actually further speedups available for doing it all in-process now. Since we are outputting the object content during the actual pack iteration, we know where to find the object and could skip the extra lookup done by oid_object_info(). This patch stops short of that optimization since the underlying API isn't ready for us to make those sorts of direct requests.
So if --unordered is so much better, why not make it the default? Two reasons:
- We've promised in the documentation that --batch-all-objects outputs in hash order. Since cat-file is plumbing, people may be relying on that default, and we can't change it.
- It's actually slower for some cases. We have to compute the pack revindex to walk in pack order. And our de-duplication step uses an oidset, rather than a sort-and-dedup, which can end up being more expensive.
If we're just accessing the type and size of each object, for example, like:
git cat-file --batch-all-objects --buffer --batch-check
my best-of-five warm cache timings go from 900ms to 1100ms using --unordered. Though it's possible in a cold-cache or under memory pressure that we could do better, since we'd have better locality within the packfile.
And one final question: why is it "--unordered" and not "--pack-order"? The answer is again two-fold:
- "pack order" isn't a well-defined thing across the whole set of objects. We're hitting loose objects, as well as objects in multiple packs, and the only ordering we're promising is within a single pack. The rest is apparently random.
- The point here is optimization. So we don't want to promise any particular ordering, but only to say that we will choose an ordering which is likely to be efficient for accessing the object content. That leaves the door open for further changes in the future without having to add another compatibility option.
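When the order of the output does not matter, for example when tallying objects by type, --unordered can simply be added to the listing command. A minimal sketch, assuming Git 2.19 or newer; the function name count_objects_by_type is hypothetical. As the timings quoted above show, --unordered is not necessarily faster for a metadata-only query like this one.

```shell
#!/bin/sh
# count_objects_by_type: hypothetical helper, not part of Git.
# Prints "<count> <type>" for every object type present in the repository
# at $1 (default: the current directory), including unreachable objects.
count_objects_by_type() {
    git -C "${1:-.}" cat-file --batch-all-objects --unordered \
        --batch-check='%(objecttype)' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```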
It is even faster in Git 2.20 (Q4 2018) with:
See commit 8c84ae6, commit 8b2f8cb, commit 9249ca2, commit 22a1646, commit bf73282 (04 Oct 2018) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 82d0a8c, 19 Oct 2018)
oidset: use khash
Reimplement oidset using khash.h in order to reduce its memory footprint and make it faster.
Performance of a command that mainly checks for duplicate objects using an oidset, with master and Clang 6.0.1:
$ cmd="./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'"
$ /usr/bin/time $cmd >/dev/null
0.22user 0.03system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 48484maxresident)k
0inputs+0outputs (0major+11204minor)pagefaults 0swaps
$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'
Time (mean ± σ): 250.0 ms ± 6.0 ms [User: 225.9 ms, System: 23.6 ms]
Range (min … max): 242.0 ms … 261.1 ms
And with this patch:
$ /usr/bin/time $cmd >/dev/null
0.14user 0.00system 0:00.15elapsed 100%CPU (0avgtext+0avgdata 41396maxresident)k
0inputs+0outputs (0major+8318minor)pagefaults 0swaps
$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'
Time (mean ± σ): 151.9 ms ± 4.9 ms [User: 130.5 ms, System: 21.2 ms]
Range (min … max): 148.2 ms … 170.4 ms
Git 2.21 (Q1 2019) optimizes further the codepath to write out commit-graph, by following the usual pattern of visiting objects in in-pack order.
See commit d7574c9 (19 Jan 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit 04d67b6, 05 Feb 2019)
Slightly optimize the "commit-graph write" step by using FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack().
Derrick Stolee did his own tests on Windows showing a 2% improvement with a high degree of accuracy.
Git 2.23 (Q3 2019) improves "git rev-list --objects", which learned the "--no-object-names" option to squelch the path to the object that is used as a grouping hint for pack-objects.
See commit 42357b4 (19 Jun 2019) by Emily Shaffer (nasamuffin).
(Merged by Junio C Hamano -- gitster -- in commit f4f7e75, 09 Jul 2019)
rev-list: teach --no-object-names to enable piping
Allow easier parsing by cat-file by giving rev-list an option to print only the OID of a non-commit object without any additional information.
This is a short-term shim; later on, rev-list should be taught how to print the types of objects it finds in a format similar to cat-file's.
Before this commit, the output from rev-list needed to be massaged before being piped to cat-file, like so:
git rev-list --objects HEAD | cut -f 1 -d ' ' | git cat-file --batch-check
This was especially unexpected when dealing with root trees, as an invisible whitespace exists at the end of the OID:
git rev-list --objects --filter=tree:1 --max-count=1 HEAD | xargs -I% echo "AA%AA"
Now, it can be piped directly, as in the added test case:
git rev-list --objects --no-object-names HEAD | git cat-file --batch-check
So that is the difference between:
vonc@vonvb:~/gits/src/git$ git rev-list --objects HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44
55d368920b2bba16689cb6d4aef2a09e8cfac8ef Documentation
9903384d43ab88f5a124bc667f8d6d3a8bce7dff Documentation/RelNotes
a63204ffe8a040479654c3e44db6c170feca2a58 Documentation/RelNotes/2.23.0.txt
And, with --no-object-names:
vonc@vonvb:~/gits/src/git$ git rev-list --objects --no-object-names HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44
55d368920b2bba16689cb6d4aef2a09e8cfac8ef
9903384d43ab88f5a124bc667f8d6d3a8bce7dff
a63204ffe8a040479654c3e44db6c170feca2a58
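Combining the two commands gives the type and size of every object reachable from any ref. A sketch, assuming Git 2.23 or newer for --no-object-names; the function name reachable_objects_with_types is hypothetical. Note that this covers reachable objects only, not the unreferenced ones the question asks about.

```shell
#!/bin/sh
# reachable_objects_with_types: hypothetical helper, not part of Git.
# Prints "<hash> <type> <size>" for each object reachable from any ref
# in the repository at $1 (default: the current directory).
reachable_objects_with_types() {
    git -C "${1:-.}" rev-list --objects --no-object-names --all |
        git -C "${1:-.}" cat-file --batch-check
}
```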
Answer 6:
I don't know of an obviously better way than just looking at all the loose object files and the indices of all pack files. The format of the git repository is very stable, and with this method you don't have to rely on having exactly the right options to git fsck, which is classed as porcelain. I think this method is faster, as well. The following script shows all the objects in a repository:
#!/bin/sh
set -e
cd "$(git rev-parse --show-cdup)"
# Find all the objects that are in packs:
for p in .git/objects/pack/pack-*.idx
do
    git show-index < "$p" | cut -f 2 -d ' '
done
# And now find all loose objects (sed -r needs GNU sed):
find .git/objects/ | grep -E '[0-9a-f]{38}' | \
    sed -r 's,^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}),\1\2,'
(My original version of this script was based on this useful script to find the largest objects in your pack files, but I switched to using git show-index, as suggested in your question.)
I've made this script into a GitHub gist.
Answer 7:
Another useful option is git verify-pack -v <packfile>.
verify-pack -v lists all objects in the given pack along with their object type; loose objects are not included.
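To cover every pack in a repository, the command can be run once per index file. A rough sketch, with the function name list_packed_objects made up for illustration; it assumes that object lines in the verify-pack -v output start with the hash followed by the type, and it filters out the summary lines on that basis.

```shell
#!/bin/sh
# list_packed_objects: hypothetical helper, not part of Git.
# Prints "<hash> <type>" for every object stored in a pack of the
# repository at $1 (default: the current directory). Loose objects
# are not listed. Runs in a subshell so the cd does not leak out.
list_packed_objects() (
    cd "${1:-.}" || exit 1
    objdir=$(git rev-parse --git-path objects) || exit 1
    for idx in "$objdir"/pack/pack-*.idx; do
        [ -e "$idx" ] || continue   # the glob matched nothing: no packs
        git verify-pack -v "$idx"
    done | awk 'length($1) >= 40 && NF >= 5 { print $1, $2 }'
)
```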
Source: https://stackoverflow.com/questions/7348698/git-how-to-list-all-objects-in-the-database