Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?
Both of the following scripts take the blob’s SHA1 as the first argument, and after it, optionally, any arguments that git log will understand. E.g. --all
to search in all branches instead of just the current one, or -g
to search in the reflog, or whatever else you fancy.
Here it is as a shell script – short and sweet, but slow:
#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=format:'%T %h %s' \
| while read tree commit subject ; do
if git ls-tree -r $tree | grep -q "$obj_name" ; then
echo $commit "$subject"
fi
done
And an optimised version in Perl, still quite short but much faster:
#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;
my $obj_name;
sub check_tree {
my ( $tree ) = @_;
my @subtree;
{
open my $ls_tree, '-|', git => 'ls-tree' => $tree
or die "Couldn't open pipe to git-ls-tree: $!\n";
while ( <$ls_tree> ) {
/\A[0-7]{6} (\S+) (\S+)/
or die "unexpected git-ls-tree output";
return 1 if $2 eq $obj_name;
push @subtree, $2 if $1 eq 'tree';
}
}
check_tree( $_ ) && return 1 for @subtree;
return;
}
memoize 'check_tree';
die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
if not @ARGV;
my $obj_short = shift @ARGV;
$obj_name = do {
local $ENV{'OBJ_NAME'} = $obj_short;
`git rev-parse --verify \$OBJ_NAME`;
} or die "Couldn't parse $obj_short: $!\n";
chomp $obj_name;
open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
or die "Couldn't open pipe to git-log: $!\n";
while ( <$log> ) {
chomp;
my ( $tree, $commit, $subject ) = split " ", $_, 3;
print "$commit $subject\n" if check_tree( $tree );
}
So... I needed to find all files over a given limit in a repo over 8GB in size, with over 108,000 revisions. I adapted Aristotle's perl script along with a ruby script I wrote to reach this complete solution.
First, git gc
- do this to ensure all objects are in packfiles - we don't scan objects not in pack files.
Next Run this script to locate all blobs over CUTOFF_SIZE bytes. Capture output to a file like "large-blobs.log"
#!/usr/bin/env ruby
require 'log4r'
# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
#
#
GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')
# 10MB cutoff
CUTOFF_SIZE=1024*1024*10
#CUTOFF_SIZE=1024
begin
include Log4r
log = Logger.new 'git-find-large-objects'
log.level = INFO
log.outputters = Outputter.stdout
git_dir = %x[ git rev-parse --show-toplevel ].chomp
if git_dir.empty?
log.fatal "ERROR: must be run in a git repository"
exit 1
end
log.debug "Git Dir: '#{git_dir}'"
pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
log.debug "Git Packs: #{pack_files.to_s}"
# For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby
#
# Short version is, git verify-pack flushes buffers only on line endings, so
# this works, if it didn't, then we could get partial lines and be sad.
types = {
:blob => 1,
:tree => 1,
:commit => 1,
}
total_count = 0
counted_objects = 0
large_objects = []
IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe|
pipe.each do |line|
# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
data = line.chomp.split(' ')
# types are blob, tree, or commit
# we ignore other lines by looking for that
next unless types[data[1].to_sym] == 1
log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
hash = {
:sha1 => data[0],
:type => data[1],
:size => data[2].to_i,
}
total_count += hash[:size]
counted_objects += 1
if hash[:size] > CUTOFF_SIZE
large_objects.push hash
end
end
end
log.info "Input complete"
log.info "Counted #{counted_objects} totalling #{total_count} bytes."
log.info "Sorting"
large_objects.sort! { |a,b| b[:size] <=> a[:size] }
log.info "Sorting complete"
large_objects.each do |obj|
log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
end
exit 0
end
Next, edit the file to remove any blobs you don't wait and the INPUT_THREAD bits at the top. once you have only lines for the sha1s you want to find, run the following script like this:
cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log
Where the git-find-blob
script is below.
#!/usr/bin/perl
# taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob
# and modified by Carl Myers <cmyers@cmyers.org> to scan multiple blobs at once
# Also, modified to keep the discovered filenames
# vi: ft=perl
use 5.008;
use strict;
use Memoize;
use Data::Dumper;
my $BLOBS = {};
MAIN: {
memoize 'check_tree';
die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
if not @ARGV;
while ( @ARGV && $ARGV[0] ne '--' ) {
my $arg = $ARGV[0];
#print "Processing argument $arg\n";
open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
my $obj_name = <$rev_parse>;
close $rev_parse or die "Couldn't expand passed blob.\n";
chomp $obj_name;
#$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
print "($arg expands to $obj_name)\n";
$BLOBS->{$obj_name} = $arg;
shift @ARGV;
}
shift @ARGV; # drop the -- if present
#print "BLOBS: " . Dumper($BLOBS) . "\n";
foreach my $blob ( keys %{$BLOBS} ) {
#print "Printing results for blob $blob:\n";
open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
or die "Couldn't open pipe to git-log: $!\n";
while ( <$log> ) {
chomp;
my ( $tree, $commit, $subject ) = split " ", $_, 3;
#print "Checking tree $tree\n";
my $results = check_tree( $tree );
#print "RESULTS: " . Dumper($results);
if (%{$results}) {
print "$commit $subject\n";
foreach my $blob ( keys %{$results} ) {
print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
}
}
}
}
}
sub check_tree {
my ( $tree ) = @_;
#print "Calculating hits for tree $tree\n";
my @subtree;
# results = { BLOB => [ FILENAME1 ] }
my $results = {};
{
open my $ls_tree, '-|', git => 'ls-tree' => $tree
or die "Couldn't open pipe to git-ls-tree: $!\n";
# example git ls-tree output:
# 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424 filaname.txt
while ( <$ls_tree> ) {
/\A[0-7]{6} (\S+) (\S+)\s+(.*)/
or die "unexpected git-ls-tree output";
#print "Scanning line '$_' tree $2 file $3\n";
foreach my $blob ( keys %{$BLOBS} ) {
if ( $2 eq $blob ) {
print "Found $blob in $tree:$3\n";
push @{$results->{$blob}}, $3;
}
}
push @subtree, [$2, $3] if $1 eq 'tree';
}
}
foreach my $st ( @subtree ) {
# $st->[0] is tree, $st->[1] is dirname
my $st_result = check_tree( $st->[0] );
foreach my $blob ( keys %{$st_result} ) {
foreach my $filename ( @{$st_result->{$blob}} ) {
my $path = $st->[1] . '/' . $filename;
#print "Generating subdir path $path\n";
push @{$results->{$blob}}, $path;
}
}
}
#print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
return $results;
}
The output will look like this:
<hash prefix> <oneline log message>
path/to/file.txt
path/to/file2.txt
...
<hash prefix2> <oneline log msg...>
And so on. Every commit which contains a large file in its tree will be listed. if you grep
out the lines that start with a tab, and uniq
that, you will have a list of all paths you can filter-branch to remove, or you can do something more complicated.
Let me reiterate: this process ran successfully, on a 10GB repo with 108,000 commits. It took much longer than I predicted when running on a large number of blobs though, over 10 hours, I will have to see if the memorize bit is working...
In addition of git describe, that I mention in my previous answer, git log
and git diff
now benefits as well from the "--find-object=<object-id>
" option to limit the findings to changes that involve the named object.
That is in Git 2.16.x/2.17 (Q1 2018)
See commit 4d8c51a, commit 5e50525, commit 15af58c, commit cf63051, commit c1ddc46, commit 929ed70 (04 Jan 2018) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit c0d75f0, 23 Jan 2018)
diffcore
: add a pickaxe option to find a specific blobSometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or this Stack Overflow question "Which commit has this blob?")
One might be tempted to extend
git-describe
to also work with blobs, such thatgit describe <blob-id>
gives a description as ':'.
This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right.
The hard part to get right is picking the correct 'commit-ish' as that could be the commit that (re-)introduced the blob or the blob that removed the blob; the blob could exist in different branches.Junio hinted at a different approach of solving this problem, which this patch implements.
Teach thediff
machinery another flag for restricting the information to what is shown.
For example:$ ./git log --oneline --find-object=v2.0.0:Makefile b2feb64 Revert the whole "ask curl-config" topic for now 47fbfde i18n: only extract comments marked with "TRANSLATORS:"
we observe that the
Makefile
as shipped with2.0
was appeared inv1.9.2-471-g47fbfded53
and inv2.0.0-rc1-5-gb2feb6430b
.
The reason why these commits both occur prior to v2.0.0 are evil merges that are not found using this new mechanism.
Unfortunately scripts were a bit slow for me, so I had to optimize a bit. Luckily I had not only the hash but also the path of a file.
git log --all --pretty=format:%H -- <path> | xargs -n1 -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
I thought this would be a generally useful thing to have, so I wrote up a little perl script to do it:
#!/usr/bin/perl -w
use strict;
my @commits;
my %trees;
my $blob;
sub blob_in_tree {
my $tree = $_[0];
if (defined $trees{$tree}) {
return $trees{$tree};
}
my $r = 0;
open(my $f, "git cat-file -p $tree|") or die $!;
while (<$f>) {
if (/^\d+ blob (\w+)/ && $1 eq $blob) {
$r = 1;
} elsif (/^\d+ tree (\w+)/) {
$r = blob_in_tree($1);
}
last if $r;
}
close($f);
$trees{$tree} = $r;
return $r;
}
sub handle_commit {
my $commit = $_[0];
open(my $f, "git cat-file commit $commit|") or die $!;
my $tree = <$f>;
die unless $tree =~ /^tree (\w+)$/;
if (blob_in_tree($1)) {
print "$commit\n";
}
while (1) {
my $parent = <$f>;
last unless $parent =~ /^parent (\w+)$/;
push @commits, $1;
}
close($f);
}
if (!@ARGV) {
print STDERR "Usage: git-find-blob blob [head ...]\n";
exit 1;
}
$blob = $ARGV[0];
if (@ARGV > 1) {
foreach (@ARGV) {
handle_commit($_);
}
} else {
handle_commit("HEAD");
}
while (@commits) {
handle_commit(pop @commits);
}
I'll put this up on github when I get home this evening.
Update: It looks like somebody already did this. That one uses the same general idea but the details are different and the implementation is much shorter. I don't know which would be faster but performance is probably not a concern here!
Update 2: For what it's worth, my implementation is orders of magnitude faster, especially for a large repository. That git ls-tree -r
really hurts.
Update 3: I should note that my performance comments above apply to the implementation I linked above in the first Update. Aristotle's implementation performs comparably to mine. More details in the comments for those who are curious.
While the original question does not ask for it, I think it is useful to also check the staging area to see if a blob is referenced. I modified the original bash script to do this and found what was referencing a corrupt blob in my repository:
#!/bin/sh
obj_name="$1"
shift
git ls-files --stage \
| if grep -q "$obj_name"; then
echo Found in staging area. Run git ls-files --stage to see.
fi
git log "$@" --pretty=format:'%T %h %s' \
| while read tree commit subject ; do
if git ls-tree -r $tree | grep -q "$obj_name" ; then
echo $commit "$subject"
fi
done