Which commit has this blob?

别那么骄傲 2020-11-22 02:45

Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

7 answers
  • 2020-11-22 03:11

    Both of the following scripts take the blob’s SHA1 as the first argument, and after it, optionally, any arguments that git log will understand. E.g. --all to search in all branches instead of just the current one, or -g to search in the reflog, or whatever else you fancy.

    Here it is as a shell script – short and sweet, but slow:

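    #!/bin/sh
    # Usage: git-find-blob <blob-sha1> [<git-log arguments ...>]
    obj_name="$1"
    shift
    # tformat: ends every line with a newline, so read also sees the last commit.
    git log "$@" --pretty=tformat:'%T %h %s' \
    | while read -r tree commit subject ; do
        # List the commit's whole tree and look for the blob's name in it.
        if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
            echo "$commit" "$subject"
        fi
    done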
    

    And an optimised version in Perl, still quite short but much faster:

    #!/usr/bin/perl
    use 5.008;
    use strict;
    use Memoize;
    
    my $obj_name;
    
    sub check_tree {
        my ( $tree ) = @_;
        my @subtree;
    
        {
            open my $ls_tree, '-|', git => 'ls-tree' => $tree
                or die "Couldn't open pipe to git-ls-tree: $!\n";
    
            while ( <$ls_tree> ) {
                /\A[0-7]{6} (\S+) (\S+)/
                    or die "unexpected git-ls-tree output";
                return 1 if $2 eq $obj_name;
                push @subtree, $2 if $1 eq 'tree';
            }
        }
    
        check_tree( $_ ) && return 1 for @subtree;
    
        return;
    }
    
    memoize 'check_tree';
    
    die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
        if not @ARGV;
    
    my $obj_short = shift @ARGV;
    $obj_name = do {
        local $ENV{'OBJ_NAME'} = $obj_short;
         `git rev-parse --verify \$OBJ_NAME`;
    } or die "Couldn't parse $obj_short: $!\n";
    chomp $obj_name;
    
    open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
        or die "Couldn't open pipe to git-log: $!\n";
    
    while ( <$log> ) {
        chomp;
        my ( $tree, $commit, $subject ) = split " ", $_, 3;
        print "$commit $subject\n" if check_tree( $tree );
    }
    
  • 2020-11-22 03:13

    So... I needed to find all files over a given size limit in a repo over 8GB in size, with over 108,000 revisions. I adapted Aristotle's Perl script and combined it with a Ruby script I wrote to reach this complete solution.

    First, run git gc to ensure all objects are in packfiles; the script below does not scan objects that are not in packfiles.

    Next, run this script to locate all blobs over CUTOFF_SIZE bytes. Capture the output to a file such as "large-blobs.log".

    #!/usr/bin/env ruby
    
    require 'log4r'
    
    # The output of git verify-pack -v is:
    # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
    #
    #
    GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')
    
    # 10MB cutoff
    CUTOFF_SIZE=1024*1024*10
    #CUTOFF_SIZE=1024
    
    begin
    
      include Log4r
      log = Logger.new 'git-find-large-objects'
      log.level = INFO
      log.outputters = Outputter.stdout
    
      git_dir = %x[ git rev-parse --show-toplevel ].chomp
    
      if git_dir.empty?
        log.fatal "ERROR: must be run in a git repository"
        exit 1
      end
    
      log.debug "Git Dir: '#{git_dir}'"
    
      pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
      log.debug "Git Packs: #{pack_files.to_s}"
    
      # For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby
      #
      # Short version is, git verify-pack flushes buffers only on line endings, so
      # this works, if it didn't, then we could get partial lines and be sad.
    
      types = {
        :blob => 1,
        :tree => 1,
        :commit => 1,
      }
    
    
      total_count = 0
      counted_objects = 0
      large_objects = []
    
      IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe|
        pipe.each do |line|
          # The output of git verify-pack -v is:
          # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
          data = line.chomp.split(' ')
          # types are blob, tree, or commit
          # we ignore other lines by looking for that
          next unless types[data[1].to_sym] == 1
          log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
          hash = {
            :sha1 => data[0],
            :type => data[1],
            :size => data[2].to_i,
          }
          total_count += hash[:size]
          counted_objects += 1
          if hash[:size] > CUTOFF_SIZE
            large_objects.push hash
          end
        end
      end
    
      log.info "Input complete"
    
      log.info "Counted #{counted_objects} totalling #{total_count} bytes."
    
      log.info "Sorting"
    
      large_objects.sort! { |a,b| b[:size] <=> a[:size] }
    
      log.info "Sorting complete"
    
      large_objects.each do |obj|
        log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
      end
    
      exit 0
    end
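
    As an alternative to parsing verify-pack output, the size filtering can be sketched in plain shell. This is a minimal sketch and not part of the original answer: it assumes the default "<sha1> <type> <size>" line format of git cat-file --batch-check, and the function name large_blobs is made up for illustration.

    ```shell
    # Hypothetical helper: reads "<sha1> <type> <size>" lines on stdin and prints
    # "<size> <sha1>" for every blob larger than the cutoff, largest first.
    large_blobs() {
        cutoff="$1"
        awk -v cutoff="$cutoff" '$2 == "blob" && $3 > cutoff { print $3, $1 }' | sort -rn
    }

    # Usage inside a repository, with the same 10MB cutoff as CUTOFF_SIZE above:
    #   git cat-file --batch-check --batch-all-objects | large_blobs 10485760
    ```

    Because --batch-all-objects also visits loose objects, this variant should not depend on running git gc first.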
    

    Next, edit the file to remove any blobs you don't want, as well as the INPUT_THREAD lines at the top. Once it contains only lines for the SHA1s you want to find, run the following command:

    cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log
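
    The manual clean-up can be partly automated. The following is a rough sketch, assuming every large-object line in the log contains "<sha1> blob <size>" as written by the Ruby script above. The name extract_sha1s is hypothetical, and the filter keeps all large blobs rather than a hand-picked subset.

    ```shell
    # Hypothetical helper: reads the Ruby script's log on stdin, drops the
    # per-object INPUT_THREAD progress lines, and prints each blob SHA-1 once.
    extract_sha1s() {
        grep -v 'INPUT_THREAD' | grep -oE '[0-9a-f]{40} blob' | cut -d' ' -f1 | sort -u
    }

    # Possible use in place of the cut call:
    #   extract_sha1s < large-blobs.log | xargs git-find-blob | tee large-file-paths.log
    ```

    Trees and commits that exceed the cutoff are skipped, since only blobs can be looked up by path.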
    

    The git-find-blob script is shown below.

    #!/usr/bin/perl
    
    # taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob
    # and modified by Carl Myers <cmyers@cmyers.org> to scan multiple blobs at once
    # Also, modified to keep the discovered filenames
    # vi: ft=perl
    
    use 5.008;
    use strict;
    use Memoize;
    use Data::Dumper;
    
    
    my $BLOBS = {};
    
    MAIN: {
    
        memoize 'check_tree';
    
        die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
            if not @ARGV;
    
    
        while ( @ARGV && $ARGV[0] ne '--' ) {
            my $arg = $ARGV[0];
            #print "Processing argument $arg\n";
            open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
            my $obj_name = <$rev_parse>;
            close $rev_parse or die "Couldn't expand passed blob.\n";
            chomp $obj_name;
            #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
            print "($arg expands to $obj_name)\n";
            $BLOBS->{$obj_name} = $arg;
            shift @ARGV;
        }
        shift @ARGV; # drop the -- if present
    
        #print "BLOBS: " . Dumper($BLOBS) . "\n";
    
        # check_tree already tests every requested blob, so a single pass over
        # the log is enough; looping per blob would repeat the same output.
        open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
            or die "Couldn't open pipe to git-log: $!\n";

        while ( <$log> ) {
            chomp;
            my ( $tree, $commit, $subject ) = split " ", $_, 3;
            my $results = check_tree( $tree );

            # Print the commit, then one tab-indented line of paths per blob found.
            if (%{$results}) {
                print "$commit $subject\n";
                foreach my $blob ( keys %{$results} ) {
                    print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
                }
            }
        }
    
    }
    
    
    sub check_tree {
        my ( $tree ) = @_;
        #print "Calculating hits for tree $tree\n";
    
        my @subtree;
    
        # results = { BLOB => [ FILENAME1 ] }
        my $results = {};
        {
            open my $ls_tree, '-|', git => 'ls-tree' => $tree
                or die "Couldn't open pipe to git-ls-tree: $!\n";
    
            # example git ls-tree output:
            # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424    filename.txt
            while ( <$ls_tree> ) {
                /\A[0-7]{6} (\S+) (\S+)\s+(.*)/
                    or die "unexpected git-ls-tree output";
                #print "Scanning line '$_' tree $2 file $3\n";
                foreach my $blob ( keys %{$BLOBS} ) {
                    if ( $2 eq $blob ) {
                        print "Found $blob in $tree:$3\n";
                        push @{$results->{$blob}}, $3;
                    }
                }
                push @subtree, [$2, $3] if $1 eq 'tree';
            }
        }
    
        foreach my $st ( @subtree ) {
            # $st->[0] is tree, $st->[1] is dirname
            my $st_result = check_tree( $st->[0] );
            foreach my $blob ( keys %{$st_result} ) {
                foreach my $filename ( @{$st_result->{$blob}} ) {
                    my $path = $st->[1] . '/' . $filename;
                    #print "Generating subdir path $path\n";
                    push @{$results->{$blob}}, $path;
                }
            }
        }
    
        #print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
        return $results;
    }
    

    The output will look like this:

    <hash prefix> <oneline log message>
        path/to/file.txt
        path/to/file2.txt
        ...
    <hash prefix2> <oneline log msg...>
    

    And so on. Every commit that contains a large file in its tree will be listed. If you extract the lines that start with a tab and run them through sort and uniq, you will have a list of all the paths you can remove with filter-branch, or you can do something more complicated.
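
    The tab-prefixed lines can be reduced to a list of distinct paths with a small filter. This is a minimal sketch, assuming the output format shown above (paths on tab-indented lines, separated by ", ") and paths that contain no commas themselves. The name unique_paths is hypothetical.

    ```shell
    # Hypothetical helper: reads git-find-blob output on stdin and prints each
    # distinct path once, in sorted order.
    unique_paths() {
        tab=$(printf '\t')
        # Keep tab-indented lines, strip the tab, split "a, b" into one path per line.
        grep "^$tab" | sed "s/^$tab//" | tr ',' '\n' | sed 's/^ //' | sort -u
    }

    # Possible use:
    #   unique_paths < large-file-paths.log > paths-to-remove.txt
    ```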

    Let me reiterate: this process ran successfully on a 10GB repo with 108,000 commits. It took much longer than I predicted when run on a large number of blobs, though: over 10 hours. I will have to check whether the memoize bit is working...

  • 2020-11-22 03:25

    In addition to git describe, which I mention in my previous answer, git log and git diff now also benefit from the "--find-object=<object-id>" option, which limits the findings to changes that involve the named object.
    This is available in Git 2.16.x/2.17 (Q1 2018).

    See commit 4d8c51a, commit 5e50525, commit 15af58c, commit cf63051, commit c1ddc46, commit 929ed70 (04 Jan 2018) by Stefan Beller (stefanbeller).
    (Merged by Junio C Hamano -- gitster -- in commit c0d75f0, 23 Jan 2018)

    diffcore: add a pickaxe option to find a specific blob

    Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or this Stack Overflow question "Which commit has this blob?")

    One might be tempted to extend git-describe to also work with blobs, such that git describe <blob-id> gives a description as '<commit-ish>:<path>'.
    This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right.
    The hard part to get right is picking the correct 'commit-ish', as that could be the commit that (re-)introduced the blob or the commit that removed the blob; the blob could exist in different branches.

    Junio hinted at a different approach of solving this problem, which this patch implements.
    Teach the diff machinery another flag for restricting the information to what is shown.
    For example:

    $ ./git log --oneline --find-object=v2.0.0:Makefile
      b2feb64 Revert the whole "ask curl-config" topic for now
      47fbfde i18n: only extract comments marked with "TRANSLATORS:"
    

    We observe that the Makefile as shipped with 2.0 appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b.
    The reason these commits both occur prior to v2.0.0 is evil merges, which are not found using this new mechanism.
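
    In practice, searching every branch for a blob could look like the following sketch. It assumes Git 2.16 or later and a blob name that git can resolve; the function name commits_touching_blob is made up for illustration.

    ```shell
    # Hypothetical wrapper: lists commits on any ref whose diff adds or removes
    # the given blob. Requires a Git version that supports --find-object.
    commits_touching_blob() {
        git log --all --oneline --find-object="$1"
    }

    # Possible use:
    #   commits_touching_blob 15d408e386400ee58e8695417fbe0f858f3ed424
    ```

    Note that this reports the commits that introduce or remove the blob, not every commit whose tree contains it.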

  • 2020-11-22 03:28

    Unfortunately, the scripts above were a bit slow for me, so I had to optimize a bit. Luckily, I had not only the hash but also the path of the file.

    git log --all --pretty=format:%H -- <path> | xargs -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
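
    The same idea can be written as a small function that is easier to read and reuse. This is a sketch under assumptions: it expects the full blob hash and a path relative to the repository root, and the name commits_with_blob_at_path is hypothetical.

    ```shell
    # Hypothetical helper: prints every commit (on any ref) that changed <path>
    # and in which <path> points at blob <hash>.
    commits_with_blob_at_path() {
        hash="$1"
        path="$2"
        git log --all --pretty=tformat:%H -- "$path" |
        while read -r commit; do
            # rev-parse resolves "<commit>:<path>" to the object stored at that path.
            found=$(git rev-parse --verify --quiet "$commit:$path" 2>/dev/null)
            if [ "$found" = "$hash" ]; then
                echo "$commit"
            fi
        done
    }

    # Possible use:
    #   commits_with_blob_at_path <hash> <path>
    ```

    Like the one-liner, this only inspects commits that changed the path, which is what keeps it fast; commits that merely carry the blob along unchanged are not listed.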
    
  • 2020-11-22 03:28

    I thought this would be a generally useful thing to have, so I wrote a small Perl script to do it:

    #!/usr/bin/perl -w
    
    use strict;
    
    my @commits;
    my %trees;
    my $blob;
    
    sub blob_in_tree {
        my $tree = $_[0];
        if (defined $trees{$tree}) {
            return $trees{$tree};
        }
        my $r = 0;
        open(my $f, "git cat-file -p $tree|") or die $!;
        while (<$f>) {
            if (/^\d+ blob (\w+)/ && $1 eq $blob) {
                $r = 1;
            } elsif (/^\d+ tree (\w+)/) {
                $r = blob_in_tree($1);
            }
            last if $r;
        }
        close($f);
        $trees{$tree} = $r;
        return $r;
    }
    
    sub handle_commit {
        my $commit = $_[0];
        open(my $f, "git cat-file commit $commit|") or die $!;
        my $tree = <$f>;
        die unless $tree =~ /^tree (\w+)$/;
        if (blob_in_tree($1)) {
            print "$commit\n";
        }
        while (1) {
            my $parent = <$f>;
            last unless $parent =~ /^parent (\w+)$/;
            push @commits, $1;
        }
        close($f);
    }
    
    if (!@ARGV) {
        print STDERR "Usage: git-find-blob blob [head ...]\n";
        exit 1;
    }
    
    # Take the blob off @ARGV so that only heads remain to be walked.
    $blob = shift @ARGV;
    if (@ARGV) {
        foreach (@ARGV) {
            handle_commit($_);
        }
    } else {
        handle_commit("HEAD");
    }
    while (@commits) {
        handle_commit(pop @commits);
    }
    

    I'll put this up on github when I get home this evening.

    Update: It looks like somebody already did this. That one uses the same general idea but the details are different and the implementation is much shorter. I don't know which would be faster but performance is probably not a concern here!

    Update 2: For what it's worth, my implementation is orders of magnitude faster, especially for a large repository. That git ls-tree -r really hurts.

    Update 3: I should note that my performance comments above apply to the implementation I linked above in the first Update. Aristotle's implementation performs comparably to mine. More details in the comments for those who are curious.

  • 2020-11-22 03:29

    While the original question does not ask for it, I think it is useful to also check the staging area to see if a blob is referenced. I modified the original shell script to do this and found what was referencing a corrupt blob in my repository:

    #!/bin/sh
    obj_name="$1"
    shift
    git ls-files --stage \
    | if grep -q "$obj_name"; then
        echo Found in staging area. Run git ls-files --stage to see.
    fi
    
    # tformat: ends every line with a newline, so read also sees the last commit.
    git log "$@" --pretty=tformat:'%T %h %s' \
    | while read -r tree commit subject ; do
        if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
            echo "$commit" "$subject"
        fi
    done
    