List last commit dates for a large number of files, quickly

攒了一身酷 2021-01-04 03:10

I would like to list the last commit date for a large number of files in a git repository.

For the sake of concreteness, let us assume that I want t

4 Answers
  • 2021-01-04 03:30

    I'm somewhat late to the party here, but here's a little Bash script that uses the invocation in OP's #2, and does the postprocessing in awk. (For my use, I didn't need to see files that had gotten deleted as of the current date, so there's the existence check too.)

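    #!/bin/bash
    # Stream 1: tracked files, prefixed with "+ ".
    # Stream 2: the log, newest commit first, with dates prefixed by "~ ".
    (
        git ls-files | sed 's/^/+ /'
        git log --format=format:"~ %aI" --name-only .
    ) | gawk '
    /^~ / { date = $2; next }                    # commit date line
    /^\+ / { extant[substr($0, 3)] = 1; next }   # file that still exists
    NF && !($0 in dates) { dates[$0] = date }    # keep only the newest date per file
    END { for (file in dates) if (file in extant) print dates[file], file }
    ' | sort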
    
  • 2021-01-04 03:33

    Try this.

    In git, each commit references a tree object which has pointers to the state of each file (the files being blob objects).

    So, what you want to do is write a program that starts with the set of all the files you're interested in and begins at the HEAD commit (SHA1 obtained via git rev-parse HEAD). It checks whether any of the "files of interest" are modified in that commit's tree (the tree is named by the "tree" field of git cat-file commit [SHA1]); note that you'll have to descend into the subtrees for each directory. If a file is modified (meaning its SHA1 hash differs from the one it had in the "previous" revision), the program removes it from the interest set and prints the appropriate information. Then it continues to each parent of the current commit. This continues until the set of interest is empty.

    If you want the maximal speed, you'll use the git C API. If you don't want that much speed, you can use git cat-file tree [SHA1 hash] (or, easier, git ls-tree [SHA1 hash] [files]), which is going to perform the absolute minimal amount of work to read a particular tree object (it's part of the plumbing layer).

    It's questionable how well this will continue to work in the future. If forward compatibility is a bigger concern, you can move up a level from git cat-file; as you already discovered, though, git log is comparatively slow because it's part of the porcelain, not the plumbing.
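
    The "stop once the set of interest is empty" part of this walk can be sketched in Bash. This is only a rough sketch, not the tree-by-tree comparison described above: it assumes the paths changed by each commit arrive on stdin, newest commit first, as a "~ <date>" header line followed by one path per line. The function name last_dates and that stream format are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "<date> <path>" for each path of interest.
# Input (assumed): newest-first stream of "~ <date>" header lines, each
# followed by the paths changed in that commit, one per line.
last_dates() {
    awk '
    BEGIN {
        # Build the set of interest from the command-line arguments.
        for (i = 1; i < ARGC; i++)
            if (!(ARGV[i] in want)) { want[ARGV[i]] = 1; remaining++ }
        ARGC = 1                       # read the stream from stdin only
        if (remaining == 0) exit       # nothing to look for
    }
    /^~ / { date = $2; next }          # commit header: remember its date
    ($0 in want) {                     # first (newest) hit for this path
        print date, $0
        delete want[$0]
        if (--remaining == 0) exit     # set of interest is empty: stop early
    }
    ' "$@"
}
```

    One possible way to feed it is git log --format='~ %aI' --name-only | last_dates README.md src/main.c (both paths are placeholders). Because awk exits as soon as every path is resolved, the producer is cut off by the closed pipe rather than walking the whole history. A plumbing-based producer could be swapped in, provided it emits the same shape of stream.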

    See here for a pretty good resource on how git's object model works.

  • 2021-01-04 03:39

    Here is a PowerShell function:

    function Get-GitRevisionDates($Path='.', $Ext='.md')
    {
        [array] $log = git --no-pager log --format=format:%ai --name-only $Path
    
        $date_re = "^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d .\d{4}$"
        [array] $dates = $log | Select-String $date_re | select LineNumber, Line
    
        $files = $log -notmatch $date_re | ? { $_.EndsWith($Ext) } | sort -unique
    
        $res = @()
        foreach ($file in $files) {
            $iFile = $log.IndexOf($file) + 1
            $fDate = $dates | ? LineNumber -lt $iFile | select -Last 1
            $res += [PSCustomObject]@{ File = $file; Date = $fDate.Line }
        }
    
        $res | sort Date -Desc
    }
    
  • 2021-01-04 03:53

    I also think your solution #2 is the fastest; you can find several scripts that use this method to set access times. One way to avoid printing older timestamps for a file is to record which files have already been printed, e.g. in a hash.

    I wrote a Perl script to modify access times; after some modifications, this version should print what you're after:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $commit = $ARGV[0];

    $commit = 'HEAD' unless $commit;

    # get a list of access times and files
    my @logbook = `git whatchanged --pretty=%ai $commit`;

    my %seen;
    my $timestamp;
    my $filename;
    foreach (@logbook) {
        next if /^$/; # skip empty lines
        if (/^:/) {
            next unless /\.txt$/; # only report .txt files
            chomp ($filename = (split /\t/)[1]);
            next if $seen{$filename}; # a newer commit already reported this file
            print "$timestamp $filename\n";
            $seen{$filename} = 1;
        } else {
            chomp ($timestamp = $_);
        }
    }
    

    I used git whatchanged instead of git log to have a convenient format with non-time lines beginning with :, so I can easily separate the lines with files from the last modification times.
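
    The same "first time seen wins" idea can be sketched as a small Bash/awk filter. This is an illustrative sketch rather than a drop-in replacement for the Perl script: the function name newest_from_raw is hypothetical, it does not filter on .txt, and it assumes its input looks like the output of git whatchanged --pretty=%ai (timestamp lines, then raw diff lines starting with : whose path follows a tab).

```shell
#!/bin/bash
# Hypothetical helper: read raw-format log text on stdin and print the
# newest "<timestamp> <path>" line for each path.
newest_from_raw() {
    awk -F '\t' '
    /^$/ { next }                      # skip blank separator lines
    /^:/ {                             # raw diff line: metadata<TAB>path
        path = $2
        if (!(path in seen)) {         # only the first (newest) hit counts
            seen[path] = 1
            print stamp, path
        }
        next
    }
    { stamp = $0 }                     # anything else is a timestamp line
    '
}
```

    Git's documentation steers new users toward git log rather than git whatchanged, so git log --raw --no-merges --pretty=%ai HEAD | newest_from_raw should be a comparable way to drive it.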
