This is sort of a follow-up to this question.
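If there are multiple blobs with the same contents, they are only stored once in the git repository because their SHA-1s are identical. That means any SHA-1 that git ls-tree lists under more than one path marks a set of duplicate files, which is what the script below reports.
Running it on the codebase I work on was an eye-opener, I can tell you!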
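#!/usr/bin/perl
# usage: git ls-tree -r HEAD | $PROGRAM_NAME
use strict;
use warnings;

# Map each blob SHA-1 to the list of paths that share it.
my $sha1_path = {};

while (my $line = <STDIN>) {
    chomp $line;
    # ls-tree lines look like: <mode> SP <type> SP <sha1> TAB <path>
    # The path may contain spaces, so capture everything after the tab.
    if ($line =~ m{ \A \d+ \s+ \w+ \s+ (\w+) \t (.+) \z }xms) {
        my $sha1 = $1;
        my $path = $2;
        push @{$sha1_path->{$sha1}}, $path;
    }
}

# Print every group of paths whose SHA-1 occurs more than once.
foreach my $sha1 (keys %$sha1_path) {
    if (scalar @{$sha1_path->{$sha1}} > 1) {
        foreach my $path (@{$sha1_path->{$sha1}}) {
            print "$sha1 $path\n";
        }
        print '-' x 40, "\n";
    }
}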
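
To see why identical contents always end up as one blob, here is a minimal sketch of how a blob ID is derived, assuming the default SHA-1 object format: git hashes a "blob <size>\0" header followed by the file's bytes, so the path never enters into it. The helper name blob_id and the sample contents are my own for illustration; they are not part of git or of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: compute the ID git would assign to a blob.
# Assumes $content is a byte string (no wide characters), so that
# length() returns the size in bytes.
sub blob_id {
    my ($content) = @_;
    return sha1_hex('blob ' . length($content) . "\0" . $content);
}

# Made-up file contents: two paths with the same bytes, one that differs.
my %files = (
    'a.txt' => "hello\n",
    'b.txt' => "hello\n",
    'c.txt' => "hello!\n",
);

# a.txt and b.txt print the same ID; c.txt prints a different one.
foreach my $path (sort keys %files) {
    print blob_id($files{$path}), " $path\n";
}
```

Inside a repository, git hash-object <file> should print the same IDs for files with those contents, which is why the duplicate finder only needs to compare the SHA-1 column.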