Git is really slow for 100,000 objects. Any fixes?

忘了有多久 2020-11-27 13:23

I have a "fresh" git-svn repo (11.13 GB) that has over 100,000 objects in it.

I have performed

git fsck
git gc

on the repo afte

11 answers
  • 2020-11-27 13:40

    You also might try git repack

  • 2020-11-27 13:46

    git status should be quicker in Git 2.13 (Q2 2017), because of:

    • an optimization around arrays of strings (see "ways to improve git status performance")
    • a better "read cache" management.

    On that last point, see commit a33fc72 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
    (Merged by Junio C Hamano -- gitster -- in commit cdfe138, 24 Apr 2017)

    read-cache: force_verify_index_checksum

    Teach git to skip verification of the SHA-1 checksum at the end of the index file in verify_hdr() which is called from read_index() unless the "force_verify_index_checksum" global variable is set.

    Teach fsck to force this verification.

    The checksum verification is for detecting disk corruption, and for small projects, the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command.


    Git 2.14 again improves git status performance by taking better account of the "untracked cache", which allows Git to skip reading untracked directories if their stat data has not changed, using the mtime field of the stat structure.

    See the Documentation/technical/index-format.txt for more on untracked cache.
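
As a rough illustration of the untracked cache described above, here is a minimal sketch that turns it on for a single repository. It assumes a Git version that knows the `core.untrackedCache` setting; the temporary demo repository and the variable names are just placeholders for this example.

```shell
# Sketch: enable the untracked cache in a throwaway repository.
# Assumption: the installed Git supports core.untrackedCache.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Optional filesystem check (takes several seconds), left commented out here:
#   git update-index --test-untracked-cache

git config core.untrackedCache true
git status >/dev/null    # a later status run can then reuse cached directory data
untracked_cache=$(git config --get core.untrackedCache)
echo "core.untrackedCache=$untracked_cache"
```

On a real repository you would run only the `git config` line (and, if in doubt about the filesystem, the commented-out check) from inside the working tree.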

    See commit edf3b90 (08 May 2017) by David Turner (dturner-tw).
    (Merged by Junio C Hamano -- gitster -- in commit fa0624f, 30 May 2017)

    When "git checkout", "git merge", etc. manipulates the in-core index, various pieces of information in the index extensions are discarded from the original state, as it is usually not the case that they are kept up-to-date and in-sync with the operation on the main index.

    The untracked cache extension is copied across these operations now, which would speed up "git status" (as long as the cache is properly invalidated).


    More generally, writing to the cache will also be quicker with Git 2.14.x/2.15.

    See commit ce012de, commit b50386c, commit 3921a0b (21 Aug 2017) by Kevin Willford.
    (Merged by Junio C Hamano -- gitster -- in commit 030faf2, 27 Aug 2017)

    We used to spend more than necessary cycles allocating and freeing piece of memory while writing each index entry out.
    This has been optimized.

    [That] would save anywhere between 3-7% when the index had over a million entries with no performance degradation on small repos.


    Update Dec. 2017: Git 2.16 (Q1 2018) will propose an additional enhancement, this time for git log, since the code to iterate over loose object files just got optimized.

    See commit 163ee5e (04 Dec 2017) by Derrick Stolee (derrickstolee).
    (Merged by Junio C Hamano -- gitster -- in commit 97e1f85, 13 Dec 2017)

    sha1_file: use strbuf_add() instead of strbuf_addf()

    Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex-values of the string before consuming the path, we can prevent extra computation by using the lower-level method.

    One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID (object identifiers) abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects.

    Most repositories do not have many loose objects before repacking, but in the GVFS case (see "Announcing GVFS (Git Virtual File System)") the repos can grow to have millions of loose objects.
    Profiling 'git log' performance in Git For Windows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf().

    Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading.
    By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager.

    For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:

     HEAD~1            HEAD
    ----------------------------------------
     7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%
    

    Update March 2018: Git 2.17 will improve git status some more: see this answer.


    Update: Git 2.20 (Q4 2018) adds Index Entry Offset Table (IEOT), which allows for git status to load the index faster.

    See commit 77ff112, commit 3255089, commit abb4bb8, commit c780b9c, commit 3b1d9e0, commit 371ed0d (10 Oct 2018) by Ben Peart (benpeart).
    See commit 252d079 (26 Sep 2018) by Nguyễn Thái Ngọc Duy (pclouds).
    (Merged by Junio C Hamano -- gitster -- in commit e27bfaa, 19 Oct 2018)

    read-cache: load cache entries on worker threads

    This patch helps address the CPU cost of loading the index by utilizing the Index Entry Offset Table (IEOT) to divide loading and conversion of the cache entries across multiple threads in parallel.

    I used p0002-read-cache.sh to generate some performance data:

    Test w/100,000 files reduced the time by 32.24%
    Test w/1,000,000 files reduced the time by -4.77%
    

    Note that on the 1,000,000 files case, multi-threading the cache entry parsing does not yield a performance win. This is because the cost to parse the index extensions in this repo, far outweigh the cost of loading the cache entries.

    That allows for:

    config: add new index.threads config setting

    Add support for a new index.threads config setting which will be used to control the threading code in do_read_index().

    • A value of 0 will tell the index code to automatically determine the correct number of threads to use.
    • A value of 1 will make the code single threaded.
    • A value greater than 1 will set the maximum number of threads to use.

    For testing purposes, this setting can be overwritten by setting the GIT_TEST_INDEX_THREADS=<n> environment variable to a value greater than 0.
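
A minimal sketch of the `index.threads` values listed above, set in a throwaway repository. The setting only has an effect on Git 2.20 or later; the temporary repository and variable names are placeholders for this example.

```shell
# Sketch: set index.threads in a throwaway repository.
# Assumption: Git 2.20+ is needed for the setting to change index loading.
repo=$(mktemp -d)
git init -q "$repo"

git -C "$repo" config index.threads 0     # 0 = let Git pick the thread count
auto_threads=$(git -C "$repo" config --get index.threads)

git -C "$repo" config index.threads 1     # 1 = single-threaded index loading
single_thread=$(git -C "$repo" config --get index.threads)

echo "auto=$auto_threads single=$single_thread"
```

Comparing `time git status` under both values on your own repository is probably the simplest way to see whether threading helps there.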


    Git 2.21 (Q1 2019) introduces a further improvement: the loose object cache, used to optimize existence look-ups, has been updated.

    See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
    (Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

    object-store: use one oid_array per subdirectory for loose cache

    The loose objects cache is filled one subdirectory at a time as needed.
    It is stored in an oid_array, which has to be resorted after each add operation.
    So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

    Use one oid_array for each subdirectory.
    This ensures that entries have to only be sorted a single time.
    It also avoids eight binary search steps for each cache lookup as a small bonus.

    The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

    $ git count-objects
    26733 objects, 68808 kilobytes
    
    Test                        HEAD^             HEAD
    --------------------------------------------------------------------
    4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
    4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
    4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
    4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
    4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
    4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
    4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
    

    Before Git 2.26 (Q1 2020), the object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.

    There are, however, some cases where they can work together, and Git 2.26 teaches them to do so.

    See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
    See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
    (Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)

    pack-bitmap: implement BLOB_NONE filtering

    Signed-off-by: Jeff King

    We can easily support BLOB_NONE filters with bitmaps.
    Since we know the types of all of the objects, we just need to clear the result bits of any blobs.

    Note two subtleties in the implementation (which I also called out in comments):

    • we have to include any blobs that were specifically asked for (and not reached through graph traversal) to match the non-bitmap version
    • we have to handle in-pack and "ext_index" objects separately.
      Arguably prepare_bitmap_walk() could be adding these ext_index objects to the type bitmaps.
      But it doesn't for now, so let's match the rest of the bitmap code here (it probably wouldn't be an efficiency improvement to do so since the cost of extending those bitmaps is about the same as our loop here, but it might make the code a bit simpler).

    Here are perf results for the new test on git.git:

    Test                                    HEAD^             HEAD
    --------------------------------------------------------------------------------
    5310.9: rev-list count with blob:none   1.67(1.62+0.05)   0.22(0.21+0.02) -86.8%
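
To try the bitmap plus `blob:none` combination yourself, something like the following sketch should work. It assumes Git 2.26 or later (for `--count` with `--objects` and for bitmap-aware filtering); the tiny demo repository, file names and commit identity are made up for the example.

```shell
# Sketch: count objects with and without blobs, using a reachability bitmap.
# Assumption: Git 2.26+; the demo repo and its contents are placeholders.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo one > a.txt
echo two > b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "demo commit"

git repack -a -d -b -q    # -b writes a bitmap index next to the single pack

all_objects=$(git rev-list --count --objects --use-bitmap-index HEAD)
no_blobs=$(git rev-list --count --objects --use-bitmap-index --filter=blob:none HEAD)
echo "all=$all_objects without-blobs=$no_blobs"
```

In this demo the first count covers the commit, its tree and the two blobs, while the filtered count leaves the blobs out.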
    

    To know more about oid_array, consider Git 2.27 (Q2 2020).

    See commit 0740d0a, commit c79eddf, commit 7383b25, commit ed4b804, commit fe299ec, commit eccce52, commit 600bee4 (30 Mar 2020) by Jeff King (peff).
    (Merged by Junio C Hamano -- gitster -- in commit a768f86, 22 Apr 2020)

    oid_array: use size_t for count and allocation

    Signed-off-by: Jeff King

    The oid_array object uses an "int" to store the number of items and the allocated size.

    It's rather unlikely for somebody to have more than 2^31 objects in a repository (the sha1's alone would be 40GB!), but if they do, we'd overflow our alloc variable.

    You can reproduce this case with something like:

    git init repo
    cd repo
    
    # make a pack with 2^24 objects
    perl -e '
      my $nr = 2**24;
    
      for (my $i = 0; $i < $nr; $i++) {
        print "blob\n";
        print "data 4\n";
        print pack("N", $i);
      }
    ' | git fast-import
    
    # now make 256 copies of it; most of these objects will be duplicates,
    # but oid_array doesn't de-dup until all values are read and it can
    # sort the result.
    cd .git/objects/pack/
    pack=$(echo *.pack)
    idx=$(echo *.idx)
    for i in $(seq 0 255); do
      # no need to waste disk space
      ln "$pack" "pack-extra-$i.pack"
      ln "$idx" "pack-extra-$i.idx"
    done
    
    # and now force an oid_array to store all of it
    git cat-file --batch-all-objects --batch-check
    

    which results in:

    fatal: size_t overflow: 32 * 18446744071562067968
    

    So the good news is that st_mult() sees the problem (the large number is because our int wraps negative, and then that gets cast to a size_t), doing the job it was meant to: bailing in crazy situations rather than causing an undersized buffer.

    But we should avoid hitting this case at all, and instead limit ourselves based on what malloc() is willing to give us.
    We can easily do that by switching to size_t.

    The cat-file process above made it to ~120GB virtual set size before the integer overflow (our internal hash storage is 32-bytes now in preparation for sha256, so we'd expect ~128GB total needed, plus potentially more to copy from one realloc'd block to another).
    After this patch (and about 130GB of RAM+swap), it does eventually read in the whole set. No test for obvious reasons.

    Note that this object was defined in sha1-array.c, which has been renamed oid-array.c: a more neutral name, considering Git will eventually transition from SHA-1 to SHA-2.

  • 2020-11-27 13:51

    One longer-term solution is to augment git to cache filesystem status internally.

    Karsten Blees has done so for msysgit, which dramatically improves performance on Windows. In my experiments, his change has taken the time for "git status" from 25 seconds to 1-2 seconds on my Win7 machine running in a VM.

    Karsten's changes: https://github.com/msysgit/git/pull/94

    Discussion of the caching approach: https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion

  • 2020-11-27 13:52

    In general my Mac is OK with Git, but if there are a lot of loose objects it gets much slower. It seems HFS is not so good with lots of files in a single directory.

    git repack -ad
    

    Followed by

    git gc --prune=now
    

    This will make a single pack file and remove any loose objects left over. These commands can take some time to run.
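
To check whether loose objects are actually the problem, `git count-objects -v` reports the loose object count before and after packing. A minimal sketch on a throwaway repository (the demo files and commit identity are placeholders):

```shell
# Sketch: compare loose-object counts before and after repacking.
# The demo repo stands in for a real one; on a real repo skip the setup lines.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
for i in 1 2 3; do
  echo "content $i" > "file$i.txt"
done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "demo commit"

# "count:" is the number of loose objects, "packs:" the number of pack files
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git repack -a -d -q
git gc --prune=now --quiet
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs_after=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose before=$loose_before after=$loose_after packs=$packs_after"
```

If the "count:" figure is already small on your repository, repacking is unlikely to change much and the slowdown probably lies elsewhere.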

  • 2020-11-27 13:53

    You could try passing the --aggressive switch to git gc and see if that helps:

    # this will take a while ...
    git gc --aggressive
    

    Also, you could use git filter-branch to delete old commits and/or files if you have things which you don't need in your history (e.g., old binary files).

  • 2020-11-27 13:54

    For what it's worth, I recently found a large discrepancy in how long the git status command takes between my master and dev branches.

    To cut a long story short, I tracked down the problem to a single 280MB file in the project root directory. It was an accidental checkin of a database dump so it was fine to delete it.

    Here's the before and after:

    ⚡ time git status
    # On branch master
    nothing to commit (working directory clean)
    git status  1.35s user 0.25s system 98% cpu 1.615 total
    
    ⚡ rm savedev.sql
    
    ⚡ time git status
    # On branch master
    # Changes not staged for commit:
    #   (use "git add/rm <file>..." to update what will be committed)
    #   (use "git checkout -- <file>..." to discard changes in working directory)
    #
    #   deleted:    savedev.sql
    #
    no changes added to commit (use "git add" and/or "git commit -a")
    git status  0.07s user 0.08s system 98% cpu 0.157 total
    

    I have 105,000 objects in store, but it appears that large files are more of a menace than many small files.
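
If you suspect something similar, one way to spot oversized tracked files is to sort the output of `git ls-tree -l` by size. This is only a sketch: the demo repository and the file names (`big.bin` stands in for an accidental dump) are made up for the example.

```shell
# Sketch: find the largest file tracked in HEAD.
# The demo repo and file names are placeholders, not from the original post.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'small\n' > small.txt
head -c 200000 /dev/zero > big.bin
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "demo commit"

# git ls-tree -l prints: <mode> <type> <object> <size>TAB<path>
# Sort numerically on the size column and keep the path of the last entry.
largest=$(git ls-tree -r -l HEAD | sort -k4,4n | tail -n 1 | cut -f2)
echo "largest tracked file: $largest"
```

On a real repository, replacing `tail -n 1 | cut -f2` with `tail -n 20` shows the twenty largest entries together with their sizes.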
