How do I reduce the size of a bloated Git repo by non-interactively squashing all commits except for the most recent ones?

问题

My Git repo has hundreds of gigabytes of data, say, database backups, so I'm trying to remove old, outdated backups, because they're making everything larger and slower. So I naturally need something that's fast; the faster, the better.

How do I squash (or just plain remove) all commits except for the most recent ones, and do so without having to manually squash each one in an interactive rebase? Specifically, I don't want to have to use

git rebase -i --root

For example, I have these commits:

A .. B .. C ... ... H .. I .. J .. K .. L

What I want is this (squashing everything in between A and H into A):

A .. H .. I .. J .. K .. L

Or even this would work fine:

H .. I .. J .. K .. L

There is an answer on how to squash all commits, but I want to keep some of the more recent commits. I don't want to squash the most recent commits either. (Especially I need to keep the first two commits counting from the top.)

(Edit, several years later. The right answer to this question is to use the right tool for the job. Git is not a very good tool to store backups, no matter how convenient it is. There are better tools.)

回答1:

The original poster comments:

if we take a snapshot of a commit 10004, remove all commits before it, and make commit 10004 a root commit, I'll be just fine

One way to do this is here, assuming your current work is called branchname. I like to use a temp tag whenever I do a large rebase to double-check that there were no changes and to mark a point I can reset back to if something goes wrong (not sure if this is standard procedure or not but it works for me):

git tag temp

git checkout 10004
git checkout --orphan new_root
git commit -m "set new root 10004"

git rebase --onto new_root 10004 branchname

git diff temp   # verification that it worked with no changes
git tag -d temp
git branch -D new_root

To get rid of the old branch you'll need to delete all tags and branch tags on it; then

git prune
git gc

will clean it from your repo.

Note that you'll temporarily have two copies of everything, until you have gc'd, but that is unavoidable; even if you do a standard squash and rebase you still have two copies of everything until the rebase finishes.

回答2:

Fastest counting implementation time is almost certainly going to be with grafts and a filter-branch, though you might be able to get faster execution with a handrolled commit-tree sequence working off rev-list output.

Rebase is built to apply changes on different content. What you're doing here is preserving contents and intentionally losing the change history that produced them, so pretty much all of rebase's most tedious and slow work is wasted.

The payload here is, working from your picture,

echo `git rev-parse H; git rev-parse A` > .git/info/grafts  
git filter-branch -- --all

_{Documentation for git rev-parse and git filter-branch.}

Filter-branch is very careful to be recoverable after a failure at any point, which is certainly safest .... but it's only really helpful when recovery by simply redoing it wouldn't be faster and easier if things go south on you. Failures being rare and restarts usually being cheap, the thing to do is to do an un"safe" but very fast operation that is all but certain to work. For that, the best option here is to do it on a tmpfs (the closest equivalent I know on Windows would be a ramdisk like ImDisk), which will be blazing fast and won't touch your main repo until you're sure you've got the results you want.

So on Windows, say T:\wip is on a ramdisk, and note that the clone here copies nothing. As well as reading the docs on git clone's --shared option, do examine the clone's innards to see the real effect, it's very straightforward.

# switch to a lightweight wip clone on a tmpfs
git clone --shared --no-checkout . /t/wip/filterwork
cd !$

# graft out the unwanted commits
echo `git rev-parse $L; git rev-parse $A` >.git/info/grafts
git filter-branch -- --all

# check that the repo history looks right
git log --graph --decorate --oneline --all

# all done with the splicing, filter-branch has integrated it
rm .git/info/grafts

# push the rewritten histories back
git push origin --all --force

There are enough possible variations on what you might be wanting to do and what might be in your repo that almost any of the options on these commands might be useful. The above is tested and will do what it says it does, but that might not be exactly what you want.

回答3:

An XY Problem

Note that the original poster has an XY problem, where he's trying to figure out how to squash his older commits (the Y problem), when his real problem is actually trying to reduce the size of his Git repository (the X problem), as I've mentioned in the comments:

Having a lot of commits won't necessarily bloat the size of your Git repo. Git is very efficient at compressing text-based files. Are you sure that the number of commits is the actual problem that leads to your large repo size? A more likely candidate is that you have too many binary assets versioned, which Git doesn't compress as well (or at all) compared to plain text files.

Despite this, for the sake of completeness, I will also add an alternative solution to Matt McNabb's answer to the Y problem.

Squashing (Hundreds or Thousands) of Old Commits

As the original poster has already noted, using an interactive rebase with the --root flag can be impractical when there are many commits (numbering in the hundreds or thousands), particularly since the interactive rebase won't run efficiently on such a large number of them.

As Matt McNabb pointed out in his answer, one solution is to use an orphan branch as a new (squashed) root, then to rebase on top of that. Another solution is to use a couple of various resets of the branch to achieve the same effect:

# Save the current state of the branch in a couple of other branches
git branch beforeReset
git branch verification

# Also mark where we want to start squashing commits
git branch oldBase <most_recent_commit_to_squash>

# Temporarily remove the most recent commits from the current branch,
# because we don't want to squash those:
git reset --hard oldBase

# Using a soft reset to the root commit will keep all of the changes
# staged in the index, so you just need to amend those changes to the
# root commit:
git reset --soft <root_commit>
git commit --amend

# Rebase onto the new amended root,
# starting from oldBase and going up to beforeReset
git rebase --onto master oldBase beforeReset

# Switch back to master and (fast-forward) merge it with beforeReset
git checkout master
git merge beforeReset

# Verify that master still contains the same state as before all of the resets
git diff verification

# Cleanup
git branch -D beforeReset oldBase verification

# As part of cleanup, since the original poster mentioned that
# he has a lot of commits that he wants to remove to reduce
# the size of his repo, garbage collect the old, dangling commits too
git gc --prune=all

The --prune=all option to git gc will ensure that all dangling commits are garbage collected, not only just the ones that are older than 2 weeks, which is the default setting for git gc.

来源：https://stackoverflow.com/questions/24153548/how-do-i-reduce-the-size-of-a-bloated-git-repo-by-non-interactively-squashing-al

标签

git

rebase

git-rebase

git-rewrite-history