How to filter history based on gitignore?

逝去的感伤 2020-12-06 07:37

To be clear on this question, I am not asking about how to remove a single file from history, like this question: Completely remove file from all Git repository com

3 Answers
  • 2020-12-06 08:12

    Achieving the result you want is a bit tricky. The simplest way, using git filter-branch with a --tree-filter, will be very slow. Edit: I've modified your example script to do this; see the end of this answer.

    First, let's note one constraint: you can never change any existing commit. All you can do is make new commits that look a lot like the old ones, but "new and improved". You then direct Git to stop looking at the old commits, and look only at the new ones. This is what we will do here. (Then, if required, you can force Git to really forget the old commits. The easiest way is to re-clone the clone.)
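
A small sketch of that constraint, run in a throwaway repository (the file names and the demo identity are made up for illustration): dropping a file and amending does not modify the commit. It creates a new commit with a different hash, and the old object stays in the object database until it is pruned.

```shell
#!/bin/sh
# Sketch: "changing" a commit really means creating a new one.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Helper that supplies a throwaway identity (hypothetical values).
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo code > main.c
echo noise > build.log
git add .
git_c commit -q -m "initial"
old=$(git rev-parse HEAD)

# Drop build.log from the index and rewrite the commit.
git rm -q --cached build.log
git_c commit -q --amend -m "initial"
new=$(git rev-parse HEAD)

# The original commit object is still present; only the branch moved.
old_type=$(git cat-file -t "$old")
echo "old=$old new=$new old_type=$old_type"
```

git filter-branch does the same thing for every commit it visits, which is why every rewritten commit ends up with a new hash.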

    Now, to re-commit every commit that is reachable from one or more branch and/or tag names, preserving everything except that which we explicitly tell it to change,[1] we can use git filter-branch. The filter-branch command has a rather dizzying array of filtering options, most of which are meant to make it go faster, because copying every commit is pretty slow. If there are just a few hundred commits in a repository, with a few dozen or a few hundred files each, it's not so bad; but if there are about 100k commits holding about 100k files each, that's ten thousand million files (10,000,000,000 files) to examine and re-commit. It is going to take a while.

    Unfortunately there is no easy and convenient way to speed this up. The best way to speed it up would be to use an --index-filter, but there is no built in index filter command that will do what you want. The easiest filter to use is --tree-filter, which is also the slowest one there is. You might want to experiment with writing your own index filter, perhaps in shell script or perhaps in another language you prefer (you will still need to invoke git update-index either way).
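
As a rough sketch of what such a hand-written index filter might look like: list the cached paths that match a fixed pattern file and feed them to git update-index. The pattern file, repository contents, and paths below are invented for the demo, and the sketch assumes one master pattern list rather than per-commit .gitignore files.

```shell
#!/bin/sh
# Sketch: remove index entries matching a pattern list, with no checkout needed.
set -eu

work=$(mktemp -d)
patterns="$work/patterns"            # hypothetical master ignore list
printf '*.log\n' > "$patterns"

git init -q "$work/repo"
cd "$work/repo"
echo keep > keep.txt
echo noise > build.log
git add keep.txt build.log           # no ignore rules yet, so both get staged

# The filter body: NUL-delimited list of cached paths matching the patterns,
# piped to update-index, which drops them from the index only.
git ls-files -z --cached --ignored --exclude-from="$patterns" |
    git update-index -z --force-remove --stdin

remaining=$(git ls-files)
echo "$remaining"
```

The same pipeline could be passed as the --index-filter argument. Because update-index reads the list from standard input, an empty list is harmless.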


    [1] Signed annotated tags cannot be preserved intact, so their signatures will be stripped. Signed commits may have their signatures become invalid if the commit hash changes. Remember that the hash ID of a commit is a checksum of the commit's contents, so if the set of files changes, the checksum changes; and if the checksum of a parent commit changes, the checksum of this commit changes too.


    Using --tree-filter

    When you use git filter-branch with --tree-filter, what the filter-branch code does is to extract each commit, one at a time, into a temporary directory. This temporary directory has no .git directory and is not where you are running git filter-branch (it's actually in a subdirectory of the .git directory unless you use the -d option to redirect Git to, say, a memory filesystem, which is a good idea for speeding it up).

    After extracting the entire commit into this temporary directory, Git runs your tree-filter. Once your tree-filter finishes, Git packages up everything in that temporary directory into the new commit. Whatever you leave there, is in. Whatever you add to there, is added. Whatever you modify there, is modified. Whatever you remove from there, is no longer in the new commit.
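
To see this behaviour end to end, here is a minimal sketch that runs a trivial tree filter in a throwaway repository. The file names and the demo identity are assumptions, and FILTER_BRANCH_SQUELCH_WARNING only silences the startup notice on newer Git versions.

```shell
#!/bin/sh
# Sketch: a tree filter that deletes one known path from every commit.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo code > main.c
echo noise > build.log
git add .
git_c commit -q -m "first"
echo more >> main.c
git_c commit -q -am "second"

export FILTER_BRANCH_SQUELCH_WARNING=1
# Whatever the filter leaves in the temporary directory becomes the new commit.
git_c filter-branch --tree-filter 'rm -f build.log' -- --all >/dev/null 2>&1

# Paths recorded anywhere in the rewritten history of the current branch.
history=$(git log --name-only --pretty=format: HEAD | sed '/^$/d' | sort -u)
# filter-branch keeps the pre-rewrite refs under refs/original.
backup=$(git for-each-ref --format='%(refname)' refs/original)
echo "history: $history"
echo "backup:  $backup"
```

The old commits remain reachable through refs/original until those refs are deleted, which is why the history check looks at HEAD rather than --all.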

    Note that a .gitignore file in this temporary directory has no effect on what will be committed (but the .gitignore file itself will be committed, since whatever is in the temporary directory becomes the new copy-commit). So if you want to be sure that a file of some known path is not committed, simply rm -f known/path/to/file.ext. If the file was in the temporary directory, it is now gone. If not, nothing happens and all is well.

    Hence, a workable tree filter would be:

    rm -f $(cat /tmp/files-to-remove)
    

    (assuming no white space issues in file names; use xargs ... rm -f with the list on standard input to avoid white space issues, with whatever encoding you like for the xargs input; -z style encoding is ideal since \0 is forbidden in path names).
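
A white-space-safe variant might look like the following sketch. The directory stands in for filter-branch's temporary tree, and the list file name is hypothetical.

```shell
#!/bin/sh
# Sketch: delete listed paths using a NUL-delimited list.
set -eu

tree=$(mktemp -d)            # stands in for the temporary directory
list="$tree.remove-list"     # hypothetical list file, kept outside the tree

cd "$tree"
mkdir -p "logs dir"
echo x > "logs dir/debug output.log"
echo y > keep.txt

# NUL terminators make spaces in names harmless.
printf '%s\0' "logs dir/debug output.log" "no/such/file" > "$list"

# Tree-filter body: rm -f silently skips paths that do not exist.
xargs -0 rm -f < "$list"

left=$(find . -type f | sort)
echo "$left"
```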

    Converting this to an index filter

    Using an index filter lets Git skip the extract-and-examine phases. If you had a fixed "remove" list in the right form, it would be easy to use.

    Let's say you have the file names in /tmp/files-to-remove in a form that is suitable for xargs -0. Your index filter might then read, in its entirety:

    xargs -0 git rm --cached -f --ignore-unmatch < /tmp/files-to-remove
    

    which is basically the same as the rm -f above, but works within the temporary index Git uses for each commit-to-be-copied. (Add -q to the git rm --cached to make it quiet.)

    Applying .gitignore files in a tree filter

    Your example script tries to use a --tree-filter after rebasing onto an initial commit that has the desired items:

    git filter-branch --tree-filter 'git clean -f -X' -- --all
    

    There is one initial bug though (the git rebase is wrong):

    -git rebase --onto temp master
    +git rebase --onto temp temp master
    

    Fixing that, the thing still doesn't work, and the reason is that git clean -f -X only removes files that are actually ignored. Any file that is already in the index is not actually ignored.

    The trick is to empty out the index. However, this does too much: git clean then never descends into subdirectories—so the trick comes in two parts: empty out the index, then re-fill it with non-ignored files. Now git clean -f -X will remove the remaining files:

    -git filter-branch --tree-filter 'git clean -f -X' -- --all
    +git filter-branch --tree-filter 'git rm --cached -qrf . && git add . && git clean -fqX' -- --all
    

    (I added several "quiet" flags here).
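
The effect of the two-part trick can be checked outside filter-branch. This sketch uses a scratch repository with made-up file names: a tracked file that matches an ignore pattern is left alone by git clean -X until the index has been emptied and refilled.

```shell
#!/bin/sh
# Sketch: why the index must be emptied and refilled before git clean -X.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

echo noise > build.log
echo code > main.c
git add build.log main.c           # build.log is tracked before any rule exists
printf '*.log\n' > .gitignore

before=$(git clean -n -X)          # dry run: a tracked file is never "ignored"

git rm --cached -qrf .             # empty the index
git add .                          # refill it; *.log is now skipped
git clean -fqX                     # build.log is untracked and ignored, so it goes

tracked=$(git ls-files | sort)
echo "before: [$before]"
echo "$tracked"
```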

    To avoid needing to rebase in the first place to install initial .gitignore files, let's say you have a master set of .gitignore files you want in every commit (which we'll then use in the tree filter as well). Simply place these, and nothing else, in a temporary tree:

    mkdir /tmp/ignores-to-add
    cp .gitignore /tmp/ignores-to-add
    mkdir /tmp/ignores-to-add/main
    cp main/.gitignore /tmp/ignores-to-add/main
    

    (I'll leave working up a script that finds and copies just .gitignore files to you, it seems moderately annoying to do without one). Then, for the --tree-filter, use:

    cp -R /tmp/ignores-to-add/. . &&
        git rm --cached -qrf . &&
        git add . &&
        git clean -fqX
    

    The first step, cp -R (which can be done anywhere before the git add ., really), installs the correct .gitignore files. Since we do this to each commit, we never need to rebase before running filter-branch.

    The second removes everything from the index. (A slightly faster method is just rm $GIT_INDEX_FILE but it's not guaranteed that this will work forever.)

    The third re-adds ., i.e., everything in the temporary tree. Since the .gitignore files are in place, we only add non-ignored files.

    The last step, git clean -qfX, removes work-tree files that are ignored, so that filter-branch won't put them back.
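
The tree holding only .gitignore files could be assembled with something like the sketch below. Both directories are temporary stand-ins (the destination plays the role of /tmp/ignores-to-add), and paths containing newlines are not handled.

```shell
#!/bin/sh
# Sketch: copy every .gitignore, and nothing else, into a parallel tree.
set -eu

src=$(mktemp -d)       # stands in for the checkout with the desired rules
dest=$(mktemp -d)      # stands in for /tmp/ignores-to-add

mkdir -p "$src/main" "$src/.git"
printf '*.o\n'   > "$src/.gitignore"
printf '*.tmp\n' > "$src/main/.gitignore"
echo code        > "$src/main/app.c"

cd "$src"
# Walk the tree, skip .git, and recreate each .gitignore under $dest.
find . -name .git -prune -o -type f -name .gitignore -print |
while IFS= read -r path; do
    mkdir -p "$dest/$(dirname "$path")"
    cp "$path" "$dest/$path"
done

(cd "$dest" && find . -type f | sort)
```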

  • 2020-12-06 08:29

    This method makes Git completely forget ignored files (past/present/future), but does not delete anything from the working directory (even when re-pulled from remote).

    This method requires usage of .git/info/exclude (preferred) OR a pre-existing .gitignore in all the commits that have files to be ignored/forgotten. [1]

    All methods of enforcing Git ignore behavior after the fact effectively rewrite history and thus have significant ramifications for any public/shared/collaborative repos that might be pulled after this process. [2]

    General advice: start with a clean repo - everything committed, nothing pending in working directory or index, and make a backup!

    Also, the comments/revision history of this answer (and revision history of this question) may be useful/enlightening.

    #commit up-to-date .gitignore (if not already existing)
    #this command must be run on each branch
    
    git add .gitignore
    git commit -m "Create .gitignore"
    
    #apply standard git ignore behavior only to current index, not working directory (--cached)
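    #if this command returns nothing, ensure .git/info/exclude AND/OR .gitignore exist
    #this command must be run on each branch
    #(--cached is spelled out because recent Git versions refuse --ignored without --cached or --others)
    
    git ls-files -z --cached --ignored --exclude-standard | xargs -0 git rm --cached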
    
    #Commit to prevent working directory data loss!
    #this commit will be automatically deleted by the --prune-empty flag in the following command
    #this command must be run on each branch
    
    git commit -m "ignored index"
    
    #Apply standard git ignore behavior RETROACTIVELY to all commits from all branches (--all)
    #This step WILL delete ignored files from working directory UNLESS they have been dereferenced from the index by the commit above
    #This step will also delete any "empty" commits.  If deliberate "empty" commits should be kept, remove --prune-empty and instead run git reset HEAD^ immediately after this command
    
    git filter-branch --tree-filter 'git ls-files -z --cached --ignored --exclude-standard | xargs -0 git rm -f --ignore-unmatch' --prune-empty --tag-name-filter cat -- --all
    
    #List all still-existing files that are now ignored properly
    #if this command returns nothing, it's time to restore from backup and start over
    #this command must be run on each branch
    
    git ls-files --other --ignored --exclude-standard
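
As an extra check, one could ask whether any path recorded anywhere in history still matches the current ignore rules. The sketch below builds a scratch repository where the answer is deliberately "yes". File names and the demo identity are made up, and paths with unusual characters are not handled.

```shell
#!/bin/sh
# Sketch: list historical paths that the current ignore rules would exclude.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git_c() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo noise > build.log
echo code > main.c
git add .
git_c commit -q -m "first"

git rm -q --cached build.log
printf '*.log\n' > .gitignore
git add .gitignore
git_c commit -q -m "ignore logs"

# Every path ever committed, de-duplicated, tested against the ignore rules.
# check-ignore exits non-zero when nothing matches, hence the "|| true".
leftovers=$(git log --all --pretty=format: --name-only |
    sed '/^$/d' | sort -u |
    git check-ignore --no-index --stdin || true)
echo "still in history: $leftovers"
```

After a successful rewrite the same pipeline should print nothing, provided the refs/original backups have been removed first, since --all includes them.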
    

    Finally, follow the rest of this GitHub guide (starting at step 6) which includes important warnings/information about the commands below.

    git push origin --force --all
    git push origin --force --tags
    git for-each-ref --format="delete %(refname)" refs/original | git update-ref --stdin
    git reflog expire --expire=now --all
    git gc --prune=now
    

    Other developers who pull from the now-modified remote repo should make a backup and then:

    #fetch modified remote
    
    git fetch --all
    
    #"Pull" changes WITHOUT deleting newly-ignored files from working directory
    #This will overwrite local tracked files with remote - ensure any local modifications are backed-up/stashed
    #Switching branches after this procedure WILL LOSE all newly-gitignored files in working directory because they are no longer tracked when switching branches
    
    git reset FETCH_HEAD
    

    Footnotes

    [1] Because .git/info/exclude can be applied to all historical commits using the instructions above, details about getting a .gitignore file into the historical commit(s) that need it are perhaps beyond the scope of this answer. I wanted a proper .gitignore to be in the root commit, as if it were the first thing I did. Others may not care, since .git/info/exclude can accomplish the same thing regardless of where the .gitignore exists in the commit history, and clearly rewriting history is a very touchy subject, even when aware of the ramifications.

    FWIW, potential methods may include git rebase or a git filter-branch that copies an external .gitignore into each commit, like the answers to this question

    [2] Enforcing Git ignore behavior after the fact by committing the results of a standalone git rm --cached command may result in newly-ignored file deletion in future pulls from the force-pushed remote. The --prune-empty flag in the git filter-branch command avoids this problem by automatically removing the previous "delete all ignored files" index-only commit. Rewriting Git history also changes commit hashes, which will wreak havoc on future pulls from public/shared/collaborative repos. Please understand the ramifications fully before doing this to such a repo. This GitHub guide specifies the following:

    Tell your collaborators to rebase, not merge, any branches they created off of your old (tainted) repository history. One merge commit could reintroduce some or all of the tainted history that you just went to the trouble of purging.

    Alternative solutions that do not affect the remote repo are git update-index --assume-unchanged </path/file> or git update-index --skip-worktree <file>, examples of which can be found here.
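
For the skip-worktree alternative, a minimal sketch in a scratch repository follows. The file name and demo identity are hypothetical.

```shell
#!/bin/sh
# Sketch: hide local edits to a tracked file without touching history.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'debug=false' > local.conf
git add local.conf
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add config"

git update-index --skip-worktree local.conf
echo 'debug=true' > local.conf        # local edit Git should now overlook

status=$(git status --porcelain)      # empty: the change is not reported
flags=$(git ls-files -v local.conf)   # a leading "S" marks skip-worktree entries
echo "status: [$status]"
echo "$flags"
```

The flag is local to one clone, and git update-index --no-skip-worktree reverses it.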

  • 2020-12-06 08:31

    On Windows this sequence did not work for me:

    cp -R /tmp/ignores-to-add . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX
    

    But the following works.

    Filter every commit using its existing .gitignore:

    git filter-branch --index-filter '
      git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
    ' -- --all
    

    Update .gitignore in every commit and filter files:

    cp ../.gitignore /d/tmp-gitignore
    git filter-branch --index-filter '
      cp /d/tmp-gitignore ./.gitignore
      git add .gitignore
      git ls-files -c -i --exclude-from=.gitignore | xargs git rm --cached -q
    ' -- --all
    rm /d/tmp-gitignore
    

    Use grep -v if you have special cases, for example a file named empty that keeps an otherwise empty directory in the repository:

    git ls-files -c -i --exclude-from=.gitignore | grep -vE "empty$" | xargs git rm --cached -q
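
One thing to watch with the pattern above is that it spares every path ending in "empty", not just files named exactly that. A quick sketch on a made-up path list shows the difference between that pattern and a stricter one anchored at a path separator.

```shell
#!/bin/sh
# Sketch: compare a loose and a strict exclusion pattern for placeholder files.
set -eu

# Sample paths as an ignore query might list them (invented for the demo).
candidates=$(printf '%s\n' 'cache/empty' 'cache/data.bin' 'notempty' 'logs/empty')

# Loose: drops anything ending in "empty", including "notempty".
loose=$(printf '%s\n' "$candidates" | grep -vE 'empty$')

# Strict: drops only files whose name is exactly "empty".
strict=$(printf '%s\n' "$candidates" | grep -vE '(^|/)empty$')

echo "loose:"
echo "$loose"
echo "strict:"
echo "$strict"
```

Whatever survives the grep is what gets passed on for removal, so the strict form removes "notempty" while still keeping the placeholder files.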
    