问题
I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with removedSensitiveInfo
. This is what I came up with after browsing through numerous StackOverflow topics and other sites.
git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.
For reference this is how I do the check
git grep aSecretPassword1 $(git rev-list --all)
Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.
Any idea what's going on here?
I double checked the regular expression I'm using which seems to be correct. I'm not sure what else to check for or how to properly debug this as my Git knowledge quite rudimentary. For example I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else.
Any help is very much appreciated.
P.S. I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but all suggested solutions haven't worked for me.
回答1:
git-filter-branch
is a powerful but difficult to use tool - there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause for the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:
- Substitute given strings (ie passwords) within all text files (without specifying a specific file/file-type)
- Ensure that the updated Git history does not contain the old password text
- Do the above as simply as possible
There is a tailor-made solution to this problem:
Use The BFG... not git-filter-branch
The BFG Repo-Cleaner is a simpler alternative to git-filter-branch
specifically designed for removing passwords and other unwanted data from Git repository history.
Ways in which the BFG helps you in this situation:
- The BFG is 10-720x faster
- It automatically runs on all tags and references, unlike
git-filter-branch
- which only does that if you add the extraordinary--tag-name-filter cat -- --all
command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems) - The BFG doesn't generate any
refs/original/
refs - so no need for you to perform an extra step to remove them - You can express you passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.
Using the BFG
Carefully follow the usage steps - the core bit is just this command:
$ java -jar bfg.jar --replace-text replacements.txt my-repo.git
The replacements.txt
file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):
PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass # replace with 'examplePass' instead
PASSWORD3==> # replace with the empty string
regex:password=\w+==>password= # Replace, using a regex
Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.
Full disclosure: I'm the author of the BFG Repo-Cleaner.
回答2:
Looks OK. Remember that filter-branch retains the original commits under refs/original/
, e.g.:
$ git commit -m 'add secret password, oops!'
[master edaf467] add secret password, oops!
1 file changed, 4 insertions(+)
create mode 100644 secret
$ git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
Rewrite edaf467960ade97ea03162ec89f11cae7c256e3d (2/2)
Ref 'refs/heads/master' was rewritten
Then:
$ git grep aSecretPassword `git rev-list --all`
edaf467960ade97ea03162ec89f11cae7c256e3d:secret:aSecretPassword2
but:
$ git lola
* e530e69 (HEAD, master) add secret password, oops!
| * edaf467 (refs/original/refs/heads/master) add secret password, oops!
|/
* 7624023 Initial
(git lola
is my alias for git log --graph --oneline --decorate --all
). Yes, it's in there, but under the refs/original
name space. Clear that out:
$ rm -rf .git/refs/original
$ git reflog expire --expire=now --all
$ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)
and then:
$ git grep aSecretPassword `git rev-list --all`
$
(as always, run filter-branch
on a copy of the repo Just In Case; and then removing original refs, expiring the reflog "now", and gc'ing, means stuff is Really Gone).
来源:https://stackoverflow.com/questions/19870861/how-to-substitute-words-in-git-history-properly-debug-related-problems