Quantifying the amount of change in a git diff?

眉间皱痕 submitted on 2019-11-28 18:38:09

wdiff does word-by-word comparison. Git can be configured to use an external program to do the diffing. Based on those two facts and this blog post, the following should do roughly what you want.

Create a script that ignores most of the arguments git-diff provides and passes only the two file paths to wdiff. Save the following as ~/wdiff.py or something similar and make it executable.

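#!/usr/bin/env python3

import subprocess
import sys

# git passes: path old-file old-hex old-mode new-file new-hex new-mode
# Only the two file paths (arguments 2 and 5) are needed.
subprocess.call(['wdiff', '-s3', sys.argv[2], sys.argv[5]])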

Tell git to use it.

git config --global diff.external ~/wdiff.py
git diff filename
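For reference, git invokes an external diff program with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode), which is why the script reads arguments 2 and 5. A minimal shell sketch of the same argument selection follows; the function name external_diff_files is made up for illustration and is not part of git or wdiff.

```shell
# Sketch only: external_diff_files is a hypothetical helper name.
# git calls an external diff program with seven arguments:
#   path old-file old-hex old-mode new-file new-hex new-mode
external_diff_files() {
  old_file=$2   # temporary file holding the old version
  new_file=$5   # temporary file holding the new version
  printf '%s\n%s\n' "$old_file" "$new_file"
}
```

A shell wrapper built on this would then run something like wdiff -s3 "$old_file" "$new_file" instead of printing the two paths.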

git diff --word-diff works in the latest stable version of git (at git-scm.com)

There are a few options that let you choose the output format. The default is quite readable, but you might want --word-diff=porcelain if you're feeding the output into a script.

Stoutie

Building on James' and cornmacrelf's input, I've added arithmetic expansion, and came up with a few reusable alias commands for counting words in a git diff:

alias gitwa='git diff --word-diff=porcelain origin/master | grep -e "^+[^+]" | wc -w | xargs'
alias gitwd='git diff --word-diff=porcelain origin/master | grep -e "^-[^-]" | wc -w | xargs'
alias gitw='echo $(($(gitwa) - $(gitwd)))'

Output from gitwa and gitwd is trimmed using the xargs trick.
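The two ingredients of those aliases can be looked at in isolation. The sketch below is only illustrative, and the helper names trim and net_words are invented here: xargs with no command echoes its input with the surrounding whitespace collapsed, and $(( ... )) performs the integer subtraction.

```shell
# Sketch only: trim and net_words are hypothetical helper names.

# Some wc implementations pad the count with leading spaces.
# xargs with no command runs echo on its input, which drops that padding.
trim() {
  xargs
}

# Net change = words added minus words removed, via arithmetic expansion.
net_words() {
  echo $(( $1 - $2 ))
}
```

For example, net_words 120 45 prints 75, and a negative result means more words were removed than added.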

I figured out a way to get concrete numbers by building on top of the other answers here. The result is an approximation, but it should be close enough to serve as a useful indicator of the number of characters that were added or removed. Here's an example with my current branch compared to origin/master:

$ git diff --word-diff=porcelain origin/master | grep -e '^+[^+]' | wc -m
38741
$ git diff --word-diff=porcelain origin/master | grep -e '^-[^-]' | wc -m
46664

The difference between the removed characters (46664) and the added characters (38741) shows that my current branch has removed approximately 7923 characters. The individual added/removed counts are inflated by the diff's +/- markers and indentation characters; however, taking the difference should cancel out a significant portion of that inflation in most cases.
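If the inflation from the +/- markers matters, one option is to strip the marker before counting. This is a rough sketch, assuming the porcelain output is piped in on stdin. count_chars is a made-up helper name, and it treats any line starting with "--- " or "+++ " as a file header, so real text that happens to begin that way would be skipped.

```shell
# Sketch only: count_chars is a hypothetical helper, not a git command.
# Reads `git diff --word-diff=porcelain` output on stdin.
count_chars() {
  # $1 is the marker to count: '+' for added text, '-' for removed text.
  awk -v marker="$1" '
    # Skip the "--- a/file" and "+++ b/file" header lines.
    substr($0, 1, 4) == "--- " || substr($0, 1, 4) == "+++ " { next }
    # Count the line without its leading marker character.
    substr($0, 1, 1) == marker { total += length($0) - 1 }
    END { print total + 0 }
  '
}
```

Usage would look like git diff --word-diff=porcelain origin/master | count_chars +. The numbers come out lower than the wc -m figures because neither the marker nor the trailing newline is counted, and some awk implementations count bytes rather than characters for non-ASCII text.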

Git has had (for a long time) a --color-words option for git diff. This doesn't get you your counting, but it does let you see the diffs.

scompt.com's suggestion of wdiff is also good; it's pretty easy to shove in a different differ (see git-difftool). From there you just have to go from the output wdiff can give to the result you really want.

There's one more exciting thing to share, though, from git's what's cooking:

* tr/word-diff (2010-04-14) 1 commit
  (merged to 'next' on 2010-05-04 at d191b25)
 + diff: add --word-diff option that generalizes --color-words

Here's the commit introducing word-diff. Presumably it will make its way from next into master before long, and then git will be able to do this all internally - either producing its own word diff format or something similar to wdiff. If you're daring, you could build git from next, or just merge that one commit into your local master to build.

Thanks to Jakub's comment: you can further customize word diffs if necessary by providing a word regex (config parameter diff.*.wordRegex), documented in gitattributes.

Anthony Panozzo

I liked Stoutie's answer and wanted to make it a bit more configurable to answer some word count questions I had. Ended up with the following solution that works in ZSH and should work in Bash. Each function takes any revision or revision difference, with a default of comparing the current state of the world with origin/master:


# Calculate writing word diff between revisions. Cribbed / modified from:
# https://stackoverflow.com/questions/2874318/quantifying-the-amount-of-change-in-a-git-diff
function git_words_added {
  revision=${1:-origin/master}

  git diff --word-diff=porcelain $revision | \
    grep -e "^+[^+]" | \
    wc -w | \
    xargs
}

function git_words_removed {
  revision=${1:-origin/master}

  git diff --word-diff=porcelain $revision | \
    grep -e "^-[^-]" | \
    wc -w | \
    xargs
}

function git_words_diff {
  revision=${1:-origin/master}

  echo $(($(git_words_added "$revision") - $(git_words_removed "$revision")))
}

Then you can use it like so:


$ git_words_added
# => how many words were added since origin/master

$ git_words_removed
# => how many words were removed since origin/master

$ git_words_diff
# => difference of adds and removes since origin/master (net words)

$ git_words_diff HEAD
# => net words since you last committed

$ git_words_diff master@{yesterday}
# => net words written today!

$ git_words_diff HEAD^..HEAD
# => net words in the last commit

$ git_words_diff ABC123..DEF456
# => net words between two arbitrary commits

Hope this helps someone!

Sorry, I don't have enough reputation points to comment on @codebeard's answer. It is the one I used, and I added both of his versions to my .gitconfig file. They gave different answers, and I traced the problem to wdiff -sd in the second version (the one that combines all modified files together) counting the words in the two lines at the top of the output of diff -pdrU3. It will be something like:

--- 1   2018-12-10 22:53:47.838902415 -0800
+++ 2   2018-12-10 22:53:57.674835179 -0800

I fixed this by piping through tail -n +4.

Here's my full .gitconfig settings with the fix in place:

[alias]
    wdiff = diff
    wdiffs = difftool -t wdiffs
    wdiffs-all = difftool -d -t wdiffs-all
[difftool "wdiffs"]
    cmd = wdiff -n -s \"$LOCAL\" \"$REMOTE\" | colordiff
[difftool "wdiffs-all"]
    cmd = diff -pdrU3 \"$LOCAL\" \"$REMOTE\" | tail -n +4 | wdiff -sd

If you'd rather use git config here are the commands:

git config --global difftool.wdiffs.cmd 'wdiff -n -s "$LOCAL" "$REMOTE" | colordiff'
git config --global alias.wdiffs 'difftool -t wdiffs'
git config --global difftool.wdiffs-all.cmd 'diff -pdrU3 "$LOCAL" "$REMOTE" | tail -n +4 | wdiff -sd'
git config --global alias.wdiffs-all 'difftool -d -t wdiffs-all'

Now, you can do git wdiffs or git wdiffs-all to get your word count since the last commit.

To compare to origin/master, do git wdiffs origin/master or git wdiffs-all origin/master.

I like this answer the best because it gives both the word count and the diff, and if you pipe through colordiff, it comes out nice and colored. (@Miles's answer is also good but requires you to figure out what time to use. However, I like the idea of looking for moved text.)

wdiff's stats output at the end looks like this:

file1.txt: 12360 words  12360 100% common  0 0% deleted  5 0% changed
file2.txt: 12544 words  12360 99% common  184 1% inserted  11 0% changed

To find out how many words you have added, add inserted and changed from the second line, 184+11, in the example above.

Why not anything from the first line? Answer: those are words removed.

Here's a bash script to get a single, unified word count:

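# The last line of wdiff's statistics describes the new version of the files.
wdiffoutput=$(git wdiffs-all | tail -n 1)
# The number that follows "common" is the inserted-word count...
wdiffins=$(echo "$wdiffoutput" | grep -oP "common *\K\d+")
# ...and the number that follows "inserted" is the changed-word count.
wdiffchg=$(echo "$wdiffoutput" | grep -oP "inserted *\K\d+")
echo "Word Count: $((wdiffins+wdiffchg))"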
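grep -P is a GNU extension and may not be available everywhere (the stock grep on macOS, for instance). As a possible alternative, here is a sketch that does the same sum with awk, assuming wdiff's statistics arrive on stdin in the format shown above. The function name wdiff_words_added is invented for this example.

```shell
# Sketch only: wdiff_words_added is a hypothetical helper name.
# Reads wdiff statistics on stdin and sums "inserted" + "changed"
# from the last line, which describes the new version.
wdiff_words_added() {
  tail -n 1 | awk '{
    # Each statistic is printed as "<count> <percent>% <label>",
    # so the count sits two fields before its label.
    for (i = 3; i <= NF; i++) {
      if ($i == "inserted" || $i == "changed") sum += $(i - 2)
    }
    print sum + 0
  }'
}
```

With the git aliases above it could be used as git wdiffs-all | wdiff_words_added.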

Since Git 1.6.3 there is also git difftool, which can be configured to run nearly any external diff tool. This is a lot easier than some of the solutions which require creating scripts etc. If you like the output of wdiff -s you can configure something like:

git config --global difftool.wdiffs.cmd 'wdiff -s "$LOCAL" "$REMOTE"'
git config --global alias.wdiffs 'difftool -t wdiffs'

Now you can just run git difftool -t wdiffs or its alias git wdiffs.

If you prefer to get statistics for all modified files together, instead do something like:

git config --global difftool.wdiffs.cmd 'diff -pdrU3 "$LOCAL" "$REMOTE" | wdiff -sd'
git config --global alias.wdiffs 'difftool -d -t wdiffs'

This takes the output of a typical unified diff and pipes it into wdiff with its -d option set to just interpret the input. In contrast, the extra -d argument to difftool in the alias tells git to copy all modified files to a temporary directory before doing the diff.

The above answers fail for some use cases where you need to exclude moved text (e.g., if I move a function in code or paragraph in latex further down the document, I don't want to count all of those as changes!)

For that, you can also calculate the number of duplicate lines, and exclude those from your query if there are too many duplicates.

For example, building on the other answers, I can do:

git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs

This calculates the number of duplicate words in the diff, where sha is your commit.
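To see what that pipeline measures, it can be wrapped in a function that reads a plain unified diff on stdin and tried on a tiny hand-written diff. This is only a sketch, and the name count_moved_words is made up here.

```shell
# Sketch only: count_moved_words is a hypothetical helper name.
# Reads a unified diff on stdin and counts the words on lines that
# occur more than once among the added and removed lines.
count_moved_words() {
  grep -e '^+[^+]' -e '^-[^-]' |  # keep added/removed lines, drop +++/--- headers
    sed -e 's/^.//' |             # strip the leading + or - marker
    sort | uniq -d |              # keep one copy of each repeated line
    wc -w | xargs                 # count words and trim padding
}
```

For a diff that removes the line "moved line here" in one place and adds it back in another, the function prints 3. Lines that appear only once, such as genuinely new text, contribute nothing.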

You can do this for all the commits within the last day (since 6 am) by:

for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
     echo $(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^+[^+]"|wc -w|xargs),\
     $(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^-[^-]"|wc -w|xargs),\
     $(git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
done

Prints: added, deleted, duplicates

(I take the line diff for duplicates, as it excludes the times where git diff tries to be too clever, and assumes you have actually just changed text rather than moved it. It also discounts instances where a single word is counted as a duplicate.)

Or, if you want to be sophisticated about it, you can exclude commits entirely if there is more than 80% duplication, and sum up the rest:

total=0
for sha in $(git rev-list --since="6am" master | sed -e '$ d'); do
    added=$(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^+[^+]"|wc -w|xargs)
    deleted=$(git diff --word-diff=porcelain $sha~1..$sha|grep -e"^-[^-]"|wc -w|xargs)
    duplicated=$(git diff $sha~1..$sha|grep -e"^+[^+]" -e"^-[^-]"|sed -e's/.//'|sort|uniq -d|wc -w|xargs)
    if [ "$added" -eq "0" ]; then
        changed=$deleted
        total=$((total+deleted))
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" $changed
    elif [ "$(echo "$duplicated/$added > 0.8" | bc -l)" -eq "1" ]; then
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" 0
    else
        changed=$((added+deleted))
        total=$((total+changed))
        echo "added:" $added, "deleted:" $deleted, "duplicated:"\
             $duplicated, "changes counted:" $changed
    fi
done
echo "Total changed:" $total

I have this script to do it here: https://github.com/MilesCranmer/git-stats.

This prints out:

➜  bifrost_paper git:(master) ✗ count_changed_words "6am" 

added: 38, deleted: 76, duplicated: 3, changes counted: 114
added: 14, deleted: 19, duplicated: 0, changes counted: 33
added: 1113, deleted: 1112, duplicated: 1106, changes counted: 0
added: 1265, deleted: 1275, duplicated: 1225, changes counted: 0
added: 4207, deleted: 4208, duplicated: 4391, changes counted: 0
Total changed: 147

The commits where I am just moving around things are obvious, so I don't count those changes. It counts up everything else and tells me the total number of changed words.
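The per-commit decision can also be written without bc by comparing integers: duplicated/added > 0.8 is the same test as duplicated*10 > added*8. The following is a sketch under that assumption, with an invented function name counted_change; it takes the three counts as arguments and prints how many changed words would be counted for the commit.

```shell
# Sketch only: counted_change is a hypothetical helper name.
# Usage: counted_change ADDED DELETED DUPLICATED
counted_change() {
  added=$1
  deleted=$2
  duplicated=$3
  if [ "$added" -eq 0 ]; then
    # Nothing added: count the deletions only (also avoids dividing by zero).
    echo "$deleted"
  elif [ $(( duplicated * 10 )) -gt $(( added * 8 )) ]; then
    # More than 80% duplication: treat the commit as moved text.
    echo 0
  else
    echo $(( added + deleted ))
  fi
}
```

Running it on the rows of the sample output gives 114 for "38 76 3", 33 for "14 19 0", and 0 for "1113 1112 1106".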
