GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?
Install the ripgrep utility, which supports UTF-16, then run:
diff <(rg -N . file1.txt) <(rg -N . file2.txt)
ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
From the GNU diff documentation:
Handling Multibyte and Varying-Width Characters
diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.
Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.
These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.
The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.
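To see why a byte-oriented tool struggles with UTF-16 in particular, it helps to look at the file one byte at a time. Here is a small Python illustration (the sample string is made up): in UTF-16 every ASCII character carries a NUL byte, so splitting on the newline byte cuts code units in half. As far as I know, those NUL bytes are also why GNU diff tends to treat such files as binary.

```python
# Illustration only: how a unibyte tool would see UTF-16 text.
text = "a\nb\n"  # made-up sample content

raw = text.encode("utf-16-le")  # little-endian, no BOM

# A tool that works on single bytes splits lines on the byte 0x0A.
byte_lines = raw.split(b"\n")

# The NUL half of each newline code unit leaks into the following "line".
print(byte_lines)

# Decoding first and splitting afterwards gives the lines a human expects.
text_lines = raw.decode("utf-16-le").splitlines()
print(text_lines)
```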
I never realized that myself.
It looks like Guiffy could do the job if a non-free, non-command-line tool is acceptable. I'm still looking for a freeware command-line tool:
http://www.guiffy.com/Diff-Tool.html
vimdiff works quite nicely for this purpose.
I found it while reading this StackOverflow answer.
GNU diff produces malformed patches when accent marks or special characters are used:
diff --version
diff (GNU diffutils) 3.6
diff -Naur old_foo new_foo > foo.patch
git diff correctly handles accent marks and special characters, regardless of whether the compared files/dirs are in a git folder:
git --version
git version 2.17.1
git diff --no-index old_foo new_foo > foo.patch
In Python, you can use difflib.HtmlDiff to create an HTML table that shows the differences between two sequences of lines, and it seems to work fine with Unicode strings (provided, of course, you read and write them with the appropriate codecs).
>>> import difflib
>>> hd = difflib.HtmlDiff()
>>> htmldiff = hd.make_file(open('file1', encoding='utf-16').readlines(), open('file2', encoding='utf-16').readlines())
>>> with open('diff.html', 'w', encoding='utf-16') as out:
...     out.write(htmldiff)
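If you want a text patch on the command line rather than an HTML table, the same module also has difflib.unified_diff. A rough sketch, assuming both files really are UTF-16; the function name and the script name in the comment are placeholders of mine, not anything standard:

```python
import difflib
import sys


def unified_diff_text(path_a, path_b, encoding="utf-16"):
    """Return a unified diff of two text files decoded with `encoding`."""
    with open(path_a, encoding=encoding) as fa:
        a_lines = fa.readlines()
    with open(path_b, encoding=encoding) as fb:
        b_lines = fb.readlines()
    # unified_diff yields header, hunk and content lines; join them into one string.
    return "".join(
        difflib.unified_diff(a_lines, b_lines, fromfile=path_a, tofile=path_b)
    )


if __name__ == "__main__":
    # Usage (hypothetical script name): python u16diff.py file1 file2
    if len(sys.argv) == 3:
        sys.stdout.write(unified_diff_text(sys.argv[1], sys.argv[2]))
```

The output is ordinary Python text, so it is written to the terminal in whatever encoding your locale uses, not in UTF-16.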
You could maybe build something in Python with the excellent chardet, then convert your files to UTF-8 and send them to GNU diff?
http://chardet.feedparser.org/
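A minimal sketch of that idea, with one substitution: instead of chardet it only sniffs the byte order mark using the standard codecs module, which is usually enough for UTF-16 files written on Windows. The helper names (to_utf8, diff_as_utf8) are my own, and it assumes a diff executable on the PATH and a Unix-like system where a named temporary file can be reopened by another process.

```python
import codecs
import contextlib
import subprocess
import tempfile

# BOMs to check. UTF-32 comes first because its little-endian BOM
# starts with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_utf8(raw, fallback="utf-8"):
    """Re-encode raw bytes as UTF-8, guessing the source encoding from its BOM."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw.decode(name).encode("utf-8")
    # No BOM: assume `fallback`. A chardet-based guess could go here instead.
    return raw.decode(fallback).encode("utf-8")


def diff_as_utf8(path_a, path_b):
    """Run `diff -u` on UTF-8 copies of both files and return its output as text."""
    with contextlib.ExitStack() as stack:
        names = []
        for path in (path_a, path_b):
            tmp = stack.enter_context(tempfile.NamedTemporaryFile(suffix=".utf8"))
            with open(path, "rb") as src:
                tmp.write(to_utf8(src.read()))
            tmp.flush()
            names.append(tmp.name)
        # diff exits with status 1 when the files differ, so check=True is not used.
        result = subprocess.run(["diff", "-u", *names], capture_output=True)
        return result.stdout.decode("utf-8")
```

The file names in the diff header will be the temporary ones, so this is handy for reading the differences but not for producing a patch you intend to apply.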