How do I diff utf-16 files with GNU diff?

一向 2021-01-01 15:07

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?

6 Answers
  • 2021-01-01 15:41

    Install the ripgrep utility, which supports UTF-16, then run:

    diff <(rg -N . file1.txt) <(rg -N . file2.txt)
    

    ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
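The automatic UTF-16 detection mentioned above can be approximated by checking for a byte-order mark (BOM) at the start of the file. The following is a minimal sketch of that idea, not ripgrep's actual implementation; the names `sniff_utf16` and `read_text` are hypothetical, and a file without a BOM is assumed to be UTF-8.

```python
import codecs


def sniff_utf16(raw: bytes):
    """Return 'utf-16' if `raw` starts with a UTF-16 BOM, otherwise None."""
    # The UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM,
    # so rule UTF-32 out first. (A UTF-16-LE file whose first character is
    # U+0000 would be misclassified; this sketch ignores that edge case.)
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def read_text(path, fallback="utf-8"):
    """Decode a file as UTF-16 when a BOM is present, else with `fallback`."""
    with open(path, "rb") as f:
        raw = f.read()
    # The 'utf-16' codec consumes the BOM and picks the byte order from it.
    return raw.decode(sniff_utf16(raw) or fallback)
```

Files that are UTF-16 without a BOM are not detected this way and need the encoding stated explicitly, which matches the caveat about the -E/--encoding flag.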

  • 2021-01-01 15:47

    From the GNU diff documentation:

    Handling Multibyte and Varying-Width Characters

    diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

    Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

    These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

    The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

    I never realized that myself.
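To see why a byte-oriented tool struggles with UTF-16 in particular, it helps to look at the raw bytes. The short sketch below is only an illustration in Python, not something GNU diff runs: it shows that UTF-16 text is full of NUL bytes (GNU diff generally treats a file containing NUL bytes near its start as binary) and that splitting on the single newline byte cuts a two-byte character in half.

```python
text = "diff\n"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no BOM

print(utf8_bytes)   # b'diff\n'
print(utf16_bytes)  # b'd\x00i\x00f\x00f\x00\n\x00'

# Each ASCII character, including the newline, is followed by a NUL byte.
nul_count = utf16_bytes.count(b"\x00")
print(nul_count)    # 5

# Splitting on the byte 0x0A leaves the newline's NUL half at the start
# of the next "line", so byte-wise lines no longer align with characters.
pieces = "a\nb\n".encode("utf-16-le").split(b"\n")
print(pieces)       # [b'a\x00', b'\x00b\x00', b'\x00']
```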

    It looks like Guiffy could do the job if a non-free, GUI-only tool is acceptable. I'm still looking for a free command-line tool:

    http://www.guiffy.com/Diff-Tool.html

  • 2021-01-01 15:54

    vimdiff works quite nicely for this purpose.

    I found it while reading this StackOverflow answer.

  • 2021-01-01 15:56

    GNU diff produces malformed patches when accent marks or special characters are used:

     diff --version
     diff (GNU diffutils) 3.6
     diff -Naur old_foo new_foo > foo.patch
    

    git diff, by contrast, correctly handles accent marks and special characters, regardless of whether the compared files or directories are inside a git repository:

     git --version
     git version 2.17.1
     git diff --no-index old_foo new_foo > foo.patch
    
  • 2021-01-01 15:58

    In Python, you can use difflib.HtmlDiff to create an HTML table that shows the differences between two sequences of lines, and it seems to work fine with Unicode strings (provided, of course, you read and write them with the appropriate codecs).

    >>> import difflib
    >>> hd = difflib.HtmlDiff()
    >>> lines1 = open('file1', encoding='utf-16').readlines()
    >>> lines2 = open('file2', encoding='utf-16').readlines()
    >>> htmldiff = hd.make_file(lines1, lines2)
    >>> open('diff.html', 'w', encoding='utf-16').write(htmldiff)
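If a plain-text, patch-style result is preferable to an HTML table, the same module also provides `difflib.unified_diff`. The sketch below assumes Python 3 and that both files are UTF-16 with a BOM; the function name `unified_diff_utf16` is made up for illustration.

```python
import difflib


def unified_diff_utf16(path_a, path_b, encoding="utf-16"):
    """Return a unified diff of two text files as a single string."""
    # Decode both files to str so the comparison works on characters,
    # not on raw UTF-16 bytes.
    with open(path_a, encoding=encoding) as fa:
        lines_a = fa.readlines()
    with open(path_b, encoding=encoding) as fb:
        lines_b = fb.readlines()
    # Lines keep their trailing "\n", so joining the generator's output
    # yields text that looks like the output of `diff -u`.
    return "".join(
        difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b)
    )
```

Calling `print(unified_diff_utf16('file1', 'file2'))` writes the diff to the terminal. A file whose last line lacks a newline will produce slightly untidy output, since difflib does not add the "\ No newline at end of file" marker.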
    
  • 2021-01-01 16:01

    You could build something in Python with the excellent chardet library to detect each file's encoding, then convert the files to UTF-8 and pass the results to GNU diff.

    http://chardet.feedparser.org/
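The convert-then-diff half of that idea might look like the sketch below. It is a rough outline under stated assumptions: the encodings are passed in explicitly (defaulting to UTF-16) rather than detected, a GNU `diff` binary is assumed to be on the PATH, and the helper names `to_utf8_tempfile` and `gnu_diff_as_utf8` are hypothetical. A detector such as chardet would plug in by supplying the `enc_a` and `enc_b` arguments.

```python
import os
import subprocess
import tempfile


def to_utf8_tempfile(path, encoding):
    """Transcode `path` from `encoding` into a UTF-8 temp file; return its name."""
    # newline="" disables newline translation so CRLF/LF differences survive.
    with open(path, encoding=encoding, newline="") as src:
        text = src.read()
    fd, tmp_name = tempfile.mkstemp(suffix=".utf8.txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return tmp_name


def gnu_diff_as_utf8(path_a, path_b, enc_a="utf-16", enc_b="utf-16"):
    """Run `diff -u` on UTF-8 copies of both files and return its output."""
    tmp_a = to_utf8_tempfile(path_a, enc_a)
    tmp_b = to_utf8_tempfile(path_b, enc_b)
    try:
        result = subprocess.run(
            # --label makes the header show the original names, not the temp files.
            ["diff", "-u", "--label", path_a, "--label", path_b, tmp_a, tmp_b],
            capture_output=True,
            encoding="utf-8",
        )
        return result.stdout
    finally:
        os.remove(tmp_a)
        os.remove(tmp_b)
```

Keeping the transcoding in its own function means the detection step can be swapped out or skipped without touching the call to diff.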
