Question
I'm trying to use the following command on a text file:
$ sort <m.txt | uniq -c | sort -nr >m.dict
However I get the following error message:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.
I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:
Using AWK to place each word in a text file on a new line
I'm not sure if I'm getting these errors because of this, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I was required to change the encoding to 'Latin-1').
I tried following the error message's advice and setting LC_ALL='C', but this has not helped. Can anyone elaborate on the errors I'm receiving and provide any advice on how I might go about solving this problem?
UPDATE:
When I tried dos2unix, errors were displayed about invalid characters at certain lines. It turns out these were not Welsh characters but other strange characters (arrows etc.). I went through my text file removing these characters until I was able to run dos2unix without error. However, after running dos2unix all the text was concatenated (no spaces or newlines at all, whereas each word in the file should have been on a separate line). I then ran unix2dos and the text file was back to normal. How can I put each word on its own line and use the sort command without it giving me errors about '\r' characters?
Answer 1:
I know it's an old question, but just running the command export LC_ALL='C' does the trick, as described by the message "sort: Set LC_ALL='C' to work around the problem."
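As a minimal sketch of that workaround applied to the pipeline from the question: the sample contents written to m.txt below are made up for illustration, and a POSIX-style shell (such as bash under Cygwin) is assumed. The `export` matters because a bare `LC_ALL='C'` assignment only sets a shell variable, so `sort` may never see it; that may be why the setting appeared not to help.

```shell
# Hypothetical sample input standing in for the real m.txt (CRLF line endings).
printf 'mwy\r\nenwedig\r\nmwy\r\n' > m.txt

# Export the variable so both sort invocations in the pipeline inherit the C locale.
export LC_ALL=C

# Same pipeline as in the question; sort now compares raw bytes
# instead of decoding locale-dependent multibyte characters.
sort < m.txt | uniq -c | sort -nr > m.dict
```

Note that this only silences the comparison error. The trailing '\r' characters are still part of each word in m.dict.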
Answer 2:
Looks like a Windows line-ending related problem (\r\n versus \n). You can convert m.txt to Unix line-endings with
dos2unix m.txt
and then rerun your command.
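If dos2unix is unavailable or refuses the file, the carriage returns can also be removed with tr. The following is a sketch under assumptions, not a tested fix for the asker's actual file: the sample data is invented, and m_unix.txt is a hypothetical output name chosen so the original file is left untouched.

```shell
# Hypothetical sample input with Windows (CRLF) line endings.
printf 'mwy\r\nenwedig\r\nmwy\r\n' > m.txt

# Delete every carriage return; each word stays on its own LF-terminated line.
tr -d '\r' < m.txt > m_unix.txt

# Byte-wise sort and count, as in the question's pipeline.
# LC_ALL=C is set per command here, so nothing needs to be exported.
LC_ALL=C sort < m_unix.txt | uniq -c | LC_ALL=C sort -nr > m.dict
```

Running `wc -l m_unix.txt` afterwards is a quick way to confirm that the words are still on separate lines, whatever a Windows editor displays for LF-only files.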
Source: https://stackoverflow.com/questions/36292307/sort-string-comparison-failed-invalid-or-incomplete-multibyte-or-wide-character