Question
Using iconv with //TRANSLIT//IGNORE to convert from UTF-8 to ASCII works fine; it replaces the non-convertible characters with a proper transliteration according to the current locale (de_DE in my case):
> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//TRANSLIT//IGNORE'
moep
However, when using just //IGNORE without //TRANSLIT, it throws an error:
> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//IGNORE'
mp
iconv: illegal input sequence at position 5
I wonder why that happens. The input sequence is exactly the same, and shouldn't //IGNORE simply skip invalid characters?
When using the iconv C API, I get an EILSEQ error, so basically I can't tell whether the input string contained invalid UTF-8 or not.
Answer 1:
The manual page for iconv(1) on Linux says the following:
-t to-encoding, --to-code=to-encoding
    Use to-encoding for output characters. If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.
It does skip the character, but also raises the error at the end.
It seems that with //IGNORE you really cannot distinguish input that contains invalid sequences from input that merely contains non-convertible characters: both cases are reported as EILSEQ. (EINVAL, by contrast, is what iconv(3) returns for an incomplete multibyte sequence at the end of the input.)
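One possible workaround for this ambiguity, offered here as a sketch rather than anything the iconv documentation prescribes, is to run two passes without //IGNORE: first convert UTF-8 to UTF-8 to check that the input is well-formed, then convert to plain ASCII to check whether anything would be lost. The helper name classify_utf8 is hypothetical, and the sketch assumes an iconv that rejects malformed input in a UTF-8 to UTF-8 conversion (glibc's does).

```shell
#!/bin/sh
# classify_utf8 is a hypothetical helper, not part of iconv.
# It reads text on stdin and prints one of:
#   invalid - the input is not well-formed UTF-8
#   lossy   - the input is valid UTF-8 but has characters ASCII cannot represent
#   clean   - the input is valid UTF-8 and converts to ASCII without loss
classify_utf8() {
    tmp=$(mktemp) || return 2
    cat > "$tmp"                      # keep a copy so stdin can be read twice
    if ! iconv -f UTF-8 -t UTF-8 < "$tmp" > /dev/null 2>&1; then
        result=invalid                # pass 1 failed: malformed input
    elif ! iconv -f UTF-8 -t ASCII < "$tmp" > /dev/null 2>&1; then
        result=lossy                  # pass 2 failed: unrepresentable characters
    else
        result=clean
    fi
    rm -f "$tmp"
    echo "$result"
}

# \303\266 is the UTF-8 encoding of "ö"; \377 is never valid in UTF-8.
printf 'm\303\266p\n' | classify_utf8   # lossy
printf 'm\377p\n'     | classify_utf8   # invalid
printf 'mop\n'        | classify_utf8   # clean
```

Once the input is known to be valid, a later EILSEQ from an ASCII//IGNORE conversion can only mean that characters were dropped.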
Answer 2:
It is possible to distinguish between an illegal sequence in the input text and characters merely being dropped by examining the reported offset of the illegal byte in the input sequence:
- when the input really does contain an illegal sequence, the offset will be in the range 1 … (input_bytes_count)
- when the input was fine but some characters were dropped, the offset will be equal to input_bytes_count + 1
möp is 4 bytes long, so the reported illegal sequence offset of 5 indicates that the input was OK, but some symbols were dropped because they couldn't be represented in the target encoding.
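The byte counts behind this reasoning can be checked with wc -c. The sketch below assumes UTF-8 input (written with octal escapes so it does not depend on the terminal encoding) and counts the string both bare and with the trailing newline that echo appends. If iconv's reported position is zero-based, which appears to be the case for glibc but is an assumption here, position 5 would be exactly the end of the 5 bytes iconv read from the pipe.

```shell
#!/bin/sh
# "ö" is two bytes in UTF-8 (0xC3 0xB6, octal \303\266), so "möp" is 4 bytes.
bare=$(printf 'm\303\266p' | wc -c | tr -d ' ')

# echo 'möp' appends a newline, so the pipe actually carries 5 bytes.
piped=$(printf 'm\303\266p\n' | wc -c | tr -d ' ')

echo "bare=$bare piped=$piped"   # bare=4 piped=5
```

Whichever way the offset is counted, the point of the comparison is the same: an offset at or beyond the end of the input means no byte inside the input was itself illegal.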
Source: https://stackoverflow.com/questions/9249628/iconv-eilseq-with-ascii-ignore-but-not-with-ascii-translit-ignore