Iconv: EILSEQ with ASCII//IGNORE but not with ASCII//TRANSLIT//IGNORE

Question

Using iconv with //TRANSLIT//IGNORE to convert from UTF-8 to ASCII works fine; it replaces non-convertible characters with a suitable transliteration according to the current locale (de_DE in my case):

> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//TRANSLIT//IGNORE'
moep

However, when using just //IGNORE without //TRANSLIT, it reports an error:

> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//IGNORE'
mp
iconv: illegal input sequence at position 5

I wonder why that happens. The input sequence is exactly the same, and shouldn't //IGNORE simply skip invalid characters? When using the iconv C API, I get an EILSEQ error, so I cannot tell whether the input string contained invalid UTF-8 or not.

Answer 1:

The manual page for iconv(1) on Linux says the following:

  -t to-encoding, --to-code=to-encoding
         Use to-encoding for output characters.

         If the string //IGNORE is appended to to-encoding, characters
         that cannot be converted are discarded and an error is printed
         after conversion.

It does skip the character, but also raises the error at the end.

It seems that with //IGNORE you cannot really distinguish between invalid characters in the input and characters that merely cannot be converted. In other words, both situations end up reported the same way, as EILSEQ.
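
One possible workaround, sketched here as an assumption rather than something taken from the answer, is to validate the input separately before the lossy conversion. Converting UTF-8 to UTF-8 without //IGNORE fails only when the input itself is malformed, so any error reported by the later //IGNORE run can then be attributed to dropped characters. The function name to_ascii_lossy and the exit code 2 are hypothetical choices for illustration.

```shell
# to_ascii_lossy TEXT  (hypothetical helper, not part of iconv)
# Step 1: check that TEXT is well-formed UTF-8 by converting it to UTF-8.
# Step 2: convert to ASCII, discarding whatever ASCII cannot represent.
to_ascii_lossy() {
    text=$1
    if ! printf '%s' "$text" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo "invalid UTF-8 input" >&2
        return 2
    fi
    # The input is known to be valid at this point, so an error from the
    # //IGNORE conversion can only mean that characters were dropped.
    # Its message and exit status are deliberately suppressed.
    printf '%s' "$text" | iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null
    return 0
}
```

Called as to_ascii_lossy 'möp', this prints whatever the local iconv produces with //IGNORE and returns 0, while malformed input is rejected with status 2 before any conversion takes place.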

Answer 2:

It is possible to distinguish between an illegal sequence in the input text and characters merely being dropped by examining the reported offset of the illegal byte in the input sequence:

When the input does contain an illegal sequence, the offset falls in the range 1 … input_bytes_count.
When the input was fine but some characters were dropped, the offset equals input_bytes_count + 1.

möp is 4 bytes long in UTF-8, so the reported illegal sequence offset of 5 indicates that the input was valid, but some characters were dropped because they could not be represented in the target encoding.
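
The rule above can be written down as a small shell function. This is only a sketch of the heuristic as stated in this answer: the name classify_offset and the labels it prints are hypothetical, and it assumes the byte count is taken the way the answer takes it, counting the text itself. Note that echo appends a newline, so the stream iconv actually reads is one byte longer than the text; adjust the count if your input is produced differently.

```shell
# classify_offset OFFSET INPUT_BYTES  (hypothetical helper)
# OFFSET is the position printed by iconv, INPUT_BYTES is the byte
# length of the converted text, counted as in the answer above.
classify_offset() {
    offset=$1
    input_bytes=$2
    if [ "$offset" -ge 1 ] && [ "$offset" -le "$input_bytes" ]; then
        # Offset points inside the text: an illegal sequence was hit.
        echo "illegal-sequence"
    elif [ "$offset" -eq $((input_bytes + 1)) ]; then
        # Offset points just past the text: characters were dropped.
        echo "dropped-characters"
    else
        # Anything else is outside the rule and is left undecided.
        echo "unknown"
    fi
}
```

For the möp example, classify_offset 5 4 prints dropped-characters.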

Source: https://stackoverflow.com/questions/9249628/iconv-eilseq-with-ascii-ignore-but-not-with-ascii-translit-ignore

Tags

iconv