Iconv: EILSEQ with ASCII//IGNORE but not with ASCII//TRANSLIT//IGNORE

Submitted by 与世无争的帅哥 on 2019-12-12 12:15:00

Question


Using iconv with //TRANSLIT//IGNORE to convert from UTF-8 to ASCII works fine; it replaces the non-convertible characters with a proper transliteration according to the current locale (de_DE in my case):

> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//TRANSLIT//IGNORE'
moep

However, when just using //IGNORE without //TRANSLIT, it throws an error:

> echo 'möp' | iconv -f 'UTF8' -t 'ASCII//IGNORE'
mp
iconv: illegal input sequence at position 5

I wonder why that happens. The input sequence is exactly the same, and shouldn't //IGNORE simply skip invalid characters? When using the iconv C API, I get an EILSEQ error, so basically I can't tell whether the input string contained invalid UTF-8 or not.


Answer 1:


The manual page for iconv(1) on Linux says the following:

  -t to-encoding, --to-code=to-encoding
         Use to-encoding for output characters.

         If the string //IGNORE is appended to to-encoding, characters
         that cannot be converted are discarded and an error is printed
         after conversion.

It does skip the character, but also raises the error at the end.

It seems that with //IGNORE you really cannot distinguish between invalid characters in the input and characters that are valid but not convertible. In other words, the EILSEQ and EINVAL situations are handled the same.
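
One possible workaround, not part of the original answer, is to check the input separately before doing the lossy conversion. The sketch below assumes that a UTF-8 to UTF-8 pass through iconv fails only when the input itself is malformed, since every valid character is representable in the target. The helper name is_valid_utf8 is hypothetical, and the exit-status behaviour should be verified against the iconv implementation in use.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# Assumption: a UTF-8 to UTF-8 conversion has no unconvertible
# characters, so a failure here points at the input itself.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Valid input: 'm', U+00F6 (bytes 0xC3 0xB6), 'p'.
if printf 'm\303\266p\n' | is_valid_utf8; then
    echo "first input: valid UTF-8"
fi

# Malformed input: lead byte 0xC3 followed by an ASCII byte.
if ! printf 'm\303p\n' | is_valid_utf8; then
    echo "second input: not valid UTF-8"
fi
```

If the check passes, an error from a later ASCII//IGNORE run can then be attributed to dropped characters rather than to broken input.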




Answer 2:


It is possible to distinguish between an illegal sequence in the input text and characters merely being dropped by examining the reported offset of the illegal byte in the input sequence:

  • when the input really does contain an illegal sequence, the offset will be in the range 1 … input_bytes_count
  • when the input was fine but some characters were dropped, the offset will be equal to input_bytes_count + 1

möp is 4 bytes long, so the reported illegal sequence offset of 5 indicates that the input was OK, but some symbols were dropped because they couldn't be represented in the target encoding.
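
The rule above can be sketched as a small shell function. This is only an illustration of the comparison described in this answer: the name classify_offset and its output labels are hypothetical, and the offset is assumed to have been read from the "illegal input sequence at position N" message shown in the question.

```shell
#!/bin/sh
# Hypothetical helper implementing the rule from this answer.
# $1: offset reported by iconv, $2: byte count of the input text.
classify_offset() {
    offset=$1
    total=$2
    if [ "$offset" -ge 1 ] && [ "$offset" -le "$total" ]; then
        echo "illegal-sequence"
    elif [ "$offset" -eq $((total + 1)) ]; then
        echo "characters-dropped"
    else
        echo "unknown"
    fi
}

# Byte count of 'möp' in UTF-8, written with octal escapes so the
# result does not depend on the encoding of this script file.
# The arithmetic expansion strips any padding that wc may print.
n_bytes=$(( $(printf 'm\303\266p' | wc -c) ))

echo "$n_bytes"                  # prints 4
classify_offset 5 "$n_bytes"     # prints characters-dropped
classify_offset 2 "$n_bytes"     # prints illegal-sequence
```

The byte count here covers only the text itself, as in the answer; how the reported offset relates to the count should be checked against the iconv version in use before relying on it.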



Source: https://stackoverflow.com/questions/9249628/iconv-eilseq-with-ascii-ignore-but-not-with-ascii-translit-ignore
