How to detect illegal UTF-8 byte sequences and replace them in a Java InputStream?

独厮守ぢ 2021-02-02 15:22

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding). I want to do my best to extract as much informat

3 answers
  •  南方客 (original poster)
    2021-02-02 15:49

    One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. See http://en.wikipedia.org/wiki/Byte_order_mark for more information on the BOM; the page includes a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition: read the input a few bytes at a time (8 bits each) and check whether they form valid sequences. This is the more complicated solution, though.
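For what it's worth, here is a minimal sketch of a simpler route than hand-written pattern matching. It is not from the answer above: it lets the JDK's own UTF-8 decoder do the detection and asks it to substitute U+FFFD for anything malformed, so the valid text around the bad bytes survives. The class name `LenientUtf8` and the method `decode` are made up for illustration, and the sketch assumes the whole stream fits in memory. It also drops a leading BOM if one is present, since Java's UTF-8 decoder passes it through as U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
class LenientUtf8 {

    /** Reads the whole stream as UTF-8, replacing malformed byte sequences with U+FFFD. */
    static String decode(InputStream in) throws IOException {
        // REPLACE makes the decoder emit its replacement string (U+FFFD by default)
        // instead of throwing when it meets an illegal sequence.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }

        // A UTF-8 BOM (EF BB BF) decodes to U+FEFF; strip it if it leads the text.
        if (out.length() > 0 && out.charAt(0) == '\uFEFF') {
            out.deleteCharAt(0);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xC3 starts a two-byte sequence, but 'b' is not a continuation byte.
        byte[] input = {'a', (byte) 0xC3, 'b'};
        String text = decode(new ByteArrayInputStream(input));
        System.out.println(text.equals("a\uFFFDb")); // prints "true"
    }
}
```

If you need to know where the bad bytes were rather than just replace them, `CodingErrorAction.REPORT` makes the decoder throw a `MalformedInputException` instead, but then you have to handle resuming yourself.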
