How to detect illegal UTF-8 byte sequences and replace them in a Java InputStream?

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding). I want to do my best to extract as much informat

3 Answers

梦谈多话

2021-02-02 15:36

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).

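CharsetDecoder itself decodes a ByteBuffer into a CharBuffer; it does not write to an OutputStream. To filter a stream, pass a configured decoder to the InputStreamReader(InputStream, CharsetDecoder) constructor. If you need a cleaned byte stream rather than characters, you can re-encode the decoded text into a java.io.PipedOutputStream connected to a PipedInputStream, effectively creating a filtered InputStream.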
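A rough sketch of the decoder approach, in case it helps: a CharsetDecoder configured with CodingErrorAction.REPLACE can be handed straight to InputStreamReader. The class name LenientUtf8, the helper names open and decode, and the caller-chosen replacement character are assumptions made for illustration; they are not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientUtf8 {

    // Hypothetical helper: wraps a stream in a Reader that substitutes
    // `replacement` for each malformed UTF-8 sequence instead of failing.
    static Reader open(InputStream in, char replacement) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                // The JDK's UTF-8 decoder accepts at most one char here.
                .replaceWith(String.valueOf(replacement));
        return new InputStreamReader(in, decoder);
    }

    // Convenience wrapper for trying the reader on an in-memory byte array.
    static String decode(byte[] data, char replacement) {
        try (Reader reader = open(new ByteArrayInputStream(data), replacement)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, LenientUtf8.decode(new byte[] {'a', (byte) 0xFF, 'b'}, '?') should give "a?b", since the byte 0xFF can never occur in well-formed UTF-8.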

走了就别回头了

2021-02-02 15:43
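The behaviour you want is already the default for InputStreamReader: when it is constructed with a Charset, malformed input is replaced with U+FFFD instead of raising an error, so there is no need to specify it yourself. This suffices: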
```
// istream: the raw InputStream to read (assumed to be defined by the caller)
final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
```
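To see that default in action, here is a small self-contained sketch that reads an in-memory byte array through the same reader chain and counts the U+FFFD characters that come out. The class name DefaultReplacementDemo and both helper names are made up for this example. Note that counting U+FFFD is only a heuristic for detection: a file that legitimately contains U+FFFD would be counted too.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class DefaultReplacementDemo {

    // Reads the bytes as UTF-8 using InputStreamReader's default error handling.
    static String readUtf8(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new BufferedInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts U+FFFD characters in the decoded text (heuristic, see lead-in).
    static int countReplacements(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }
}
```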
南方客

2021-02-02 15:49

One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark. That page has a table of the BOM bytes for each encoding. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition: read a few bytes at a time, 8 bits each, and check them against the UTF-8 byte patterns. That is the more complicated solution, though.
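Both ideas can be sketched in a few lines. The BOM check compares the first three bytes against EF BB BF, the UTF-8 BOM. For the pattern check, this sketch does not hand-roll the byte-pattern matching; it swaps in a strict CharsetDecoder with CodingErrorAction.REPORT, which rejects malformed sequences for you. The class name Utf8Checks and the method names are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Checks {

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // True if the whole array decodes as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown on the first malformed or truncated sequence.
            return false;
        }
    }
}
```

A missing BOM says nothing about validity, so the strict decode is the more useful of the two checks when the goal is to find out whether a file contains illegal sequences at all.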
