Byte order mark screws up file reading in Java

说谎 2020-11-22 02:55

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the r

9 answers
  •  无人及你
    2020-11-22 03:30

    The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

    BOMInputStream bomIn = new BOMInputStream(in);
    int firstNonBOMByte = bomIn.read(); // Skips BOM
    if (bomIn.hasBOM()) {
        // has a UTF-8 BOM
    }
    

    If you also need to detect different encodings, BOMInputStream can distinguish among several byte-order marks, e.g. UTF-8 vs. UTF-16 big- and little-endian; details are at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no reliable way to detect what encoding some bytes are in, but a BOM at the start of the stream is a strong hint.
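
    To illustrate how a detected BOM can pick the Charset, here is a minimal JDK-only sketch. It is not how BOMInputStream is implemented; BomSniffer and its open method are hypothetical names made up for this example, and it only handles the UTF-8 and UTF-16 marks (UTF-32 is left out, so a UTF-32LE stream would be misread as UTF-16LE).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Commons IO): sniff a BOM, skip it, pick a Charset. */
final class BomSniffer {

    private BomSniffer() {}

    /**
     * Returns a Reader positioned just after a UTF-8 or UTF-16 BOM, if one is present.
     * Without a BOM, nothing is skipped and defaultCharset is used for decoding.
     */
    static Reader open(InputStream in, Charset defaultCharset) throws IOException {
        // Pushback buffer of 3 bytes: the longest BOM handled here (UTF-8).
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pin, head);

        Charset charset = defaultCharset;
        int bomLength = 0;
        if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Push back every byte that was read but is not part of the BOM.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }

    /** Treats a byte as an unsigned value so it can be compared with 0xEF etc. */
    private static int u(byte b) {
        return b & 0xFF;
    }

    /** Reads until buf is full or the stream ends; returns the number of bytes read. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }
}
```

    For a CSV file you could wrap the result in a BufferedReader, e.g. new BufferedReader(BomSniffer.open(Files.newInputStream(path), StandardCharsets.UTF_8)), so the first line no longer starts with the BOM.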

    Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor call should be:

    BOMInputStream bomIn = new BOMInputStream(in, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
            ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
    

    Upvote @martin-charlesworth's comment :)
