Byte order mark screws up file reading in Java

说谎 2020-11-22 02:55

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the r

9 Answers
  • 2020-11-22 03:26

    Regrettably not. You'll have to identify and skip the BOM yourself. This page details what you have to watch for. Also see this SO question for more details.
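
    To make "identify and skip yourself" concrete, here is a minimal sketch for the UTF-8 case only, using nothing but the JDK. The class and method names (Utf8BomSkipSketch, skipUtf8Bom, firstLine) are made up for illustration, and it assumes the file is known to be UTF-8; other BOMs would need their own checks.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of any answer in this thread.
final class Utf8BomSkipSketch {

    /** Returns a stream positioned just after a leading UTF-8 BOM (EF BB BF), if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    /** Convenience for the demo: decodes the bytes as UTF-8 and returns the first line. */
    public static String firstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd', ',', 'x'};
        byte[] withoutBom = {'i', 'd', ',', 'x'};
        System.out.println(firstLine(withBom));
        System.out.println(firstLine(withoutBom));
    }
}
```

    For a real file you would pass a FileInputStream to skipUtf8Bom instead of the in-memory stream. The pushback buffer is what lets files without a BOM come through untouched.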

  • 2020-11-22 03:27

    To simply remove the BOM characters from your file, I recommend using Apache Commons IO:

    public BOMInputStream(InputStream delegate,
                  boolean include)
    Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.
    Parameters:
    delegate - the InputStream to delegate to
    include - true to include the UTF-8 BOM or false to exclude it
    

    Set include to false and your BOM characters will be excluded.

  • 2020-11-22 03:28

    A simpler solution:

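    import java.io.IOException;
    import java.io.Reader;

    public class BOMSkipper
    {
        // Consumes a leading U+FEFF if there is one. The reader must support
        // mark/reset (BufferedReader does).
        public static void skip(Reader reader) throws IOException
        {
            reader.mark(1);
            char[] possibleBOM = new char[1];
            reader.read(possibleBOM);
    
            // Not a BOM: rewind so the character is read again by the caller.
            if (possibleBOM[0] != '\ufeff')
            {
                reader.reset();
            }
        }
    }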
    

    Usage sample:

    BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
    BOMSkipper.skip(input);
    // Any leading BOM has been consumed at this point:
    input.readLine();
    ...
    

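    It works with all five UTF encodings (UTF-8, UTF-16BE/LE, UTF-32BE/LE), as long as the reader is created with the file's actual charset: whatever bytes encoded the BOM, a decoder that passes it through delivers it as the single character U+FEFF.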
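
    Why a one-character check is enough can be seen with a small experiment. This is only a sketch: BomDecodingSketch and startsWithBomChar are names invented here, and it covers UTF-8 and UTF-16 only (UTF-32 is left out). It decodes a few BOM-prefixed byte arrays and checks whether the resulting text starts with U+FEFF.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, written only to illustrate the point above.
final class BomDecodingSketch {

    /** Decodes the bytes with the given charset and reports whether the text starts with U+FEFF. */
    public static boolean startsWithBomChar(byte[] data, Charset charset) {
        String text = new String(data, charset);
        return !text.isEmpty() && text.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0, 'a'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'a', 0};

        // UTF-8 and the endian-specific UTF-16 charsets keep the BOM as a character.
        System.out.println(startsWithBomChar(utf8, StandardCharsets.UTF_8));
        System.out.println(startsWithBomChar(utf16be, StandardCharsets.UTF_16BE));
        System.out.println(startsWithBomChar(utf16le, StandardCharsets.UTF_16LE));

        // The generic UTF-16 charset uses the BOM to pick the byte order and drops it.
        System.out.println(startsWithBomChar(utf16be, StandardCharsets.UTF_16));
    }
}
```

    In the last case there is simply nothing left to skip, so BOMSkipper.skip resets the reader and does no harm.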

  • 2020-11-22 03:30

    The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

    BOMInputStream bomIn = new BOMInputStream(in);
    int firstNonBOMByte = bomIn.read(); // Skips BOM
    if (bomIn.hasBOM()) {
        // has a UTF-8 BOM
    }
    

    If you also need to detect different encodings, it can distinguish among the various byte-order marks, e.g. UTF-8 vs. UTF-16 big- and little-endian; details are at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no reliable way to detect what encoding some bytes are in, but if the stream starts with a BOM, that is a useful hint.
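
    The "choose a Charset from the detected BOM" step can be sketched without any library. The code below is not the Commons IO API; BomCharsetSketch and charsetFromBom are hypothetical names, and it assumes the running JVM provides the UTF-32BE and UTF-32LE charsets (they are not among the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps the first bytes of a stream to the charset its BOM implies.
final class BomCharsetSketch {

    /** Returns the charset implied by a leading BOM, or null when the bytes start with none. */
    public static Charset charsetFromBom(byte[] head) {
        // Test the 4-byte UTF-32 marks first: FF FE 00 00 also begins with the UTF-16LE mark.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return Charset.forName("UTF-32BE"); // assumption: charset available on this JVM
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return Charset.forName("UTF-32LE"); // assumption: charset available on this JVM
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 'a', 0};
        System.out.println(charsetFromBom(head));
    }
}
```

    A null result means "no BOM", in which case you still have to know or guess the encoding some other way. Note the ambiguity the ordering cannot remove: a UTF-16LE file whose first character is U+0000 looks like UTF-32LE.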

    Edit: If you need to detect the BOM in UTF-16, UTF-32, etc, then the constructor should be:

    new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
            ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)
    

    Upvote @martin-charlesworth's comment :)

  • 2020-11-22 03:32

    EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream


    Here is a class I wrote a while ago; I only changed the package name before pasting. Nothing special: it is quite similar to the solutions posted in Sun's bug database. Incorporate it in your code and you're fine.

    /* ____________________________________________________________________________
     * 
     * File:    UnicodeBOMInputStream.java
     * Author:  Gregory Pakosz.
     * Date:    02 - November - 2005    
     * ____________________________________________________________________________
     */
    package com.stackoverflow.answer;
    
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;
    
    /**
     * The <code>UnicodeBOMInputStream</code> class wraps any
     * <code>InputStream</code> and detects the presence of any Unicode BOM
     * (Byte Order Mark) at its beginning, as defined by
     * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
     * 
     * <p>The
     * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
     * defines 5 types of BOMs:<ul>
     * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
     * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
     * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
     * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
     * <li><pre>EF BB BF     = UTF-8</pre></li>
     * </ul></p>
     * 
     * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
     * or not.
     * </p>
     * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
     * wrapped <code>InputStream</code> object.</p>
     */
    public class UnicodeBOMInputStream extends InputStream
    {
      /**
       * Type safe enumeration class that describes the different types of Unicode
       * BOMs.
       */
      public static final class BOM
      {
        /**
         * NONE.
         */
        public static final BOM NONE = new BOM(new byte[]{},"NONE");
    
        /**
         * UTF-8 BOM (EF BB BF).
         */
        public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                           (byte)0xBB,
                                                           (byte)0xBF},
                                                "UTF-8");
    
        /**
         * UTF-16, little-endian (FF FE).
         */
        public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                                (byte)0xFE},
                                                    "UTF-16 little-endian");
    
        /**
         * UTF-16, big-endian (FE FF).
         */
        public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                                (byte)0xFF},
                                                    "UTF-16 big-endian");
    
        /**
         * UTF-32, little-endian (FF FE 00 00).
         */
        public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                                (byte)0xFE,
                                                                (byte)0x00,
                                                                (byte)0x00},
                                                    "UTF-32 little-endian");
    
        /**
         * UTF-32, big-endian (00 00 FE FF).
         */
        public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                                (byte)0x00,
                                                                (byte)0xFE,
                                                                (byte)0xFF},
                                                    "UTF-32 big-endian");
    
        /**
         * Returns a <code>String</code> representation of this <code>BOM</code>
         * value.
         */
        public final String toString()
        {
          return description;
        }
    
        /**
         * Returns the bytes corresponding to this <code>BOM</code> value.
         */
        public final byte[] getBytes()
        {
          final int     length = bytes.length;
          final byte[]  result = new byte[length];
    
          // Make a defensive copy
          System.arraycopy(bytes,0,result,0,length);
    
          return result;
        }
    
        private BOM(final byte bom[], final String description)
        {
          assert(bom != null)               : "invalid BOM: null is not allowed";
          assert(description != null)       : "invalid description: null is not allowed";
          assert(description.length() != 0) : "invalid description: empty string is not allowed";
    
          this.bytes          = bom;
          this.description  = description;
        }
    
                final byte    bytes[];
        private final String  description;
    
      } // BOM
    
      /**
       * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
       * specified <code>InputStream</code>.
       * 
       * @param inputStream an <code>InputStream</code>.
       * 
       * @throws NullPointerException when <code>inputStream</code> is
       * <code>null</code>.
       * @throws IOException on reading from the specified <code>InputStream</code>
       * when trying to detect the Unicode BOM.
       */
      public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                          IOException
    
      {
        if (inputStream == null)
          throw new NullPointerException("invalid input stream: null is not allowed");
    
        in = new PushbackInputStream(inputStream,4);
    
        final byte  bom[] = new byte[4];
        final int   read  = in.read(bom);
    
        switch(read)
        {
          case 4:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) &&
                (bom[3] == (byte)0x00))
            {
              this.bom = BOM.UTF_32_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0x00) &&
                (bom[1] == (byte)0x00) &&
                (bom[2] == (byte)0xFE) &&
                (bom[3] == (byte)0xFF))
            {
              this.bom = BOM.UTF_32_BE;
              break;
            }
            // No UTF-32 BOM: fall through and test the shorter signatures.
    
          case 3:
            if ((bom[0] == (byte)0xEF) &&
                (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF))
            {
              this.bom = BOM.UTF_8;
              break;
            }
            // No UTF-8 BOM: fall through and test the UTF-16 signatures.
    
          case 2:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE))
            {
              this.bom = BOM.UTF_16_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0xFE) &&
                (bom[1] == (byte)0xFF))
            {
              this.bom = BOM.UTF_16_BE;
              break;
            }
            // No UTF-16 BOM either: fall through to NONE.
    
          default:
            this.bom = BOM.NONE;
            break;
        }
    
        if (read > 0)
          in.unread(bom,0,read);
      }
    
      /**
       * Returns the <code>BOM</code> that was detected in the wrapped
       * <code>InputStream</code> object.
       * 
       * @return a <code>BOM</code> value.
       */
      public final BOM getBOM()
      {
        // BOM type is immutable.
        return bom;
      }
    
      /**
       * Skips the <code>BOM</code> that was found in the wrapped
       * <code>InputStream</code> object.
       * 
       * @return this <code>UnicodeBOMInputStream</code>.
       * 
       * @throws IOException when trying to skip the BOM from the wrapped
       * <code>InputStream</code> object.
       */
      public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
      {
        if (!skipped)
        {
          in.skip(bom.bytes.length);
          skipped = true;
        }
        return this;
      }
    
      /**
       * {@inheritDoc}
       */
      public int read() throws IOException
      {
        return in.read();
      }
    
      /**
       * {@inheritDoc}
       */
      public int read(final byte b[]) throws  IOException,
                                              NullPointerException
      {
        return in.read(b,0,b.length);
      }
    
      /**
       * {@inheritDoc}
       */
      public int read(final byte b[],
                      final int off,
                      final int len) throws IOException,
                                            NullPointerException
      {
        return in.read(b,off,len);
      }
    
      /**
       * {@inheritDoc}
       */
      public long skip(final long n) throws IOException
      {
        return in.skip(n);
      }
    
      /**
       * {@inheritDoc}
       */
      public int available() throws IOException
      {
        return in.available();
      }
    
      /**
       * {@inheritDoc}
       */
      public void close() throws IOException
      {
        in.close();
      }
    
      /**
       * {@inheritDoc}
       */
      public synchronized void mark(final int readlimit)
      {
        in.mark(readlimit);
      }
    
      /**
       * {@inheritDoc}
       */
      public synchronized void reset() throws IOException
      {
        in.reset();
      }
    
      /**
       * {@inheritDoc}
       */
      public boolean markSupported() 
      {
        return in.markSupported();
      }
    
      private final PushbackInputStream in;
      private final BOM                 bom;
      private       boolean             skipped = false;
    
    } // UnicodeBOMInputStream
    

    And you use it this way:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    
    public final class UnicodeBOMInputStreamUsage
    {
      public static void main(final String[] args) throws Exception
      {
        FileInputStream fis = new FileInputStream("test/offending_bom.txt");
        UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);
    
        System.out.println("detected BOM: " + ubis.getBOM());
    
        System.out.print("Reading the content of the file without skipping the BOM: ");
        InputStreamReader isr = new InputStreamReader(ubis);
        BufferedReader br = new BufferedReader(isr);
    
        System.out.println(br.readLine());
    
        br.close();
        isr.close();
        ubis.close();
        fis.close();
    
        fis = new FileInputStream("test/offending_bom.txt");
        ubis = new UnicodeBOMInputStream(fis);
        isr = new InputStreamReader(ubis);
        br = new BufferedReader(isr);
    
        ubis.skipBOM();
    
        System.out.print("Reading the content of the file after skipping the BOM: ");
        System.out.println(br.readLine());
    
        br.close();
        isr.close();
        ubis.close();
        fis.close();
      }
    
    } // UnicodeBOMInputStreamUsage
    
  • 2020-11-22 03:39

    Notepad++ is a good tool for converting a file from UTF-8 to UTF-8 with BOM.

    https://notepad-plus-plus.org/downloads/
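
    If you would rather produce a BOM-prefixed test file from code than from an editor, something along these lines should do. It is only a sketch: BomTestFileSketch and withUtf8Bom are invented names, and the file goes to a temporary location rather than to test.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for generating test input that starts with a UTF-8 BOM.
final class BomTestFileSketch {

    /** Encodes the text as UTF-8 and puts the BOM bytes EF BB BF in front of it. */
    public static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[3 + body.length];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Write a small CSV with a BOM to a temp file and report where it went.
        Path file = Files.createTempFile("bom-test", ".txt");
        Files.write(file, withUtf8Bom("id,name\n1,Alice\n"));
        System.out.println("wrote " + file);
    }
}
```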

    UTF8BOMTester.java

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    
    public class UTF8BOMTester {
    
    public static void main(String[] args) throws IOException {
        File file = new File("test.txt");
        // True when the file starts with the UTF-8 BOM (EF BB BF).
        boolean same = UTF8BOMInputStream.isSameEncodingType(file);
        System.out.println(same);
        if (same) {
            // The constructor has already consumed the BOM, so the first line comes back clean.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new UTF8BOMInputStream(file), "UTF-8"))) {
                System.out.println(br.readLine());
            }
        }
    
    }
    
    static void bytesPrint(byte[] b) {
        for (byte a : b)
            System.out.printf("%x ", a);
    }
    }
    

    UTF8BOMInputStream.java

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    
    public class UTF8BOMInputStream extends InputStream {
    
    static final byte[] SYMBLE_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
    final FileInputStream fis;
    final boolean isSameEncodingType;
    
    public UTF8BOMInputStream(File file) throws IOException {
        FileInputStream fis = new FileInputStream(file);
        // Read the first three bytes; when they are the BOM they stay consumed.
        byte[] symble = new byte[3];
        fis.read(symble);
        bytesPrint(symble);
        isSameEncodingType = isSameEncodingType(symble);
        if (isSameEncodingType) {
            this.fis = fis;
        } else {
            // No UTF-8 BOM: release the file handle; this stream cannot be read.
            fis.close();
            this.fis = null;
        }
    }
    
    @Override
    public int read() throws IOException {
        if (fis == null)
            throw new IOException("file does not start with a UTF-8 BOM");
        return fis.read();
    }
    
    @Override
    public void close() throws IOException {
        if (fis != null)
            fis.close();
    }
    
    void bytesPrint(byte[] b) {
        for (byte a : b)
            System.out.printf("%x ", a);
    }
    
    boolean isSameEncodingType(byte[] symble) {
        return Arrays.equals(symble, SYMBLE_BOM);
    }
    
    // Opens the file only to test for the BOM, then closes it again.
    public static boolean isSameEncodingType(File file) throws IOException {
        try (UTF8BOMInputStream in = new UTF8BOMInputStream(file)) {
            return in.isSameEncodingType;
        }
    }
    }
    