Java read file got a leading BOM [ï»¿]

孤独总比滥情好 2020-12-20 23:15

I am reading a file of keywords line by line and found a strange problem. I expect that when consecutive lines have the same contents, they should be handl…

6 Answers
  • 2020-12-20 23:46

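    There must be a space or some non-printable character at the start. So either fix that in the file, or trim the Strings during or before comparison.

    [Edited]

    If String.trim() does not help, try String.replaceAll() with a suitable regex, for example str.replaceAll("\\p{Cntrl}", ""). Note that by default \p{Cntrl} covers only the ASCII control characters, so a BOM (U+FEFF) has to be matched explicitly, e.g. str.replaceAll("^\\uFEFF", "").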
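    To make the difference concrete, here is a minimal sketch comparing the three approaches on a line that starts with U+FEFF. The class name BomTrimDemo and the helper stripLeadingBom are made up for illustration and are not part of the question's code.

```java
final class BomTrimDemo {

    /** Removes a single leading U+FEFF if present; otherwise returns the input unchanged. */
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String line = "\uFEFFkeyword";

        // trim() only strips characters up to U+0020, so the BOM survives.
        System.out.println(line.trim().length());                       // 8

        // By default \p{Cntrl} matches only [\x00-\x1F\x7F], so the BOM survives here too.
        System.out.println(line.replaceAll("\\p{Cntrl}", "").length()); // 8

        // Matching U+FEFF explicitly removes it.
        System.out.println(stripLeadingBom(line).length());             // 7
    }
}
```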

  • 2020-12-20 23:47

    What is the encoding of the file?

    The unseen char at the start of the file could be the Byte Order Mark (BOM).

    Saving with ANSI or UTF-8 without BOM can help highlight this for you.
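    If re-saving the file is not convenient, the first bytes can also be checked from code. The following is only a sketch: it looks for the UTF-8 BOM (bytes EF BB BF) and ignores the UTF-16/UTF-32 variants. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BomDetector {

    /** Returns true if the given bytes begin with the UTF-8 BOM EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads up to three bytes from the file and checks them for the UTF-8 BOM. */
    static boolean fileHasUtf8Bom(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {
                int r = in.read(head, n, 3 - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
            return n == 3 && startsWithUtf8Bom(head);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(fileHasUtf8Bom(Paths.get(args[0])));
        }
    }
}
```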

  • 2020-12-20 23:48

    The Byte Order Mark (BOM) is the Unicode character U+FEFF. Its use is optional, but when present it appears at the start of the text stream, so a reader that does not expect it shows characters like ï»¿ at the beginning of the text.

    • Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
    File file = new File( csvFilename );
    FileInputStream inputStream = new FileInputStream(file);
    // [{"Key2":"21","Key1":"11","Key3":"31"} ]
    InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
    

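    We can handle this by explicitly passing UTF-8 as the charset to InputStreamReader. In UTF-8 the byte sequence EF BB BF (displayed as ï»¿ when misread as a single-byte encoding) then decodes to one character, U+FEFF, which still has to be removed from the first value.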
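    As a small illustration of that decoding step, the sketch below decodes the same three bytes once as UTF-8 and once as ISO-8859-1. The class name BomDecodeDemo is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDecodeDemo {

    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decodes the three BOM bytes with the given charset. */
    static String decode(Charset charset) {
        return new String(UTF8_BOM, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(StandardCharsets.UTF_8);
        String asLatin1 = decode(StandardCharsets.ISO_8859_1);

        // As UTF-8 the three bytes form one character, U+FEFF; Java keeps it in the string.
        System.out.println(asUtf8.length());                           // 1
        System.out.println(asUtf8.charAt(0) == '\uFEFF');              // true

        // As ISO-8859-1 each byte becomes its own character: U+00EF, U+00BB, U+00BF.
        System.out.println(asLatin1.length());                         // 3
        System.out.println(asLatin1.equals("\u00EF\u00BB\u00BF"));     // true
    }
}
```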

    Using Google Guava's jar CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:

    // Recent Guava versions expose these matchers as static methods instead of constants.
    String printable = CharMatcher.invisible().removeFrom( input );
    String clean = CharMatcher.ascii().retainFrom( printable );
    

    Full Example to read data from the CSV file to JSON Object:

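    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.stream.Stream;

    import com.google.common.base.CharMatcher;
    // Assumption: JSONObject comes from json-simple; the original answer does not say.
    import org.json.simple.JSONObject;

    public class CSV_FileOperations {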
        static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
        protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();
    
        public static void main(String[] args) {
            String csvFilename = "D:/Yashwanth/json2Bson.csv";
    
            csvToJSONString(csvFilename);
            String jsonData = jsonArray.toString();
            System.out.println("File JSON Data : \n"+ jsonData);
        }
    
        @SuppressWarnings("deprecation")
        public static String csvToJSONString( String csvFilename ) {
            try {
                File file = new File( csvFilename );
                FileInputStream inputStream = new FileInputStream(file);
    
                // Use the last dot so that names such as "data.v2.csv" yield ".csv".
                String fileExtensionName = csvFilename.substring(csvFilename.lastIndexOf('.'));
                System.out.println("File Extension : "+ fileExtensionName);
    
                // [{"Key2":"21","Key1":"11","Key3":"31"} ]
                InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );
    
                BufferedReader buffer = new BufferedReader( inputStreamReader );
                Stream<String> readLines = buffer.lines();
                boolean headerStream = true;
    
                List<String> headers = new ArrayList<String>();
                for (String line : (Iterable<String>) () -> readLines.iterator()) {
                    String[] columns = line.split(",");
                    if (headerStream) {
                        System.out.println(" ===== Headers =====");
    
                        for (String keys : columns) {
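                            // Strip the BOM (U+FEFF, bytes EF BB BF in UTF-8) « https://stackoverflow.com/a/11021401/5081877
                            String printable = CharMatcher.invisible().removeFrom( keys );
                            String clean = CharMatcher.ascii().retainFrom(printable);
                            // replaceAll() takes a regex; replace() would look for the literal text "\P{Print}".
                            String key = clean.replaceAll("\\P{Print}", "");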
                            headers.add( key );
                        }
                        headerStream = false;
                        System.out.println(" ===== ----- Data ----- =====");
                    } else {
                        addCSVData(headers, columns );
                    }
                }
                // Closing the BufferedReader also closes the wrapped reader and stream.
                buffer.close();
    
    
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return jsonArray.toString();
        }
        @SuppressWarnings("unchecked")
        public static void addCSVData( List<String> headers, String[] row ) {
            if( headers.size() == row.length ) {
                HashMap<String,String> mapObj = new HashMap<String,String>();
                JSONObject jsonObj = new JSONObject();
                for (int i = 0; i < row.length; i++) {
                    mapObj.put(headers.get(i), row[i]);
                    jsonObj.put(headers.get(i), row[i]);
                }
                jsonArray.add(jsonObj);
                listObjects.add(mapObj);
            } else {
                System.out.println("Avoiding the Row Data...");
            }
        }
    }
    

    Contents of json2Bson.csv:

    Key1    Key2    Key3
    11  21  31
    12  22  32
    13  23  33
    
  • 2020-12-20 23:55

    If spaces are not important in the processing it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this - spaces can easily creep into a file if it has to be edited manually and if they're not important they can and should be ignored.

    Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.

    Try:

    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
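    Specifying the encoding makes the BOM decode to a single U+FEFF character, but Java's UTF-8 decoder does not drop it. One possible way to skip it, sketched here with hypothetical names (BomSkippingReader, open, firstLine), is to peek at the first character with mark() and reset():

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomSkippingReader {

    /** Wraps the stream in a UTF-8 reader and consumes a leading U+FEFF if there is one. */
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);
        if (reader.read() != 0xFEFF) {
            reader.reset(); // first char is real data (or the stream is empty): put it back
        }
        return reader;
    }

    /** Convenience helper for the demo: returns the first line of the given bytes. */
    static String firstLine(byte[] data) {
        try (BufferedReader reader = open(new ByteArrayInputStream(data))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'k', 'e', 'y'};
        System.out.println(firstLine(withBom).length());                                // 3
        System.out.println(firstLine("key".getBytes(StandardCharsets.UTF_8)).length()); // 3
    }
}
```

    For a real file, open(new FileInputStream(file)) would take the place of the one-liner above.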
    
  • 2020-12-21 00:06

    Try trimming whitespace at the beginning and end of lines read. Just replace your while with:

    while ((strLine = bufferedReader.readLine()) != null) {
        // Note: trim() only removes characters up to U+0020, so it will not strip a BOM (U+FEFF).
        strLine = strLine.trim();
        logger.info(Arrays.toString(strLine.toCharArray()));
        if (strLine.contentEquals(prevLine)) {
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }
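    Since trimming alone leaves a BOM in place, the loop above can be combined with an explicit BOM check on the first line. The following is a self-contained sketch that works on a list of lines rather than on the question's reader and logger. The class ConsecutiveDedup and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConsecutiveDedup {

    /** Trims each line, strips a BOM from the first one, and drops consecutive duplicates. */
    static List<String> dedupe(List<String> lines) {
        List<String> result = new ArrayList<>();
        String prevLine = null;
        boolean firstLine = true;
        for (String raw : lines) {
            String line = raw;
            if (firstLine && line.startsWith("\uFEFF")) {
                line = line.substring(1); // only the first line can carry the BOM
            }
            firstLine = false;
            line = line.trim();
            if (line.equals(prevLine)) {
                continue; // same content as the previous line: skip it
            }
            result.add(line);
            prevLine = line;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("\uFEFFfoo", "foo", "bar", "bar ", "foo");
        System.out.println(dedupe(input)); // [foo, bar, foo]
    }
}
```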
    
  • 2020-12-21 00:07

    I had a similar case in my previous project. The culprit was the Byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out, might be that you have the same problem.
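    The linked example is not reproduced here, so the following is only a guess at what such a workaround can look like, not the code that was actually used: a PushbackInputStream that reads the first three bytes and pushes them back unless they are the UTF-8 BOM. All names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

final class Utf8BomStripper {

    /** Returns a stream positioned after the UTF-8 BOM (EF BB BF) if the input starts with one. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // fewer than three bytes available
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: hand the bytes back to the caller
        }
        return pushback;
    }

    /** Demo helper: returns the bytes that remain after BOM skipping. */
    static byte[] strip(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    The returned stream can then be wrapped in an InputStreamReader with an explicit UTF-8 charset as in the other answers.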
