Curly quotes causing Java Scanner hasNextLine() to be false — why?

遥遥无期 2021-01-17 12:22

I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the

3 answers
  • 2021-01-17 12:24

    If you don't specify an encoding when you create the scanner, it will try to infer the encoding from a byte order mark (BOM), which is the first few bytes of a file. If the file doesn't have one, it will fall back to whatever default the OS uses. Since you're using Windows, the default is Cp1252 (windows-1252). It seems that Notepad is saving your text file using ISO-8859-1, which is similar to, but not the same as, Cp1252. See this link for more details:

    http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
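
To make the difference between the two charsets concrete, here is a minimal sketch that decodes the same two bytes both ways. The class name `QuoteBytesDemo` and the helper `decode` are made up for illustration; the byte values 0x93 and 0x94 are where windows-1252 places the curly double quotes, while ISO-8859-1 maps them to C1 control characters.

```java
import java.nio.charset.Charset;

final class QuoteBytesDemo {
    /** Decodes the given bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in windows-1252.
        byte[] quotes = {(byte) 0x93, (byte) 0x94};

        String asCp1252 = decode(quotes, "windows-1252");
        String asLatin1 = decode(quotes, "ISO-8859-1");

        // Print the code points rather than the characters themselves,
        // so the result does not depend on the console encoding.
        System.out.printf("windows-1252: U+%04X U+%04X%n",
                (int) asCp1252.charAt(0), (int) asCp1252.charAt(1));
        System.out.printf("ISO-8859-1:   U+%04X U+%04X%n",
                (int) asLatin1.charAt(0), (int) asLatin1.charAt(1));
    }
}
```

The first line reports U+201C and U+201D (the curly quotes); the second reports U+0093 and U+0094, which are control characters rather than punctuation.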

    When you save it as UTF-8, it probably places the UTF-8 BOM at the beginning of the file and the scanner can pick up on it.

    If you want to look into BOMs further, the Wikipedia article on the byte order mark is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)
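
If installing a hex editor is inconvenient, the same inspection can be roughly approximated from Java. This is only a sketch: `BomCheck`, `startsWithUtf8Bom` and `toHex` are hypothetical names, and `inputfile.txt` is a placeholder path. It prints the first bytes of a file in hex and reports whether they begin with the UTF-8 BOM (EF BB BF).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class BomCheck {
    /** Returns true if the bytes start with the UTF-8 BOM (EF BB BF). */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Formats bytes as space-separated hex, similar to one row of a hex editor. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "inputfile.txt");
        // Reading the whole file is fine for small text files.
        byte[] all = Files.readAllBytes(file);
        byte[] head = Arrays.copyOf(all, Math.min(16, all.length));
        System.out.println(toHex(head));
        System.out.println("UTF-8 BOM present: " + startsWithUtf8Bom(head));
    }
}
```

A byte such as 93 or 94 in the dump, with no BOM in front, would be consistent with a single-byte Windows encoding rather than UTF-8.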

  • 2021-01-17 12:45

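    Scanner's hasNextLine method simply returns false if it encounters an encoding error in the input file, without throwing any exception. This is frustrating and easy to miss in the documentation: the scanner treats the underlying IOException as end of input, and the swallowed exception is only exposed through its ioException() method.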
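
The silent stop can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions, not the asker's actual setup: it writes two lines containing curly quotes as windows-1252 bytes to a temporary file, then scans that file as UTF-8. `ScannerSilentStop` and `Result` are made-up names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

final class ScannerSilentStop {
    /** How many lines were read, plus the error the scanner swallowed (if any). */
    static final class Result {
        final int lines;
        final IOException error;

        Result(int lines, IOException error) {
            this.lines = lines;
            this.error = error;
        }
    }

    /** Writes windows-1252 text with curly quotes to a temp file, then scans it as UTF-8. */
    static Result scanCp1252FileAsUtf8() {
        try {
            Path file = Files.createTempFile("curly", ".txt");
            try {
                String text = "first line\n\u201Cquoted\u201D\n";
                Files.write(file, text.getBytes(Charset.forName("windows-1252")));

                int lines = 0;
                try (Scanner scanner = new Scanner(file.toFile(), "UTF-8")) {
                    // No exception escapes this loop; it just ends early.
                    while (scanner.hasNextLine()) {
                        scanner.nextLine();
                        lines++;
                    }
                    // The decoding failure is only visible here.
                    return new Result(lines, scanner.ioException());
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Result r = scanCp1252FileAsUtf8();
        System.out.println("lines read: " + r.lines + " of 2");
        System.out.println("swallowed error: " + r.error);
    }
}
```

Checking `ioException()` after the loop is a cheap way to tell a genuine end of file from a decoding failure.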

    If you just want to read a file line-by-line, use this instead:

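    // Needs java.io.BufferedReader, java.io.FileInputStream and java.io.InputStreamReader.
    // try-with-resources closes the reader even if readLine() throws.
    try (BufferedReader input = new BufferedReader(
            new InputStreamReader(new FileInputStream("inputfile.txt"), "inputencoding"))) {
        String line;
        while ((line = input.readLine()) != null) {
            // process line
        }
    }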
    

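    Make sure the inputencoding above is replaced with the correct encoding of the file. Most likely it is UTF-8 or US-ASCII. Even if the encoding doesn't match, the reader won't terminate prematurely like Scanner does: InputStreamReader replaces bytes it cannot decode with U+FFFD instead of failing.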
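
As a rough check of that behavior, the following self-contained sketch reads a windows-1252 file with the wrong charset (UTF-8) through an `InputStreamReader`. The class `LenientLineReader` and its methods are hypothetical names, and the file contents are invented for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class LenientLineReader {
    /** Reads all lines; undecodable bytes become U+FFFD instead of ending the read. */
    static List<String> readLines(Path file, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader input = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = input.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Writes windows-1252 text to a temp file and reads it back as UTF-8. */
    static List<String> demo() {
        try {
            Path file = Files.createTempFile("curly", ".txt");
            try {
                String text = "first line\n\u201Cquoted\u201D\n";
                Files.write(file, text.getBytes(Charset.forName("windows-1252")));
                return readLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

Both lines come back, but the curly quotes arrive as replacement characters, so the mismatch is survivable yet still lossy. Note that `Files.newBufferedReader` is stricter and throws on malformed input, which is why the sketch builds the reader by hand.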

  • 2021-01-17 12:47

    Some time ago I had a similar problem with a configuration file that was edited by the user. Because I never know which editor the user will use, I tried this:

    org.mozilla.universalchardet.UniversalDetector
    

    available from here:

    https://code.google.com/p/juniversalchardet/
    

    Detecting a character encoding is not a simple thing, so I can't be sure this library works under all conditions, but for me it was sufficient. Have a look; it may help you detect your encoding, which you can then pass to Scanner.
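
For comparison, here is a standard-library-only sketch of the "detect, then pass the charset to Scanner" idea. It is not juniversalchardet's algorithm and is far cruder: it assumes the file is either valid UTF-8 or windows-1252, tries a strict UTF-8 decode, and falls back to windows-1252 if that fails. `CharsetGuess`, `guess` and `readLines` are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class CharsetGuess {
    /** Returns UTF-8 if the bytes decode strictly as UTF-8, else windows-1252 (assumed fallback). */
    static Charset guess(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; assume a Western Windows "ANSI" file.
            return Charset.forName("windows-1252");
        }
    }

    /** Scans the bytes line by line using the guessed charset. */
    static List<String> readLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(new ByteArrayInputStream(data), guess(data).name())) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] cp1252 = "first line\n\u201Cquoted\u201D\n".getBytes(Charset.forName("windows-1252"));
        System.out.println("guessed: " + guess(cp1252));
        System.out.println("lines read: " + readLines(cp1252).size());
    }
}
```

A two-candidate guess like this only suits cases where the possible encodings are known in advance; a statistical detector such as the library above is the better fit when they are not.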
