I've been having an issue getting the java.util.Scanner to read a text file I saved in Notepad, even though it works fine with others. Basically, when it tries to read the
If you don't specify an encoding when you create the Scanner, it does not try to work out the encoding from the file. In particular, it does not look at a byte order mark (BOM), which is an optional marker in the first few bytes of a file. It simply uses the JVM's default charset, which before Java 18 follows the OS settings (from Java 18 on it is UTF-8). Since you're using Windows, that default is most likely windows-1252 (cp1252). It seems that Notepad is saving your text file using ISO-8859-1, which is similar to, but not the same as, cp1252. See this link for more details:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
When you save it as UTF-8, Notepad probably places the UTF-8 BOM at the beginning of the file. Java does not strip it for you (it shows up as the character U+FEFF at the start of the first line), so the reliable fix is to pass the file's encoding to the Scanner explicitly.
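To illustrate passing the encoding explicitly, here is a minimal sketch. The class name ExplicitCharsetScanner and the helper readLines are made up for this example, and the sample bytes stand in for a file so the snippet runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class, not part of any library.
class ExplicitCharsetScanner {
    /** Reads all lines from the stream, decoding bytes with the given charset. */
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        // Passing the charset name means the platform default is never consulted.
        try (Scanner scanner = new Scanner(in, charset.name())) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data: the byte 0xE9 is an accented "e" in ISO-8859-1.
        byte[] data = {'c', 'a', 'f', (byte) 0xE9, '\n', 'o', 'k', '\n'};
        System.out.println(readLines(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1));
    }
}
```

For a real file, the constructor new Scanner(new File("inputfile.txt"), "windows-1252") takes the charset name in the same way; the file name and charset there are placeholders.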
If you want to look more into BOMs, look it up on Wikipedia; the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)
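If you would rather check the first bytes from Java than from a hex editor, something like the following sketch should do. BomCheck, hasUtf8Bom and toHex are names invented for this example, and "inputfile.txt" is a placeholder path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of any library.
class BomCheck {
    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Formats up to 'limit' leading bytes as space-separated uppercase hex. */
    static String toHex(byte[] data, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(limit, data.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real file as the first argument.
        String path = args.length > 0 ? args[0] : "inputfile.txt";
        byte[] data = Files.readAllBytes(Paths.get(path));
        System.out.println("First bytes: " + toHex(data, 16));
        System.out.println("UTF-8 BOM present: " + hasUtf8Bom(data));
    }
}
```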
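Scanner's hasNextLine method will just return false if it encounters an encoding error in the input file, without throwing any exception. This is frustrating and easy to miss: the class documentation only says that an IOException from the underlying source is treated as the end of the input, and that the exception can be retrieved afterwards with ioException().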
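A minimal sketch of checking for that swallowed exception follows. ScannerErrorCheck and drainAndGetError are names made up for this example. To keep it self-contained it feeds the Scanner from an in-memory channel instead of a file; as far as I can tell from the JDK sources, channel and file sources both decode strictly, so the behaviour should match, but treat that as an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.util.Scanner;

// Hypothetical helper class, not part of any library.
class ScannerErrorCheck {
    /** Reads the scanner to its end, then returns the IOException it swallowed, or null. */
    static IOException drainAndGetError(Scanner scanner) {
        while (scanner.hasNextLine()) {
            scanner.nextLine(); // process line
        }
        // Non-null means the scanner stopped because of an error, not a real end of input.
        return scanner.ioException();
    }

    public static void main(String[] args) {
        // 0xE9 followed by '\n' is not valid UTF-8, so decoding fails here.
        byte[] data = {'c', 'a', 'f', (byte) 0xE9, '\n'};
        Scanner scanner = new Scanner(Channels.newChannel(new ByteArrayInputStream(data)), "UTF-8");
        IOException error = drainAndGetError(scanner);
        System.out.println(error == null ? "read to the end" : "stopped early: " + error);
    }
}
```

Calling ioException() after the read loop is cheap, so it is worth doing whenever a Scanner seems to stop early.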
If you just want to read a file line-by-line, use this instead:
import java.io.*;

// try-with-resources closes the reader even if reading throws
try (BufferedReader input = new BufferedReader(
        new InputStreamReader(new FileInputStream("inputfile.txt"), "inputencoding"))) {
    String line;
    while ((line = input.readLine()) != null) {
        // process line
    }
}
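Make sure the inputencoding above is replaced with the correct encoding of the file. Most likely it is UTF-8 or US-ASCII. Even if the encoding mismatches, it won't terminate prematurely like Scanner: InputStreamReader substitutes the replacement character U+FFFD for bytes it cannot decode instead of failing.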
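To see what a mismatch looks like with this approach, here is a small sketch that reads ISO-8859-1 bytes as UTF-8. LenientReaderDemo and its readLines helper are invented for this example, and the in-memory bytes stand in for the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class LenientReaderDemo {
    /** Reads all lines; undecodable bytes become U+FFFD instead of stopping the read. */
    static List<String> readLines(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader input = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            String line;
            while ((line = input.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes deliberately decoded with the wrong charset (UTF-8).
        byte[] data = {'c', 'a', 'f', (byte) 0xE9, '\n', 'o', 'k', '\n'};
        for (String line : readLines(data, StandardCharsets.UTF_8)) {
            boolean damaged = line.indexOf('\uFFFD') >= 0;
            System.out.println(line + (damaged ? "  <- contains U+FFFD, encoding is probably wrong" : ""));
        }
    }
}
```

Every line is still delivered, so scanning the result for U+FFFD is a simple way to notice that the chosen encoding was wrong.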
Some time ago I had a similar problem with a configuration file that was edited by the user. Because I never know what type of editor the user will use, I tried org.mozilla.universalchardet.UniversalDetector, available from here:
https://code.google.com/p/juniversalchardet/
Detecting a character encoding is not a simple thing, so I can't be sure this library works under all conditions, but for me it was sufficient. Have a look; maybe it will help you detect your encoding and then pass it to Scanner.
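If adding a library is not an option, a much cruder stand-in that uses only the JDK is to try a short list of candidate charsets and keep the first one that decodes the bytes without errors. This is not how juniversalchardet works (it uses statistical detection); CharsetGuess and firstThatDecodes are names made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class CharsetGuess {
    /** Returns the first candidate that decodes the data without errors, or null if none does. */
    static Charset firstThatDecodes(byte[] data, Charset... candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return candidate;
            } catch (CharacterCodingException e) {
                // This candidate cannot represent the bytes; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] data = {'c', 'a', 'f', (byte) 0xE9};
        // ISO-8859-1 accepts every byte sequence, so it only makes sense as the last candidate.
        Charset guess = firstThatDecodes(data, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("Guessed charset: " + guess);
    }
}
```

The returned charset's name() can then be passed to the Scanner constructor. Order matters: strict encodings such as UTF-8 should come first, because single-byte encodings accept almost anything.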