Question
I'm having a problem reading UTF-8 characters in my code (running on Eclipse).
I have a file, text, which has a few lines in it, for example:
אך 1234
NOTE: There is a \t before the word, and the word should appear on the left with the number on the right. I don't know how to reverse them here, sorry.
That is, a Hebrew word and then a number.
I need to separate the word from the number somehow. I tried this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
    String delims = "[ ]+";
    String[] tokens = content.split(delims);
}
The problem is that, for some reason, the code reads content (the first line in the file) as follows:
אך\t1234
...meaning that the space isn't in its correct place.
I suppose I could tokenize the text using the \t, but I'm not sure I should do it, as the file isn't being read correctly...
Does anyone have any idea why this happens?
Thanks so much :-)
Answer 1:
I think your pattern matches only spaces, while the separator in the line is actually a tab.
Can you try this:
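BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
    // \s matches any whitespace character (space, tab, ...);
    // the + treats a run of whitespace as a single delimiter
    String delims = "\\s+";
    String[] tokens = content.split(delims);
}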
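Since the file is UTF-8, it may also be worth setting the charset explicitly: FileReader, as used above, decodes with the platform default charset, which is not necessarily UTF-8 (this depends on the Java version and the OS settings). Below is a minimal sketch of that idea, not a definitive fix. The class name Utf8LineSplitter and the helper names tokenize and readTokens are made up for illustration, and it assumes each line holds a word and a number separated by spaces or tabs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class Utf8LineSplitter {

    // Splits one line on any run of whitespace (spaces or tabs).
    // trim() removes a leading tab first, so split() does not
    // return an empty first token.
    static String[] tokenize(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Reads the file as UTF-8 regardless of the platform default charset
    // and returns the tokens of every line.
    static List<String[]> readTokens(Path file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String content;
            while ((content = br.readLine()) != null) {
                rows.add(tokenize(content));
            }
        }
        return rows;
    }
}
```

For a line such as the one in the question, tokenize should yield two tokens, the Hebrew word and the number, whether they are separated by a space or a tab. Note that an editor or console may still display a mixed Hebrew/number line in a different visual order than the characters are stored in, so checking the tokens individually is more reliable than reading the printed line.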
Source: https://stackoverflow.com/questions/22290449/java-code-reads-utf-8-text-incorrectly