Tokenize problem in Java with separator “. ”

蓝咒 submitted on 2019-12-10 15:42:58

Question


I need to split a text using the separator ". ". For example, I want this string:

Washington is the U.S Capital. Barack is living there.

To be cut into two parts:

Washington is the U.S Capital. 
Barack is living there.

Here is my code:

// Initialize the tokenizer
StringTokenizer tokenizer = new StringTokenizer("Washington is the U.S Capital. Barack is living there.", ". ");
while (tokenizer.hasMoreTokens()) {
    System.out.println(tokenizer.nextToken());
}

Unfortunately, the output is:

Washington
is
the
U
S
Capital
Barack
is
living
there

Can someone explain what's going on?


Answer 1:


Don't use StringTokenizer; it's a legacy class. Use java.util.Scanner or simply String.split instead.

    String text = "Washington is the U.S Capital. Barack is living there.";
    String[] tokens = text.split("\\. ");
    for (String token : tokens) {
        System.out.println("[" + token + "]");
    }

This prints:

[Washington is the U.S Capital]
[Barack is living there.]
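
Note that the split above consumes the period of the first sentence. If the period should stay attached, as in the question's desired output, one possible variation is to split on a space that is preceded by a period, using a regex lookbehind. This is only a sketch; the class and method names are made up for illustration.

```java
public class KeepPeriodSplit {
    // Splits on a single space that is immediately preceded by a period.
    // The lookbehind (?<=\.) checks for the period without consuming it,
    // so each sentence keeps its trailing dot.
    static String[] splitKeepingPeriod(String text) {
        return text.split("(?<=\\.) ");
    }

    public static void main(String[] args) {
        String text = "Washington is the U.S Capital. Barack is living there.";
        for (String sentence : splitKeepingPeriod(text)) {
            System.out.println("[" + sentence + "]");
        }
        // prints:
        // [Washington is the U.S Capital.]
        // [Barack is living there.]
    }
}
```

The dot inside "U.S" is left alone because it is not followed by a space.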

Note that split and Scanner are "regex"-based (regular expressions), and since . is a special regex "meta-character", it needs to be escaped with \. In turn, since \ is itself an escape character for Java string literals, you need to write "\\. " as the delimiter.
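
To make the escaping point concrete, here is a small sketch comparing an unescaped delimiter, the hand-escaped one, and `Pattern.quote`, which builds a literal pattern from a string so no manual escaping is needed. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class EscapeDotDemo {
    // Unescaped: "." matches ANY character, so this splits on
    // "any character followed by a space".
    static String[] splitUnescaped(String text) {
        return text.split(". ");
    }

    // Escaped by hand: "\\. " in Java source is the regex \. (a literal dot)
    // followed by a space.
    static String[] splitEscaped(String text) {
        return text.split("\\. ");
    }

    // Pattern.quote treats the whole string as a literal, so the dot
    // loses its special meaning without manual escaping.
    static String[] splitQuoted(String text) {
        return text.split(Pattern.quote(". "));
    }

    public static void main(String[] args) {
        String text = "Washington is the U.S Capital. Barack is living there.";
        System.out.println(Arrays.toString(splitUnescaped(text)));
        System.out.println(Arrays.toString(splitEscaped(text)));
        System.out.println(Arrays.toString(splitQuoted(text)));
    }
}
```

With the unescaped delimiter, the last character of most words is swallowed along with the space (the first token comes out as "Washingto"), while the escaped and quoted versions both yield the two sentences.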

This may sound complicated, but it really isn't. split and Scanner are much superior to StringTokenizer, and regex isn't that hard to pick up.

Regular expressions tutorials

  • Java Lessons/Regular expressions
  • regular-expressions.info - Very good tutorial, not Java specific

Related questions

  • Scanner vs. StringTokenizer vs. String.Split

API Links

  • java.util.StringTokenizer
    • StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
  • java.util.Scanner
    • A simple text scanner which can parse primitive types and strings using regular expressions.
    • Java Tutorials - Basic I/O - Scanning and formatting
  • String[] String.split
    • Splits this string around matches of the given regular expression.
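
For completeness, the Scanner alternative mentioned at the top of this answer could look roughly like the sketch below. It assumes the same "\\. " regex delimiter as the split example; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerDelimiterDemo {
    // Reads tokens from the text, using the regex \. followed by a space
    // as the delimiter instead of Scanner's default whitespace delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("\\. ");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Washington is the U.S Capital. Barack is living there.";
        for (String token : tokenize(text)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For a string that is already in memory, split is usually the shorter option; Scanner tends to pay off when reading from a stream or file.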

But what went wrong?

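The problem is that StringTokenizer treats each character of the delimiter string as a separate delimiter; it does NOT treat the whole string as one separator. With ". " as the argument, the text is therefore split at every dot and at every space.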

From the API:

StringTokenizer(String str, String delim): Constructs a string tokenizer for the specified string. The characters in the delim argument are the delimiters for separating tokens. Delimiter characters themselves will not be treated as tokens.
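
One way to see this behavior is to compare StringTokenizer with a split on the character class "[. ]+" (one or more dots or spaces). For the question's input the two give the same tokens. The sketch below uses made-up class and method names, and the equivalence is only claimed for this input: split can return a leading empty string when the text starts with a delimiter, which StringTokenizer never does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class PerCharDelimiterDemo {
    // Collects all tokens produced by StringTokenizer for the given delimiters.
    static List<String> withTokenizer(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delims);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Splits on runs of dots and/or spaces, i.e. each character is a delimiter.
    static List<String> withCharClassSplit(String text) {
        return Arrays.asList(text.split("[. ]+"));
    }

    public static void main(String[] args) {
        String text = "Washington is the U.S Capital. Barack is living there.";
        System.out.println(withTokenizer(text, ". "));
        System.out.println(withCharClassSplit(text));
        // Both print:
        // [Washington, is, the, U, S, Capital, Barack, is, living, there]
    }
}
```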




Answer 2:


Your StringTokenizer constructor receives ". " as its delimiter argument, which makes both the dot and the space delimiters, so the text is split on either one.




Answer 3:


Try removing the space after the dot in the delimiter. Use this instead:

StringTokenizer tokenizer = new StringTokenizer("Washington is the U.S Capital. Barack is living there.", ".");
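
As a quick check of what a dot-only delimiter does with the question's input, here is a small sketch (class and method names are illustrative). It no longer splits on spaces, but it still splits at every dot, including the one inside "U.S", and the second sentence keeps its leading space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DotOnlyDelimiterDemo {
    // Tokenizes the text using "." as the only delimiter character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, ".");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Washington is the U.S Capital. Barack is living there.";
        for (String token : tokenize(text)) {
            System.out.println("[" + token + "]");
        }
        // prints:
        // [Washington is the U]
        // [S Capital]
        // [ Barack is living there]
    }
}
```

So this gets closer, but it does not produce the two sentences the question asks for.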



Answer 4:


  • StringTokenizer(String str): creates a StringTokenizer for the specified string, using the default delimiters (space, tab, newline, carriage return, and form feed).
  • StringTokenizer(String str, String delim): creates a StringTokenizer for the specified string; each character of delim is a delimiter.
  • StringTokenizer(String str, String delim, boolean returnDelims): creates a StringTokenizer for the specified string and delimiter characters, with returnDelims controlling whether delimiters are returned.

    If returnDelims is true, the delimiter characters are themselves returned as tokens, each as a string of length one. If it is false, they are skipped and only serve to separate tokens.
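
A short sketch of the three-argument constructor, with the boolean third argument (named returnDelims in the JDK documentation) set to true. The class name, method name, and sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReturnDelimsDemo {
    // Tokenizes the text with dot and space as delimiter characters.
    // When returnDelims is true, each delimiter character is also
    // returned as its own one-character token.
    static List<String> tokenize(String text, boolean returnDelims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, ". ", returnDelims);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Capital. Barack", false)); // [Capital, Barack]
        System.out.println(tokenize("Capital. Barack", true));  // [Capital, .,  , Barack]
    }
}
```

Even with the delimiters returned, the dot and the space still arrive as two separate tokens, so this flag alone does not make ". " act as a single separator.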



Source: https://stackoverflow.com/questions/2972199/tokenize-problem-in-java-with-separator
