String tokenization in Java (LARGE text)

Posted by 删除回忆录丶 on 2019-12-11 12:06:47

Question


I have this large text (read: LARGE). I need to tokenize every word, delimiting on every non-letter. I used StringTokenizer to read one word at a time. However, as I was researching how to write the delimiter string ("every non-letter") instead of doing something like:

new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]");

I found that everyone basically hates StringTokenizer (why?).

So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I need to go through the text word by word and delimit on every non-letter. Is it easier to build something on my own, or is there a best-practice way to approach this problem?

Thanks in advance!


Answer 1:


You can use the flexible string Splitter class from Google's Guava library.

If you need something more powerful, have a look at StandardTokenizer from Apache Lucene. From the docs:

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.



Answer 2:


As per the docs: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead." That pretty much sums up the StringTokenizer hate.
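
The java.util.regex route mentioned in that quote can walk the text one word at a time without building a full array of tokens. Here is a minimal sketch; the class name RegexWords and the pattern \p{L}+ (any run of Unicode letters) are my own assumptions for illustration, not something from the docs quote. Note that each m.group() call still allocates a small String for the current word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class RegexWords {
    // One or more Unicode letters; every other character acts as a delimiter.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Hands each word to the callback as soon as it is found.
    static void forEachWord(CharSequence text, Consumer<String> action) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            action.accept(m.group());
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        forEachWord("Hello, world! It's 2019", words::add);
        System.out.println(words); // prints [Hello, world, It, s]
    }
}
```

In real use the callback would process each word directly rather than collect them into a list; the list is only there to show the result.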

If memory is really a concern, you can just iterate over the string character-by-character and substring between delimiters, do your processing, then move on.
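
A rough sketch of that character-by-character approach might look like the following. The class and method names are made up for the example, and it assumes "letter" means whatever Character.isLetter(char) accepts, so letters outside the Basic Multilingual Plane (surrogate pairs) would not be recognised.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper class, named here only for illustration.
final class ManualWords {
    // Scans the text once and passes each run of letters to the callback.
    static void forEachWord(String text, Consumer<String> action) {
        int start = -1; // index where the current word began, or -1 when between words
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                if (start < 0) {
                    start = i; // first letter of a new word
                }
            } else if (start >= 0) {
                action.accept(text.substring(start, i)); // a delimiter ends the word
                start = -1;
            }
        }
        if (start >= 0) {
            action.accept(text.substring(start)); // word that runs to the end of the text
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        forEachWord("one,two  three4four", words::add);
        System.out.println(words); // prints [one, two, three, four]
    }
}
```

Only one word-sized substring exists at a time, so the extra memory stays small no matter how large the text is.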




Answer 3:


If your grammar is complex and your file is large, you can consider using JavaCC.

When I'm in your situation I use it.




Answer 4:


Scanner reads word by word (or line by line), and it can be used on a large file (or input stream).

A regex Pattern can match whitespace and many other things (look at the Pattern documentation, where you can find character classes like \p{..}).
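
For example, something along these lines should work. It is only a sketch: the class name ScannerWords and the delimiter "[^\\p{L}]+" (one or more non-letters) are my assumptions, and a StringReader stands in for the real file or stream.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical helper class, named here only for illustration.
final class ScannerWords {
    // Reads words lazily from any Readable (a Reader over a file, a stream, ...).
    static void forEachWord(Readable source, Consumer<String> action) {
        // Closing the Scanner also closes the source if it is Closeable.
        try (Scanner sc = new Scanner(source)) {
            // Treat every run of non-letter characters as a single delimiter.
            sc.useDelimiter("[^\\p{L}]+");
            while (sc.hasNext()) {
                action.accept(sc.next());
            }
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        forEachWord(new StringReader("alpha, beta; gamma!"), words::add);
        System.out.println(words); // prints [alpha, beta, gamma]
    }
}
```

Because Scanner pulls from the source in chunks, the whole text does not have to sit in memory as one String.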




Answer 5:


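I was never a fan of regex, but I can't see anything wrong with just using "[^a-zA-Z]" as the delimiter. (Keep in mind that StringTokenizer itself reads its delimiter argument as a plain set of characters, not as a regex, so the pattern has to go through java.util.regex or Scanner.)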
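
One thing worth checking before relying on this: StringTokenizer interprets its second argument as a set of delimiter characters, not as a regular expression, so "[^a-zA-Z]" means "split on any of [ ^ a - z A Z ]". The small sketch below (the class name DelimiterCheck is made up for the example) shows the effect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named here only for illustration.
final class DelimiterCheck {
    // Collects the tokens StringTokenizer produces for the given delimiter string.
    static List<String> tokens(String text, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // No character of "hello world" is in the delimiter set, so it is not split at all.
        System.out.println(tokens("hello world", "[^a-zA-Z]")); // prints [hello world]
        // 'a' is in the delimiter set, so the word is broken at every 'a'.
        System.out.println(tokens("banana", "[^a-zA-Z]"));      // prints [b, n, n]
    }
}
```

The same pattern does behave as intended when passed to java.util.regex or to Scanner.useDelimiter.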



Source: https://stackoverflow.com/questions/10052882/string-tokenization-in-java-large-text
