Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

I am new to Solr. After reading Solr's wiki, I still don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is the real difference between them?

They differ in how they split the analyzed text into tokens.

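The StandardTokenizer does this based on the following rules (taken from the Lucene Javadoc of older releases; since Lucene 3.1 this behavior lives in ClassicTokenizer, and StandardTokenizer follows the Unicode word-break rules of UAX #29 instead):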

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.

The WhitespaceTokenizer does this based on whitespace characters:

A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens.
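To make the contrast concrete, here is a minimal sketch in plain Python. It does not call Lucene or Solr: `whitespace_tokens` and `punctuation_tokens` are hypothetical helper names, and the regular expression is only a rough stand-in for punctuation-aware splitting, not the actual StandardTokenizer grammar (it ignores the hyphen, email and hostname rules quoted above). Python's notion of whitespace may also differ slightly from Java's.

```python
import re

def whitespace_tokens(text):
    # Split on runs of whitespace only; punctuation stays attached to tokens.
    return text.split()

def punctuation_tokens(text):
    # Rough approximation (an assumption, not Lucene's grammar):
    # keep only runs of ASCII letters and digits, dropping punctuation.
    return re.findall(r"[A-Za-z0-9]+", text)

sample = "Hello, world! Solr rocks."

ws = whitespace_tokens(sample)    # ['Hello,', 'world!', 'Solr', 'rocks.']
pt = punctuation_tokens(sample)   # ['Hello', 'world', 'Solr', 'rocks']

print(ws)
print(pt)
```

With whitespace splitting, a search for `world` would not match the indexed token `world!` unless a later filter strips the punctuation.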

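You should pick the tokenizer that best fits your application. Whichever you choose, make sure indexing and searching use the same (or at least compatible) analyzer and tokenizer; otherwise the query terms will not match the terms stored in the index.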
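The following sketch shows why mismatched analysis hurts, again in plain Python rather than Solr. The helpers `whitespace_tokens`, `punctuation_tokens` and `matches` are hypothetical names introduced for illustration, and the "index" is just a set of tokens, which is a strong simplification of what Lucene actually stores.

```python
import re

def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays attached.
    return text.split()

def punctuation_tokens(text):
    # Simplified stand-in for a punctuation-aware tokenizer (assumption).
    return re.findall(r"[A-Za-z0-9]+", text)

def matches(index_tokenizer, query_tokenizer, document, query):
    # A document matches if every query token is among its indexed tokens.
    indexed = set(index_tokenizer(document))
    return all(term in indexed for term in query_tokenizer(query))

document = "Order shipped, thanks!"
query = "shipped"

# Same tokenizer on both sides: the query term is found.
same = matches(punctuation_tokens, punctuation_tokens, document, query)

# Whitespace at index time, punctuation-aware at query time:
# the index holds "shipped," so the query term "shipped" is missed.
mixed = matches(whitespace_tokens, punctuation_tokens, document, query)

print(same, mixed)
```

In a real schema the same effect can be avoided by defining one analyzer for the field type, or by making sure the index-time and query-time chains normalize text the same way.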

Source: https://stackoverflow.com/questions/11183017/difference-between-whitespacetokenizerfactory-and-standardtokenizerfactory

Tags: solr, tokenize
