Solr date field tdate vs date?

Backend · Unresolved · 3 answers · 423 views

闹比i 2021-02-01 15:09

So I have a question about Solr's date field types which is pretty straightforward: what's the difference between a 'date' field and a 'tdate' one?

The schema.xml
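For context, in the Solr 1.4 example schema.xml the two types are declared roughly as follows (quoted from memory, so treat the exact class names and attributes as assumptions rather than gospel):

```xml
<!-- Plain date field: one token per value, no precomputed range terms -->
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<!-- Trie-based date field: precisionStep controls how many extra
     range tokens get indexed for each value -->
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
           precisionStep="6" positionIncrementGap="0"/>
```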

3 Answers
  • 2021-02-01 16:00

    Your best bet is to just look at the source code. Some things in Solr aren't well documented, and the fastest way to get a trustworthy answer is to simply read the code. If you haven't been in the code yet, doing so will benefit you too, at least in the long run.

    Here's a link to the TrieTokenizerFactory.

    http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/analysis/TrieTokenizerFactory.java?format=ok

    The javadoc in the class at least hints at the purpose of the precisionStep. You could dig further.

    EDIT: I dug a bit further for you. It's passed off directly to Lucene's NumericTokenStream class, which uses the value when parsing the token stream. Probably worth closer examination. It seems to deal with granularity and is probably a tradeoff between index size and speed.

  • 2021-02-01 16:05

    Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.

    Let's say we index the number 12345678. We can tokenize this into the following tokens.

    12345678
    123456xx
    1234xxxx
    12xxxxxx
    

    The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.

    Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.

    12345678
    12345xxx
    12xxxxxx
    

    A precision step of 4:

    12345678
    1234xxxx
    

    A precision step of 1:

    12345678
    1234567x
    123456xx
    12345xxx
    1234xxxx
    123xxxxx
    12xxxxxx
    1xxxxxxx
    

    It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
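The tokenization above can be sketched in a few lines of Python (a decimal illustration only; real trie fields shift bits rather than digits, and `trie_tokens` is a name I made up):

```python
def trie_tokens(value, precision_step, width=8):
    """Decimal trie tokens for `value`: each extra token trims
    `precision_step` more trailing digits."""
    s = str(value).zfill(width)
    tokens = [s]
    trimmed = precision_step
    while trimmed < width:  # keep at least one real leading digit
        tokens.append(s[:width - trimmed] + "x" * trimmed)
        trimmed += precision_step
    return tokens

print(trie_tokens(12345678, 2))
# ['12345678', '123456xx', '1234xxxx', '12xxxxxx']
```

Running it with steps 1, 3, and 4 reproduces the other token lists above, and shows directly how a smaller step emits more tokens per value.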

    Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 25 entries (1250, 1251, 1252, ..., 1275) and combine search results. With a trie field (and precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 25 individual entries.
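To make that counting concrete, here is a sketch of the range decomposition in decimal (the split strategy loosely mirrors what Lucene's NumericRangeQuery does on bits, but `split_range` is a hypothetical helper, not the real API):

```python
def split_range(lo, hi, base=10):
    """Cover [lo, hi] with trie tokens (precision step = one decimal digit):
    peel off single values until both ends align on a full block of `base`,
    then move up one level with one digit trimmed."""
    tokens, level = [], 0
    while lo <= hi:
        while lo <= hi and lo % base != 0:       # unaligned low end
            tokens.append(str(lo) + "x" * level)
            lo += 1
        while lo <= hi and (hi + 1) % base != 0:  # unaligned high end
            tokens.append(str(hi) + "x" * level)
            hi -= 1
        if lo > hi:
            break
        lo //= base   # recurse one level up: each value now
        hi //= base   # stands for a block of `base` values
        level += 1
    return tokens

print(sorted(split_range(1250, 1275)))
# ['125x', '126x', '1270', '1271', '1272', '1273', '1274', '1275']
```

Eight tokens instead of twenty-five, exactly as described above.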

    Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.
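A bit-level sketch of what that means, loosely mirroring how Lucene's NumericTokenStream drops precisionStep low bits per token (the helper is mine; the real tokens also encode the shift amount inside the term itself):

```python
def numeric_trie_tokens(value, precision_step, bits=32):
    """(shift, prefix) pairs: each token drops `precision_step` more low bits."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

# With a precision step of 4 bits, each token trims one hex digit:
for shift, prefix in numeric_trie_tokens(0x12345678, 4)[:3]:
    print(f"shift={shift:>2}  prefix=0x{prefix:x}")
# shift= 0  prefix=0x12345678
# shift= 4  prefix=0x1234567
# shift= 8  prefix=0x123456
```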

  • 2021-02-01 16:07

    Basically trie ranges are faster. Here is one explanation. With precisionStep you configure how much your index is allowed to grow in order to get the performance benefits. To quote from the link you are referring to:

    More importantly, it is not dependent on the index size, but instead the precision chosen.

    and

    the only drawbacks of TrieRange are a little bit larger index sizes, because of the additional terms indexed
