How does Lucene process dots ('.') in StringField? (Issue indexing and searching file names)

Posted by 风格不统一 on 2021-01-28 08:12:13

Question


I have a simple question that I could not answer by searching around or by reading other questions. I am indexing a field that contains a filename with the following code:

doc.add(new TextField(FIELD_FILENAME, filename, Field.Store.YES));

If I index hello.jpg and then search with the key 'hello.jpg', the entry is hit (so far so good). However, if I search with 'hello', I get no hits. If I replace '.' with another punctuation character while indexing, it works. If I escape the '.', it works as well (e.g. after indexing "hello\.jpg", I find it when searching for 'hello').

How does Lucene process dots? Should I expect the same issue with other characters?

Thanks a lot in advance, Stefano


Answer 1:


Everything depends on the analyzer you use, because the analyzer defines which tokenizer is used. The tokenizer is responsible for token extraction, which in the simplest case is similar to defining word boundaries.

Given the behavior you describe, I guess you're using the StandardAnalyzer. It uses the StandardTokenizer, which implements the word-break rules of Unicode Text Segmentation (UAX #29). That document states the following:

For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers.

In UAX #29, the full stop character has the MidNumLet word-break property value, and your specific case is handled by the WB6 and WB7 rules:

(ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter)
(ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter)

The × symbol means: no word break allowed here.

Put more simply: do not allow word breaks before or after a punctuation character if said character is immediately preceded and followed by a letter.
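To make the rule concrete, here is a small plain-Java sketch of WB6/WB7 only. It is not Lucene code and not the real StandardTokenizer: the class name DotRuleSketch and its helpers are hypothetical, and the "mid-letter" set is reduced to '.' and the apostrophe instead of the full Unicode property tables.

```java
import java.util.ArrayList;
import java.util.List;

public class DotRuleSketch {

    // Simplified stand-in for MidLetter | MidNumLet | Single_Quote:
    // only '.' and the apostrophe, not the full Unicode property tables.
    static boolean isMidLetterLike(char c) {
        return c == '.' || c == '\'';
    }

    // Splits on every character that is not a letter or digit, except a
    // mid-letter-like character with a letter directly on both sides.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letterBefore = i > 0 && Character.isLetter(text.charAt(i - 1));
            boolean letterAfter = i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (isMidLetterLike(c) && letterBefore && letterAfter) {
                // WB6/WB7 analogue: no break before or after this character
                current.append(c);
            } else if (current.length() > 0) {
                // any other character ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello.jpg"));    // [hello.jpg]
        System.out.println(tokenize("hello-jpg"));    // [hello, jpg]
        System.out.println(tokenize("hello\\.jpg"));  // [hello, jpg]
    }
}
```

In this sketch the dot in "hello.jpg" sits between two letters, so the whole filename stays one token and a search for 'hello' has nothing to match. With a backslash in front of the dot, the dot is no longer preceded by a letter, so the text splits into "hello" and "jpg", which is consistent with what you observed.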

The StandardTokenizer grammar implements these rules:

// UAX#29 WB5.   (ALetter | Hebrew_Letter) × (ALetter | Hebrew_Letter)
//        WB6.   (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter)
//        WB7.   (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter)
//        WB7a.  Hebrew_Letter × Single_Quote
//        WB7b.  Hebrew_Letter × Double_Quote Hebrew_Letter
//        WB7c.  Hebrew_Letter Double_Quote × Hebrew_Letter
//        WB9.   (ALetter | Hebrew_Letter) × Numeric
//        WB10.  Numeric × (ALetter | Hebrew_Letter)
//        WB13.  Katakana × Katakana
//        WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet
//        WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana) 
//
{ExtendNumLetEx}*  ( {KatakanaEx}          ( {ExtendNumLetEx}*   {KatakanaEx}                           )*
                   | ( {HebrewLetterEx}    ( {SingleQuoteEx}     | {DoubleQuoteEx}  {HebrewLetterEx}    )
                     | {NumericEx}         ( ( {ExtendNumLetEx}* | {MidNumericEx} ) {NumericEx}         )*
                     | {HebrewOrALetterEx} ( ( {ExtendNumLetEx}* | {MidLetterEx}  ) {HebrewOrALetterEx} )*
                     )+
                   )
({ExtendNumLetEx}+ ( {KatakanaEx}          ( {ExtendNumLetEx}*   {KatakanaEx}                           )*
                   | ( {HebrewLetterEx}    ( {SingleQuoteEx}     | {DoubleQuoteEx}  {HebrewLetterEx}    )
                     | {NumericEx}         ( ( {ExtendNumLetEx}* | {MidNumericEx} ) {NumericEx}         )*
                     | {HebrewOrALetterEx} ( ( {ExtendNumLetEx}* | {MidLetterEx}  ) {HebrewOrALetterEx} )*
                     )+
                   )
)*
{ExtendNumLetEx}* 
  { return WORD_TYPE; }

In conclusion:
If you need different behavior, you have to use a different analyzer that suits your goal better. I guess something like LetterTokenizer wouldn't fit, but you can create your own tokenizer based on CharTokenizer to implement your own rules.
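CharTokenizer decides character by character through its isTokenChar(int) method, and a token is a maximal run of accepted characters. The plain-Java sketch below mimics only that idea without depending on Lucene; PredicateTokenizerSketch and its tokenize helper are hypothetical names, and the two predicates are just example rules you might start from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PredicateTokenizerSketch {

    // Hypothetical helper: emits maximal runs of code points accepted by
    // isTokenChar, in the spirit of CharTokenizer.isTokenChar(int).
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // a rejected character closes the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        IntPredicate lettersOnly = Character::isLetter;            // letters-only rule
        IntPredicate lettersOrDigits = Character::isLetterOrDigit; // keeps digits as well

        System.out.println(tokenize("photo2021.jpg", lettersOnly));     // [photo, jpg]
        System.out.println(tokenize("photo2021.jpg", lettersOrDigits)); // [photo2021, jpg]
    }
}
```

A letters-only rule drops the digits from the filename, which may be one reason it would not suit file names. A letters-or-digits rule keeps "photo2021" intact and still splits at the dot. Whichever rule you pick, the same analyzer should be used at index time and at query time.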



Source: https://stackoverflow.com/questions/26438024/how-does-lucene-process-dots-in-stringfield-issue-indexing-and-searching
