Using a Combination of Wildcards and Stemming

闹比i · 2020-12-30 09:38

I'm using a Snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks.

Example:

A search for \"valv\", \

4 Answers
  • 2020-12-30 10:03

    This is the simplest solution, and it works:

    Add solr.KeywordRepeatFilterFactory to your 'index' analyzer.

    http://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html

    Also add RemoveDuplicatesTokenFilterFactory at the end of the 'index' analyzer, so that tokens whose stem is identical to the original are not indexed twice.

    Now your index will always contain both the stemmed and the non-stemmed form of each token at the same position, and you are good to go.
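
    For reference, a minimal sketch of what such a field type might look like in schema.xml (the field-type name, the StandardTokenizer and the Snowball language are assumptions, not something given in the question):

    <fieldType name="text_stem_keep_original" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emit each token twice at the same position: one copy protected from stemming, one to be stemmed -->
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- drop the duplicate when the stem equals the original -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>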

  • 2020-12-30 10:10

    The only other idea I have beyond the other answers is to use dismax against the two fields, so you can simply set the relative weights of the two fields. The only caveat is that some versions of dismax didn't handle wildcards, and the dismax parsers are Solr-specific.
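
    For example, a hedged sketch of the request parameters (the field names title_stem and title_exact are placeholders, and edismax is assumed because it handles wildcard syntax):

    q=valv*&defType=edismax&qf=title_stem^1 title_exact^4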

  • 2020-12-30 10:26

    I don't think there is an easy (and correct) way to do this.

    My solution would be to write a custom query parser that finds the longest prefix of your search term that is actually shared with terms in the index.

    // Requires the Lucene.Net.Analysis, Lucene.Net.Index, Lucene.Net.QueryParsers
    // and Lucene.Net.Search namespaces.
    class MyQueryParser : Lucene.Net.QueryParsers.QueryParser
    {
        IndexReader _reader;
        Analyzer _analyzer;

        public MyQueryParser(string field, Analyzer analyzer, IndexReader indexReader) : base(field, analyzer)
        {
            _analyzer = analyzer;
            _reader = indexReader;
        }

        public override Query GetPrefixQuery(string field, string termStr)
        {
            // Shorten the prefix one character at a time until some indexed term starts with it.
            for (string longestStr = termStr; longestStr.Length > 2; longestStr = longestStr.Substring(0, longestStr.Length - 1))
            {
                // Terms() positions the enumerator at the first term >= (field, longestStr).
                TermEnum te = _reader.Terms(new Term(field, longestStr));
                Term term = te.Term();
                te.Close();
                if (term != null && term.Field() == field && term.Text().StartsWith(longestStr))
                {
                    return base.GetPrefixQuery(field, longestStr);
                }
            }

            // No shorter prefix matched; fall back to the prefix the user typed.
            return base.GetPrefixQuery(field, termStr);
        }
    }
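
    A rough usage sketch (untested; it assumes the same older Lucene.Net API as the parser above, a contrib SnowballAnalyzer, and placeholder directory/field names):

    var reader = IndexReader.Open(FSDirectory.Open(new System.IO.DirectoryInfo("index_dir")), true);
    var analyzer = new SnowballAnalyzer("English");
    var parser = new MyQueryParser("title", analyzer, reader);
    Query query = parser.Parse("valve*"); // e.g. "valve" is shortened to the indexed stem "valv"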
    

    You can also try to run the term through your analyzer inside GetPrefixQuery, since the analyzer is normally not applied to prefix queries:

    // Run the prefix through the same analyzer that was used at index time
    // (StringReader comes from System.IO).
    TokenStream ts = _analyzer.TokenStream(field, new StringReader(termStr));
    Lucene.Net.Analysis.Token token = ts.Next();
    ts.Close();
    // Next() returns null if the analyzer produced no token for the input.
    string termstring = token != null ? token.TermText() : termStr;
    return base.GetPrefixQuery(field, termstring);
    

    But be aware that you can always find cases where the returned results are not correct. This is why Lucene doesn't take analyzers into account when using wildcards.

  • 2020-12-30 10:28

    I have used two different approaches to solve this before:

    1. Use two fields: one that contains stemmed terms, the other containing terms generated by, say, the StandardAnalyzer. When you parse the search query, run wildcard searches against the "standard" field and everything else against the field with stemmed terms. This may be harder to do if you let users type their queries directly into Lucene's QueryParser.

    2. Write a custom analyzer that indexes overlapping tokens. It basically consists of indexing the original term and its stem at the same position in the index, using PositionIncrementAttribute. You can look at SynonymFilter for an example of how to use PositionIncrementAttribute correctly; a rough sketch is also shown below.

    I prefer solution #2.
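
    As a sketch only (assuming the Lucene.Net 3.0.x attribute API rather than the older API used in the other answer; the class name is made up and the actual stemming is injected as a delegate to keep it self-contained), solution #2 could look roughly like this:

    using System;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Tokenattributes;

    // Emits each original token and, when the stem differs, the stem at the
    // same position (position increment 0), so both forms end up in the index.
    public sealed class OverlappingStemFilter : TokenFilter
    {
        private readonly Func<string, string> _stem;
        private readonly ITermAttribute _termAtt;
        private readonly IPositionIncrementAttribute _posIncrAtt;
        private string _pendingStem; // stem waiting to be emitted on the next call

        public OverlappingStemFilter(TokenStream input, Func<string, string> stem) : base(input)
        {
            _stem = stem;
            _termAtt = AddAttribute<ITermAttribute>();
            _posIncrAtt = AddAttribute<IPositionIncrementAttribute>();
        }

        public override bool IncrementToken()
        {
            if (_pendingStem != null)
            {
                // Overlay the stem on the token that was just returned.
                _termAtt.SetTermBuffer(_pendingStem);
                _posIncrAtt.PositionIncrement = 0;
                _pendingStem = null;
                return true;
            }

            if (!input.IncrementToken())
                return false;

            string original = _termAtt.Term;
            string stemmed = _stem(original);
            if (stemmed != original)
                _pendingStem = stemmed; // queue the stem for the same position

            return true;
        }

        public override void Reset()
        {
            base.Reset();
            _pendingStem = null;
        }
    }

    In the index-time analyzer this filter would sit right after the tokenizer (and any lowercase filter), while the query-time analyzer keeps stemming as before; wildcard queries then match the unstemmed forms that are now in the index.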
