Lucene wildcard matching fails on chemical notations(?)

有些话、适合烂在心里 submitted on 2019-12-24 01:16:06

Question


Using Hibernate Search annotations (mostly just @Field(index = Index.TOKENIZED)), I've indexed a number of fields related to a persisted class of mine called Compound. I've set up text search over all the indexed fields using the MultiFieldQueryParser, which has so far worked fine.

Among the fields indexed and searchable is a field called compoundName, with sample values:

  • 3-Hydroxyflavone
  • 6,4'-Dihydroxyflavone

When I search for either of these values in full, the related Compound instances are returned. However, problems occur when I use a partial name and introduce wildcards:

  • searching for 3-Hydroxyflav* still gives the correct hit, but
  • searching for 6,4'-Dihydroxyflav* fails to find anything.

As I'm quite new to Lucene / Hibernate Search, I'm not sure where to look at this point. I think it might have something to do with the ' present in the second query, but I don't know how to proceed. Should I look into Tokenizers / Analyzers / QueryParsers, or something else entirely?

Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?

I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.


Some relevant code bits to illustrate my current approach:

Relevant parts of the indexed Compound class:

@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
    @NaturalId
    @NotEmpty
    @Length(max = 30)
    @Field(index = Index.TOKENIZED)
    private String                  inchikey;

    @ManyToOne
    @IndexedEmbedded
    private ChemicalClass           chemicalClass;

    @Field(index = Index.TOKENIZED)
    private String                  commonName;
...
}

How I currently search over the indexed fields:

String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = 
    new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery = 
    fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();

Answer 1:


I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes it. To find out, I recommend you inspect your index using the Lucene index tool Luke.

Since you are not using a custom analyzer in your Hibernate Search configuration, the default, StandardAnalyzer, is used. This is consistent with your use of StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That is the first thing you have to find out. For example, the javadoc says:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

It might be that you need to write your own analyzer which tokenizes your chemical names the way you need it for your use cases.

Next, the query parser. Make sure you understand the Lucene query syntax. Some characters have a special meaning, for example '-'. It could be that your query is parsed the wrong way.

Either way, the first step is to find out how your chemical names get tokenized. Hope that helps.
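On the query-syntax point, one way to rule out parsing surprises is to backslash-escape the special characters before handing a term to the parser. The following is a minimal plain-Java sketch, not Lucene API: the class and method names are hypothetical, and the character list is an assumption based on the Lucene query syntax documentation. It deliberately leaves `*` and `?` unescaped so that wildcards keep working (Lucene's own `QueryParser.escape` also escapes those two, which would turn the wildcard into a literal character).

```java
class QueryEscapeSketch {
    // Assumed set of query-syntax special characters, minus the wildcards '*' and '?'.
    private static final String SPECIAL = "\\+-!():^[]\"{}~|&";

    // Prefix every special character with a backslash; wildcards are left untouched.
    static String escapeKeepingWildcards(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The hyphen is escaped, the trailing wildcard is not.
        System.out.println(escapeKeepingWildcards("6,4'-Dihydroxyflav*"));
    }
}
```

Escaping only helps with how the query string is parsed. It does not change how the names were tokenized at index time, which still has to be checked separately.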




Answer 2:


Use WhitespaceAnalyzer instead of StandardAnalyzer. It splits only at whitespace, not at commas, hyphens, etc. (It will not lowercase the tokens, though, so you will need to build your own chain of whitespace tokenizer + lowercase filter, assuming you want your search to be case-insensitive.) If you need to do things differently for different fields, you can use a PerFieldAnalyzerWrapper.

You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.
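To see why whitespace-only tokenization helps with the wildcard search, here is a small plain-Java simulation of the "whitespace + lowercase" chain. It uses no Lucene classes; the class and method names are hypothetical. It assumes that a trailing-wildcard query is lowercased and then matched as a prefix against the indexed tokens without being analyzed, which, as far as I know, is the query parser's default behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WhitespaceLowercaseSketch {
    // Split at whitespace only, then lowercase each token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Simulate a query with an optional trailing '*': lowercase it,
    // then require an exact token match or a prefix match.
    static boolean prefixMatches(List<String> indexedTokens, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        if (!q.endsWith("*")) {
            return indexedTokens.contains(q);
        }
        String prefix = q.substring(0, q.length() - 1);
        for (String token : indexedTokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> indexed = tokens("6,4'-Dihydroxyflavone");
        System.out.println(indexed);
        System.out.println(prefixMatches(indexed, "6,4'-Dihydroxyflav*"));
        System.out.println(prefixMatches(indexed, "Dihydroxyflav*"));
    }
}
```

The trade-off is visible in the last call: because the whole name stays one token, a search for Dihydroxyflav* no longer matches, since the token starts with 6,4'-.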




Answer 3:


I wrote my own analyzer:

import java.util.Set;
import java.util.regex.Pattern;

import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.util.Version;

public class ChemicalNameAnalyzer extends PatternAnalyzer {

    private static Version version = Version.LUCENE_29;
    private static Pattern pattern = compilePattern();
    private static boolean toLowerCase = true;
    private static Set stopWords = null;

    public ChemicalNameAnalyzer(){
        super(version, pattern, toLowerCase, stopWords);
    }

    public static Pattern compilePattern() {
        StringBuilder sb =  new StringBuilder();
        sb.append("(-{0,1}\\(-{0,1})");//Matches an optional dash followed by an opening round bracket followed by an optional dash  
        sb.append("|");//"OR" (regex alternation)
        sb.append("(-{0,1}\\)-{0,1})");//Matches an optional dash followed by a closing round bracket followed by an optional dash
        sb.append("|");//"OR" (regex alternation)
        sb.append("((?<=([a-zA-Z]{2,}))-(?=([^a-zA-Z])))");//Matches a dash ("-") preceded by two or more letters and succeeded by a non-letter
        return Pattern.compile(sb.toString());
    }
}
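To check what this split pattern does to a given name without building an index, it can be tried with plain java.util.regex. The sketch below is an approximation, not the analyzer itself: it assumes PatternAnalyzer behaves like String.split followed by lowercasing and drops empty pieces. The class name and the inputs other than the flavone names from the question are illustrative. The look-behind uses {2} instead of {2,}; it matches the same positions (a dash preceded by at least two letters is a dash preceded by exactly two letters) while keeping the look-behind bounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class ChemicalNameSplitSketch {
    // Same three alternatives as compilePattern() above, written compactly.
    static final Pattern SPLIT = Pattern.compile(
            "(-?\\(-?)"                           // optional dash, '(', optional dash
            + "|(-?\\)-?)"                        // optional dash, ')', optional dash
            + "|((?<=[a-zA-Z]{2})-(?=[^a-zA-Z]))" // dash between 2+ letters and a non-letter
    );

    // Split on the pattern, drop empty pieces, lowercase the rest.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : SPLIT.split(text)) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("6,4'-Dihydroxyflavone")); // stays one token
        System.out.println(tokens("Quercetin-3-glucoside")); // split at the first dash only
        System.out.println(tokens("(-)-Epicatechin"));       // brackets and their dashes removed
    }
}
```

With this pattern, 6,4'-Dihydroxyflavone stays a single lowercase token, so a prefix search such as 6,4'-Dihydroxyflav* has a token to match against.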


Source: https://stackoverflow.com/questions/3779411/lucene-wildcard-matching-fails-on-chemical-notations
