Searching names with Apache Solr

抹茶落季 2020-12-12 21:20

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by

5 Answers
  • 2020-12-12 21:42

    The answer in another post is pretty good: Training solr to recognize nicknames or name variants

    <fieldType name="name_en" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>
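
    To make the query-time synonym step more concrete, here is a minimal Java sketch of what expanding a name through a synonym file amounts to. It is a toy model, not Solr's implementation. The file contents shown (william/bill, robert/bob) and the class name NicknameSynonyms are assumptions for illustration; only the comma-separated group format and the expand behaviour mirror the config above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of query-time synonym expansion (hypothetical helper, not part of Solr). */
final class NicknameSynonyms {
    private final Map<String, Set<String>> groups = new HashMap<>();

    /** Each line is a comma-separated group of equivalent names. */
    NicknameSynonyms(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blanks and comments; explicit "a => b" mappings are not modelled here.
            if (trimmed.isEmpty() || trimmed.startsWith("#") || trimmed.contains("=>")) {
                continue;
            }
            Set<String> group = new LinkedHashSet<>();
            for (String term : trimmed.split(",")) {
                String t = term.trim().toLowerCase(Locale.ROOT); // like ignoreCase="true"
                if (!t.isEmpty()) {
                    group.add(t);
                }
            }
            // Like expand="true": every member of a group maps to the whole group.
            for (String term : group) {
                groups.computeIfAbsent(term, k -> new LinkedHashSet<>()).addAll(group);
            }
        }
    }

    /** Returns the token followed by every name grouped with it. */
    List<String> expand(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        result.add(key);
        result.addAll(groups.getOrDefault(key, Collections.<String>emptySet()));
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        NicknameSynonyms syn = new NicknameSynonyms(Arrays.asList(
                "# assumed contents of english_names.txt",
                "william, bill, will, billy",
                "robert, bob, rob, bobby"));
        System.out.println(syn.expand("Bill")); // prints [bill, william, will, billy]
    }
}
```

    With such a file in place, a search for "Bill" also reaches documents indexed as "William", which is the effect the query analyzer above is after.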
    
  • 2020-12-12 21:43

    We created a simple 'name' field type that allows mixing both 'key' (e.g., SOUNDEX) and 'pairwise' portions of the answers above.

    Here's the overview:

    1. At index time, each field of the custom type is indexed into a set of subfields, whose values are used for high-recall matching of different kinds of name variation.

    Here's the core of its implementation...

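    List<IndexableField> createFields(SchemaField field, String name) {
        // Derive the (sub)field specs for this name, e.g. phonetic keys and folded forms.
        Collection<FieldSpec> nameFields = deriveFieldsForName(name);
        List<IndexableField> docFields = new ArrayList<>();
        // Emit one Lucene field per derived spec.
        for (FieldSpec fs : nameFields) {
            docFields.add(new Field(fs.getName(), fs.getStringValue(),
                                    fs.getLuceneField()));
        }
        // Also add doc values for the name under the original field name.
        docFields.add(createDocValues(field.getName(), new Name(name)));
        return docFields;
    }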
    

    The heart of this is deriveFieldsForName(name), in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
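
    As a rough illustration of what such key derivation could look like, here is a self-contained Java sketch that produces a folded, lowercased key and a Soundex key per name token. It is not our implementation: the class name NameKeys, the subfield suffixes _folded and _soundex, and the choice of American Soundex are all assumptions made for the example, and the accent folding only covers characters that decompose under Unicode NFD.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical key derivation for name fields; plain Java, no Solr dependency. */
final class NameKeys {
    // Soundex digit for each letter a..z; '0' marks letters that carry no code.
    private static final String CODES = "01230120022455012623010202";

    /** Lowercases and strips combining accents (a rough stand-in for ASCII folding). */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    /** American Soundex: first letter plus up to three digits, padded with zeros. */
    static String soundex(String name) {
        String s = fold(name).replaceAll("[^a-z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(s.charAt(0)));
        char prev = CODES.charAt(s.charAt(0) - 'a');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'h' || c == 'w') {
                continue; // h and w do not separate two letters with the same code
            }
            char code = CODES.charAt(c - 'a');
            if (code != '0' && code != prev) {
                out.append(code);
            }
            prev = code; // a vowel resets prev, so a repeated code after it is kept
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    /** Maps hypothetical subfield names to the key values to index for this name. */
    static Map<String, String> deriveKeys(String fieldName, String name) {
        String folded = fold(name).trim();
        StringBuilder phonetic = new StringBuilder();
        for (String token : folded.split("\\s+")) {
            if (phonetic.length() > 0) {
                phonetic.append(' ');
            }
            phonetic.append(soundex(token));
        }
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put(fieldName + "_folded", folded);
        keys.put(fieldName + "_soundex", phonetic.toString());
        return keys;
    }

    public static void main(String[] args) {
        // "Jos\u00e9 M\u00fcller" is José Müller written with Unicode escapes.
        System.out.println(deriveKeys("name", "Jos\u00e9 M\u00fcller"));
        // prints {name_folded=jose muller, name_soundex=J200 M460}
    }
}
```

    Each entry of the returned map would correspond to one subfield written at index time; the same derivation would have to be applied to the query name so that both sides produce comparable keys.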

    2. At query time, a custom Lucene query tuned for recall is produced first; it uses the same fields as index time.

    Here's the core of its implementation...

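    public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
        // Parse the raw query string into a structured name.
        Name name = parseNameString(externalVal, parser.getParams());
        // Build a recall-oriented query over the fields derived at index time.
        QuerySpec querySpec = buildQuery(name);
        return querySpec.accept(new SolrQueryVisitor(field.getName()));
    }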
    

    The heart of this is the buildQuery(name) method, which should produce a query that is aware of deriveFieldsForName(name) above, so that a given query name finds good candidate names.

    3. Second, Solr's Rerank feature is used to apply a high-precision re-scoring algorithm that reorders the results.

    Here's what this looks like in your query...

    &rq={!myRerank reRankQuery=$rrq} &rrq={!func}myMatch(fieldName, "John Doe")
    

    The content of myMatch could have a pairwise Levenshtein or Jaro-Winkler implementation.
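
    For readers who want to see what a pairwise scorer of that kind might contain, below is a minimal Java sketch using Levenshtein distance normalised to a 0..1 similarity, plus a helper that reorders candidates by that score. It runs outside Solr and is not the proprietary myMatch; the class name NameSimilarity and the normalisation by the longer string's length are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical pairwise name scorer based on Levenshtein distance. */
final class NameSimilarity {

    /** Classic dynamic-programming edit distance (insert, delete, substitute cost 1). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive similarity in [0, 1]; 1.0 means identical. */
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    /** Returns the candidates ordered from most to least similar to the query. */
    static List<String> rerank(String query, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((String c) -> similarity(query, c)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Jane Roe", "Jon Doe", "John Doe");
        System.out.println(rerank("John Doe", candidates));
        // prints [John Doe, Jon Doe, Jane Roe]
    }
}
```

    The split matters for cost: the recall query narrows millions of names to a short candidate list, and only that list pays for the quadratic pairwise comparison.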

    N.B. Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/) to handle more kinds of variations than the ones mentioned above (e.g., missing spaces, cross-language).

  • 2020-12-12 21:44

    For phonetic name search you might also try the Beider-Morse Filter which works pretty well if you have a mixture of names from different countries.

    If you want to use it with a typeahead feature, combine it with an EdgeNGramFilter:

    <fieldType name="phoneticNames" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
      </analyzer>
    </fieldType>
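
    To show why edge n-grams help with typeahead, here is a small Java sketch that generates the leading substrings of a token and checks whether a typed prefix matches one of them. It is a simplified model, not Lucene's filter: the class name EdgeNGrams is made up, and it works on plain text for readability, whereas in the field type above the grams are cut from the Beider-Morse output rather than from the raw name.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of edge n-gram generation for typeahead (not Lucene's filter). */
final class EdgeNGrams {

    /** Returns the prefixes of token with lengths minGram..maxGram, shortest first. */
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams; // empty when the token is shorter than minGram
    }

    /** True if the typed prefix equals one of the grams indexed for the token. */
    static boolean matchesTypeahead(String indexedToken, String typed, int minGram, int maxGram) {
        return edgeNGrams(indexedToken, minGram, maxGram).contains(typed);
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("morrison", 3, 15));
        // prints [mor, morr, morri, morris, morriso, morrison]
        System.out.println(matchesTypeahead("morrison", "morr", 3, 15)); // prints true
        System.out.println(matchesTypeahead("morrison", "mo", 3, 15));   // prints false
    }
}
```

    Because the grams are generated only in the index analyzer, the query side looks up what the user typed as a single term, which is why the query analyzer above has no n-gram filter.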
    
  • 2020-12-12 21:49

    It sounds like you are catering for a corpus with searches that you need to match very loosely?

    If you are doing that you will want to choose your fields and set different boosts to rank your results.

    So have separate "copied" fields in Solr:

    • one field for the exact full name (with filters)
    • a multivalued field with the ASCIIFolding and Lowercase filters...
    • a multivalued field with the SynonymFilterFactory plus ASCIIFolding, Lowercase...
    • a field with PhoneticFilterFactory (with Caverphone or Double-Metaphone)
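
    One way to apply different boosts to such fields is the eDisMax query parser's qf parameter, which takes a space-separated list of field^boost entries. The Java sketch below only assembles the request parameters as a string. The field names (name_exact, name_folded, name_synonym, name_phonetic) and the boost values are placeholders I made up for illustration; sensible values have to come from relevance tuning on your own data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds eDisMax request parameters with per-field boosts (field names are hypothetical). */
final class BoostedNameQuery {

    /** Formats the boosts as a qf value, e.g. "name_exact^10 name_folded^5". */
    static String qf(Map<String, Integer> boosts) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : boosts.entrySet()) {
            joiner.add(e.getKey() + "^" + e.getValue());
        }
        return joiner.toString();
    }

    /** Returns a URL-encoded parameter string for a Solr select request. */
    static String params(String userQuery, Map<String, Integer> boosts) {
        return "defType=edismax&q=" + encode(userQuery) + "&qf=" + encode(qf(boosts));
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> boosts = new LinkedHashMap<>();
        boosts.put("name_exact", 10);   // strictest field gets the highest boost
        boosts.put("name_folded", 5);
        boosts.put("name_synonym", 3);
        boosts.put("name_phonetic", 1); // loosest match contributes least
        System.out.println(qf(boosts));
        // prints name_exact^10 name_folded^5 name_synonym^3 name_phonetic^1
        System.out.println(params("John Doe", boosts));
    }
}
```

    The ordering reflects the idea in the list: an exact match should outrank a folded match, which should outrank a synonym or phonetic match.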

    See also: more non-English Soundex discussion

    Synonyms for names: I don't know whether a public synonym database is available.

    Fuzzy searching: I haven't found it useful. It is based on Levenshtein distance.

    Other filters and index-time analysis give more relevant results.

    Unicode characters in names can be handled with the ASCIIFoldingFilterFactory.

    You are describing solutions up front for expected use cases.

    If you want quality results, plan on tuning your search relevance.

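    This tuning will be especially valuable when attempting to match on synonyms, like MacDonald and McDonald or Carl and Karl. Each of those pairs is only one edit apart by Levenshtein distance, but so are different names such as Carl and Earl, so edit distance alone cannot separate variants from distinct names.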

  • 2020-12-12 22:06

    Found a nickname database; not sure how good it is: http://www.peacockdata2.com/products/pdnickname/

    Note that it's not free.
