ElasticSearch - Searching For Human Names

心在旅途 2020-12-23 09:42

I have a large database of names, primarily from Scotland. We're currently producing a prototype to replace an existing piece of software which carries out the search. This

1 answer
  • 2020-12-23 10:23

    First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543

    If you go there, switch to the "Analysis" tab to see how the text is transformed.

    Note, for example, that Heaney is tokenized as [hn, heanei] by the search_analyzer but as [HN, heanei] by the index_analyzer. The metaphone term differs in case between the two, so that term never matches.

    The fuzzy query does not analyze its text at query time, so you end up comparing Heavey with heanei. Their Damerau-Levenshtein distance is greater than your parameters allow.
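
    To see the size of that gap, you can compute the distance by hand. The following is a minimal sketch of the optimal-string-alignment variant of Damerau-Levenshtein in bash. The function name osa_distance is made up for illustration, and this is not how Lucene computes it internally; it only reproduces the number. Keep in mind that Lucene 4 supports at most 2 edits.

    ```shell
    # Optimal string alignment distance: insertions, deletions, substitutions
    # and transpositions of adjacent characters each cost 1.
    # The comparison is case-sensitive, like an unanalyzed term.
    osa_distance() {
        local a="$1" b="$2"
        local la=${#a}
        local lb=${#b}
        local w=$((lb + 1))          # row width of the flattened table
        local -a d
        local i j cost best cand
        for ((i = 0; i <= la; i++)); do d[i*w]=$i; done
        for ((j = 0; j <= lb; j++)); do d[j]=$j; done
        for ((i = 1; i <= la; i++)); do
            for ((j = 1; j <= lb; j++)); do
                if [[ "${a:i-1:1}" == "${b:j-1:1}" ]]; then cost=0; else cost=1; fi
                best=$(( d[(i-1)*w+j] + 1 ))          # deletion
                cand=$(( d[i*w+j-1] + 1 ))            # insertion
                (( cand < best )) && best=$cand
                cand=$(( d[(i-1)*w+j-1] + cost ))     # substitution
                (( cand < best )) && best=$cand
                if (( i > 1 && j > 1 )) &&
                   [[ "${a:i-1:1}" == "${b:j-2:1}" && "${a:i-2:1}" == "${b:j-1:1}" ]]; then
                    cand=$(( d[(i-2)*w+j-2] + 1 ))    # transposition
                    (( cand < best )) && best=$cand
                fi
                d[i*w+j]=$best
            done
        done
        echo "${d[la*w+lb]}"
    }

    # osa_distance Heavey heanei   prints 3 (H/h, v/n, y/i)
    # osa_distance heavey heanei   prints 2
    ```

    So the raw query text is three edits away from the indexed term, and even a lowercased query would sit right at the limit.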

    What you really want is the fuzzy functionality of match. A match query does analyze its text at query time, and it has a fuzziness parameter.
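
    As a small sketch of what that looks like, the helper below builds the request body for a match query with fuzziness. The helper name fuzzy_match_body is hypothetical, and it does no JSON escaping, so it is only meant for plain alphanumeric input.

    ```shell
    # Build a match query body with fuzziness given as an edit distance.
    # $1 = field name, $2 = search text, $3 = allowed edits.
    fuzzy_match_body() {
        local field="$1" text="$2" edits="$3"
        printf '{"query":{"match":{"%s":{"query":"%s","fuzziness":%d}}}}\n' \
            "$field" "$text" "$edits"
    }

    # Example call (needs a running cluster, so it is left commented out):
    # curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?pretty" \
    #     -d "$(fuzzy_match_body pty_surname heavey 1)"
    ```

    Because match runs the field's analyzer on "heavey" first, the fuzzy comparison happens between terms that went through the same processing.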

    As for the fuzziness, this changed a bit in Lucene 4. Before, it was typically specified as a float. Now it should be specified as the allowed edit distance. There's an outstanding pull request to clarify that: https://github.com/elasticsearch/elasticsearch/pull/4332/files
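
    If you have old float values lying around, here is a rough sketch of how they relate to edit distances. It assumes, as far as I understand Lucene 4's legacy conversion, that a similarity between 0 and 1 is scaled by the term length and capped at 2 edits. The function name is made up, and you should check the behaviour of your own version before relying on it.

    ```shell
    # Approximate a legacy float similarity (0 < s < 1) as an edit distance.
    # Assumption: edits = floor((1 - s) * term length), capped at 2.
    legacy_similarity_to_edits() {
        local similarity="$1" term_length="$2"
        awk -v s="$similarity" -v n="$term_length" 'BEGIN {
            edits = int((1 - s) * n)
            if (edits > 2) edits = 2
            print edits
        }'
    }

    # legacy_similarity_to_edits 0.5 6   prints 2
    # legacy_similarity_to_edits 0.8 6   prints 1
    ```

    Specifying the distance directly avoids this dependence on term length.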

    The reason you are getting people without the forename Michael is that you are using bool.should, which has OR semantics. It is sufficient that one clause matches, although the more clauses that match, the higher the score.
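
    If both names are required, bool.must gives AND semantics instead. The sketch below builds such a body. The field name pty_forename is an assumption on my part (I only have the surname field in the example), and required_names_body is a made-up helper that does no JSON escaping.

    ```shell
    # Build a bool query where forename AND surname must both match.
    # $1 = forename, $2 = surname (fuzzy, one edit allowed).
    required_names_body() {
        local forename="$1" surname="$2"
        printf '{"query":{"bool":{"must":[{"match":{"pty_forename":{"query":"%s"}}},{"match":{"pty_surname":{"query":"%s","fuzziness":1}}}]}}}\n' \
            "$forename" "$surname"
    }

    # Example call (needs a running cluster, so it is left commented out):
    # curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?pretty" \
    #     -d "$(required_names_body Michael heavey)"
    ```

    You can still add should clauses next to the must clauses; they then only influence the score.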

    Lastly, combining all that filtering into the same term is not necessarily the best approach. For example, you cannot tell exact spellings apart and boost them. Consider using a multi_field to process the field in several ways.
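
    Once the spellings and the phonetic codes live in separate sub-fields, boosting the exact spelling becomes straightforward. The following sketch assumes a main field pty_surname with a metaphone sub-field called pty_surname.metaphone; the helper name and the boost value of 3 are arbitrary choices for illustration.

    ```shell
    # Build a query that scores exact spellings above phonetic-only matches.
    # $1 = surname to search for (no JSON escaping is done).
    boosted_surname_body() {
        local surname="$1"
        printf '{"query":{"bool":{"should":[{"match":{"pty_surname":{"query":"%s","boost":3}}},{"match":{"pty_surname.metaphone":{"query":"%s"}}}]}}}\n' \
            "$surname" "$surname"
    }

    # Example call (needs a running cluster, so it is left commented out):
    # curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?pretty" \
    #     -d "$(boosted_surname_body heavey)"
    ```

    A document that matches both clauses scores highest, while one that only sounds alike still shows up further down.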

    Here's an example you can play with, with the curl commands to recreate it below. I'd skip using the "porter" stemmer entirely for this, however. I kept it just to show how multi_field works. Using a combination of match, match with fuzziness and phonetic matching should get you far. (Make sure you don't allow fuzziness when you do phonetic matching - or you'll get uselessly fuzzy matching. :-)

    #!/bin/bash
    
    export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
    
    # Create indexes
    
    curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
        "settings": {
            "analysis": {
                "text": [
                    "Michael",
                    "Heaney",
                    "Heavey"
                ],
                "analyzer": {
                    "metaphone": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "my_metaphone"
                        ]
                    },
                    "porter": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "porter_stem"
                        ]
                    }
                },
                "filter": {
                    "my_metaphone": {
                        "encoder": "metaphone",
                        "replace": false,
                        "type": "phonetic"
                    }
                }
            }
        },
        "mappings": {
            "jr": {
                "properties": {
                    "pty_surname": {
                        "type": "multi_field",
                        "fields": {
                            "pty_surname": {
                                "type": "string",
                                "analyzer": "simple"
                            },
                            "metaphone": {
                                "type": "string",
                                "analyzer": "metaphone"
                            },
                            "porter": {
                                "type": "string",
                                "analyzer": "porter"
                            }
                        }
                    }
                }
            }
        }
    }'
    
    
    # Index documents
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
    {"index":{"_index":"play","_type":"jr"}}
    {"pty_surname":"Heaney"}
    {"index":{"_index":"play","_type":"jr"}}
    {"pty_surname":"Heavey"}
    '
    
    # Do searches
    
    curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
    {
        "query": {
            "bool": {
                "should": [
                    {
                        "bool": {
                            "should": [
                                {
                                    "match": {
                                        "pty_surname": {
                                            "query": "heavey"
                                        }
                                    }
                                },
                                {
                                    "match": {
                                        "pty_surname": {
                                            "query": "heavey",
                                            "fuzziness": 1
                                        }
                                    }
                                },
                                {
                                    "match": {
                                        "pty_surname.metaphone": {
                                            "query": "heavey"
                                        }
                                    }
                                },
                                {
                                    "match": {
                                        "pty_surname.porter": {
                                            "query": "heavey"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
    '
    