case insensitive elasticsearch with uppercase or lowercase

Submitted by 怎甘沉沦 on 2019-12-21 21:28:11

Question


I am working with Elasticsearch and have run into a problem. If anybody can give me a hint, I would be really thankful.

I want to analyze a field, "name" or "description", which consists of different entries. For example, someone wants to search for Sara. Whether they enter SARA, SAra or sara, they should get Sara. Elasticsearch uses an analyzer which makes everything lowercase.

I want the search to be case insensitive: regardless of whether the user types the name in uppercase or lowercase, he/she should get results. I am using an ngram filter to search names, plus a lowercase filter to make it case insensitive. But I want to make sure that a person gets results whether they enter the name in uppercase or lowercase.

Is there any way to do this in Elasticsearch?

{"settings": {

        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 80
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },

I have attached an example.js file, which includes a JSON example, and a search.txt file to explain my problem. I hope my problem is clearer now. Both files are in this OneDrive folder: https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc


Answer 1:


Is there any specific reason you are using ngram? Elasticsearch applies the same analyzer to the query as to the text you index, unless search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.

I created an index with the following settings and mapping:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "custom_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings": {
      "test_mapping": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "custom_analyzer"
            },
            "Description": {
               "type": "string",
               "analyzer": "custom_analyzer"
            }
         }
      }
   }
}

I indexed two documents. Doc 1:

PUT /test_index/test_mapping/1
    {
        "name" : "Sara Connor",
        "Description" : "My real name is Sarah Connor."
    }

Doc 2:

PUT /test_index/test_mapping/2
    {
        "name" : "John Connor",
        "Description" : "I might save humanity someday."
    }

Then I ran a simple search:

POST /test_index/_search
{
    "query" : {
        "match" : {
            "name" : "SARA"
        }
    }
}

This returns only the first document. I also tried "sara" and "Sara", with the same results.

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
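
Why all three spellings hit the same document can be sketched outside Elasticsearch. The Python below is only a rough model, not Elasticsearch's actual implementation: the analyze and search helpers are hypothetical names, and the regular expression is a crude stand-in for the standard tokenizer. It shows that when the same lowercase analysis is applied to both the indexed text and the query, the case of the input no longer matters.

```python
import re

def analyze(text):
    """Rough stand-in for a standard tokenizer plus lowercase filter."""
    # Split on runs of word characters; the real standard tokenizer
    # follows Unicode word-boundary rules and is more involved.
    return [token.lower() for token in re.findall(r"\w+", text)]

# The "name" values of the two documents indexed above.
docs = {
    "1": "Sara Connor",
    "2": "John Connor",
}

# Build a tiny inverted index: token -> set of document ids.
inverted_index = {}
for doc_id, name in docs.items():
    for token in analyze(name):
        inverted_index.setdefault(token, set()).add(doc_id)

def search(query):
    """Return ids of documents that contain any analyzed query token."""
    hits = set()
    for token in analyze(query):
        hits |= inverted_index.get(token, set())
    return sorted(hits)

for query in ("SARA", "Sara", "sara"):
    print(query, "->", search(query))
```

In this model every query is reduced to the token "sara" before the lookup, so only document 1 is returned in each case.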



Answer 2:


The analysis process runs twice for full-text (analysed) fields: once when the data is indexed and again when you search. It is worth noting that a search returns the original JSON exactly as you submitted it; analysis is only used to create tokens for the inverted index. The key steps for your solution are:

  1. Create two analysers, one with an ngram filter and one without. You don't need to run the search query through ngram, because at search time you already have the exact value you want to find.
  2. Define the mappings for your fields correctly. A field mapping has two settings for analysers: analyzer, used at index time, and search_analyzer, used at search time. If you specify only analyzer, it is used at both index and search time.

You can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
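
The two steps above can be sketched in plain Python to show what each analyser contributes. This is a simplified model under stated assumptions, not Elasticsearch code: tokenize, ngrams, index_analyzer and search_analyzer are hypothetical helpers, the regular expression only approximates the standard tokenizer, and the gram sizes 1 to 5 are chosen for illustration.

```python
import re

def tokenize(text):
    # Crude stand-in for the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)

def ngrams(token, min_gram, max_gram):
    # All substrings of token with length between min_gram and max_gram.
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

def index_analyzer(text, min_gram=1, max_gram=5):
    # Index time: tokenize, expand each token to ngrams, then lowercase.
    return {gram.lower()
            for token in tokenize(text)
            for gram in ngrams(token, min_gram, max_gram)}

def search_analyzer(text):
    # Search time: tokenize and lowercase only, no ngrams.
    return {token.lower() for token in tokenize(text)}

indexed_terms = index_analyzer("Sara_11_01")

for query in ("sara", "SARA", "SaRa"):
    # A query matches if any of its terms is among the indexed terms.
    matched = bool(search_analyzer(query) & indexed_terms)
    print(query, "->", matched)
```

In this sketch the indexed value contributes the gram "sara", and every spelling of the query is lowercased to that same term. One consequence of the model is that a query term longer than max_gram, such as "sara_11", has no matching gram, so max_gram should be at least as long as the longest term you expect users to type.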

Your code should look like this:

PUT /my_index
{
   "settings": {
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 1,
               "max_gram": 5
            }
         },
         "analyzer": {
            "index_store_ngram": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "ngram_filter",
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "index_store_ngram",
               "search_analyzer": "standard"
            }
         }
      }
   }
}

POST /my_index/my_type/1
{
     "name": "Sara_11_01"
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "sara"
        }
    }
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "SARA"
        }
    }
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "SaRa"
        }
    }
}
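
To see why the search side is kept free of ngrams, the following Python sketch compares both choices on a toy index. It is an approximation under stated assumptions rather than Elasticsearch itself: ngram_analyzer, plain_analyzer and search are hypothetical helpers, and the second name, "Maria_02_03", is invented purely for illustration.

```python
import re

def ngram_analyzer(text, min_gram=1, max_gram=5):
    # Tokenize on word characters, emit lowercase ngrams of each token.
    terms = set()
    for token in re.findall(r"\w+", text):
        for n in range(min_gram, max_gram + 1):
            for i in range(len(token) - n + 1):
                terms.add(token[i:i + n].lower())
    return terms

def plain_analyzer(text):
    # Tokenize and lowercase only, as a stand-in for the standard analyzer.
    return {token.lower() for token in re.findall(r"\w+", text)}

# "Maria_02_03" is a hypothetical second document, not from the question.
names = ["Sara_11_01", "Maria_02_03"]
index = {name: ngram_analyzer(name) for name in names}

def search(query, analyzer):
    # Return the names that share at least one term with the analyzed query.
    query_terms = analyzer(query)
    return [name for name, terms in index.items() if query_terms & terms]

print("plain search analyzer:", search("SARA", plain_analyzer))
print("ngram search analyzer:", search("SARA", ngram_analyzer))
```

With the plain search analyzer only "Sara_11_01" is returned. With ngrams applied to the query as well, single-letter grams such as "a" and "r" also match "Maria_02_03", which is the kind of over-matching that a separate search_analyzer avoids.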

Edit 1: updated code for a new example provided in the question



Source: https://stackoverflow.com/questions/40007971/case-insensitive-elasticsearch-with-uppercase-or-lowercase
