Fields not getting sorted in alphabetical order in Elasticsearch

一生所求 2020-12-21 09:23

I have a few documents with a name field in them. I am using the analyzed version of the name field for search and the not_analyzed version for sorting. The sorting

2 answers
  • 2020-12-21 09:58

    Digging into the Elasticsearch documentation, I stumbled upon this:

    • Sorting and Collations

    Case-Insensitive Sorting

    Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields of using a not_analyzed field for sorting:

    PUT /my_index
    {
      "mappings": {
        "user": {
          "properties": {
            "name": {                    //1
              "type": "string",
              "fields": {
                "raw": {                 //2
                  "type":  "string",
                  "index": "not_analyzed"
                }
              }
            }
          }
        }
      }
    }
    
    1. The analyzed name field is used for search.
    2. The not_analyzed name.raw field is used for sorting.

    A search request sorted on name.raw would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.
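    The byte-order effect can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not what Elasticsearch runs internally; it only assumes that comparing Python strings by code point mirrors the byte ordering described above for these ASCII names.

```python
names = ["Boffey", "BROWN", "bailey"]

# A plain sort compares code points, so uppercase letters (65-90)
# come before lowercase letters (97-122).
lexicographic = sorted(names)

# Sorting on a lowercased key ignores case, which is what a
# human reader would call alphabetical order.
alphabetical = sorted(names, key=str.lower)

print(lexicographic)  # ['BROWN', 'Boffey', 'bailey']
print(alphabetical)   # ['bailey', 'Boffey', 'BROWN']
```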

    That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.

    In other words, we need an analyzer that will emit a single lowercase token.

    Following this logic, instead of indexing the raw value, you need to lowercase it using a custom analyzer built on the keyword tokenizer:

    PUT /my_index
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "case_insensitive_sort" : {
              "tokenizer" : "keyword",
              "filter" : ["lowercase"]
            }
          }
        }
      },
      "mappings" : {
        "seing" : {
          "properties" : {
            "name" : {
              "type" : "string",
              "fields" : {
                "raw" : {
                  "type" : "string",
                  "analyzer" : "case_insensitive_sort"
                }
              }
            }
          }
        }
      }
    }
    

    Now ordering by name.raw should sort in alphabetical order, rather than lexicographical.
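    As a rough illustration of what this analyzer chain does to a value, here is a Python sketch. The function names are hypothetical stand-ins for the keyword tokenizer and the lowercase token filter, not Elasticsearch APIs, and the sample name "Mary Ann SMITH" is made up for the example.

```python
def keyword_tokenizer(text):
    # The keyword tokenizer emits the entire input as one token.
    return [text]

def lowercase_filter(tokens):
    # The lowercase filter lowercases every token it receives.
    return [token.lower() for token in tokens]

def case_insensitive_sort(text):
    # Hypothetical model of the analyzer defined in the settings above.
    return lowercase_filter(keyword_tokenizer(text))

names = ["Boffey", "BROWN", "bailey"]

# The single lowercase token acts as the sort key for each document.
sort_keys = {name: case_insensitive_sort(name)[0] for name in names}
ordered = sorted(names, key=sort_keys.get)

print(case_insensitive_sort("Mary Ann SMITH"))  # ['mary ann smith']
print(ordered)                                  # ['bailey', 'Boffey', 'BROWN']
```

    Because the whole value stays in one token, multi-word names still sort as a unit rather than by an arbitrary word inside them.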

    Quick test done on my local machine using Marvel:

    Index structure:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "case_insensitive_sort": {
              "tokenizer": "keyword",
              "filter": [
                "lowercase"
              ]
            }
          }
        }
      },
      "mappings": {
        "user": {
          "properties": {
            "name": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "keyword": {
                  "type": "string",
                  "analyzer": "case_insensitive_sort"
                }
              }
            }
          }
        }
      }
    }
    

    Test data:

    PUT /my_index/user/1
    {
      "name": "Tim"
    }
    
    PUT /my_index/user/2
    {
      "name": "TOM"
    }
    

    Query using raw field:

    POST /my_index/user/_search
    {
      "sort": "name.raw"
    }
    

    Result:

    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "2",
      "_score" : null,
      "_source" : {
        "name" : "TOM"
      },
      "sort" : [
        "TOM"
      ]
    },
    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "name" : "Tim"
      },
      "sort" : [
        "Tim"
      ]
    }
    

    Query using lowercased string:

    POST /my_index/user/_search
    {
      "sort": "name.keyword"
    }
    

    Result:

    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "name" : "Tim"
      },
      "sort" : [
        "tim"
      ]
    },
    {
      "_index" : "my_index",
      "_type" : "user",
      "_id" : "2",
      "_score" : null,
      "_source" : {
        "name" : "TOM"
      },
      "sort" : [
        "tom"
      ]
    }
    

    I suspect the second result is the one you want in your case.

  • 2020-12-21 10:15

    Since Elasticsearch 5.2, you can use a normalizer to set up a case-insensitive sort.

    The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.

    The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query.

    PUT index
    {
      "settings": {
        "analysis": {
          "normalizer": {
            "my_normalizer": {
              "type": "custom",
              "char_filter": [],
              "filter": ["lowercase", "asciifolding"]
            }
          }
        }
      },
      "mappings": {
        "type": {
          "properties": {
            "foo": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
    
    PUT index/type/1
    {
      "foo": "BÀR"
    }
    
    PUT index/type/2
    {
      "foo": "bar"
    }
    
    PUT index/type/3
    {
      "foo": "baz"
    }
    
    POST index/_refresh
    
    GET index/_search
    {
      "query": {
        "match": {
          "foo": "BAR"
        }
      }
    }
    

    The above query matches documents 1 and 2 since BÀR is converted to bar at both index and query time.

    {
      "took": $body.took,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "index",
            "_type": "type",
            "_id": "2",
            "_score": 0.2876821,
            "_source": {
              "foo": "bar"
            }
          },
          {
            "_index": "index",
            "_type": "type",
            "_id": "1",
            "_score": 0.2876821,
            "_source": {
              "foo": "BÀR"
            }
          }
        ]
      }
    }
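    To make the index-time and query-time conversion concrete, here is a small Python sketch. The helper my_normalizer is only an approximation of the lowercase and asciifolding filters (it lowercases and strips combining accents via the standard unicodedata module); the real asciifolding filter covers more characters than this.

```python
import unicodedata

def my_normalizer(value):
    """Rough stand-in for the lowercase + asciifolding filters."""
    lowered = value.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

docs = {"1": "BÀR", "2": "bar", "3": "baz"}

# Index time: each keyword value is stored in its normalized form.
indexed = {doc_id: my_normalizer(value) for doc_id, value in docs.items()}

# Query time: the match query text goes through the same normalizer.
query_term = my_normalizer("BAR")
matches = sorted(doc_id for doc_id, term in indexed.items() if term == query_term)

print(indexed)  # {'1': 'bar', '2': 'bar', '3': 'baz'}
print(matches)  # ['1', '2']
```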
    

    Also, the fact that keywords are converted prior to indexing also means that aggregations return normalized values:

    GET index/_search
    {
      "size": 0,
      "aggs": {
        "foo_terms": {
          "terms": {
            "field": "foo"
          }
        }
      }
    }
    

    returns

    {
      "took": 43,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 0.0,
        "hits": []
      },
      "aggregations": {
        "foo_terms": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "bar",
              "doc_count": 2
            },
            {
              "key": "baz",
              "doc_count": 1
            }
          ]
        }
      }
    }
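    The same idea explains both the buckets above and the case-insensitive sort: everything operates on the normalized values. A minimal Python sketch, again using a hypothetical my_normalizer helper that only approximates the lowercase and asciifolding filters:

```python
import unicodedata
from collections import Counter

def my_normalizer(value):
    # Approximation: lowercase, then strip combining accents.
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

values = ["BÀR", "bar", "baz"]

# Terms aggregation: buckets are keyed by the normalized value.
buckets = Counter(my_normalizer(v) for v in values).most_common()

# Sorting: documents are ordered by the normalized value, not the original.
sorted_values = sorted(values, key=my_normalizer)

print(buckets)        # [('bar', 2), ('baz', 1)]
print(sorted_values)  # ['BÀR', 'bar', 'baz']
```

    In Elasticsearch itself, sorting on the normalized field should then be a matter of adding something like "sort": [{"foo": "asc"}] to the search body.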
    

    Source: the normalizer page of the Elasticsearch reference documentation
