I have a few documents with a name field in them. I am using the analyzed version of the name field for search and the not_analyzed version
for sorting purposes. The sorting
Digging into the Elasticsearch documentation, I stumbled upon this:
Case-Insensitive Sorting
Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields of using a not_analyzed field for sorting:
PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": { //1
          "type": "string",
          "fields": {
            "raw": { //2
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
(1) The analyzed name field is used for search.
(2) The not_analyzed name.raw field is used for sorting.
The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.
That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.
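The difference between the two orderings can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: it sorts the three names from the example once by plain string comparison (which goes by code point, so capitals come first) and once on a lowercased key.

```python
names = ["Boffey", "BROWN", "bailey"]

# Plain string comparison goes by code point, so uppercase letters sort first.
byte_order = sorted(names)

# Sorting on a lowercased key gives the order a human reader expects.
case_insensitive = sorted(names, key=str.lower)

print(byte_order)        # ['BROWN', 'Boffey', 'bailey']
print(case_insensitive)  # ['bailey', 'Boffey', 'BROWN']
```

Lowercasing the sort key here plays the same role as indexing a lowercased copy of the value for sorting.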
In other words, we need an analyzer that will emit a single lowercase token:
Following this logic, instead of indexing the raw value, you need to lowercase it using a custom analyzer built on the keyword tokenizer:
PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}
Now ordering by name.raw should sort in alphabetical order, rather than lexicographical.
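To see why the keyword tokenizer matters here, the analysis chain can be modelled very roughly in Python. This is a simplified sketch, not how Elasticsearch implements analysis: `whitespace_tokenizer` is a hypothetical stand-in for a word-splitting tokenizer, and "John SMITH" is a made-up value used only for contrast.

```python
def keyword_tokenizer(text):
    # Emits the whole input as one token, like the "keyword" tokenizer.
    return [text]

def whitespace_tokenizer(text):
    # Hypothetical stand-in for a word-splitting tokenizer (contrast only).
    return text.split()

def lowercase_filter(tokens):
    # Lowercases every token, like the "lowercase" token filter.
    return [t.lower() for t in tokens]

def analyze(text, tokenizer, filters):
    # Run the tokenizer, then each filter in order.
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

sort_tokens = analyze("John SMITH", keyword_tokenizer, [lowercase_filter])
word_tokens = analyze("John SMITH", whitespace_tokenizer, [lowercase_filter])

print(sort_tokens)  # ['john smith']
print(word_tokens)  # ['john', 'smith']
```

A sort needs exactly one value per document, which is why the single lowercased token from the keyword-style chain is the one to sort on.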
Quick test done on my local machine using Marvel:
Index structure:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}
Test data:
PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}
Query using raw field:
POST /my_index/user/_search
{
  "sort": "name.raw"
}
Result:
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}
Query using lowercased string:
POST /my_index/user/_search
{
  "sort": "name.keyword"
}
Result:
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}
I suspect that the second result is the one you want in your case.
The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.
The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query.
PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}
PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}
The above query matches documents 1 and 2 since BÀR is converted to bar at both index and query time.
{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "foo": "bar"
        }
      },
      {
        "_index": "index",
        "_type": "type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "foo": "BÀR"
        }
      }
    ]
  }
}
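The matching behaviour above can be approximated in a few lines of Python. This is only a sketch under assumptions: the `my_normalizer` function below imitates the lowercase and asciifolding filters by lowercasing and stripping combining accents, whereas the real asciifolding filter covers many more characters. The dictionary-based "index" is likewise just an illustration.

```python
import unicodedata

def my_normalizer(value):
    # Rough stand-in for "lowercase" + "asciifolding": lowercase the value,
    # decompose accented characters, then drop the combining marks.
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Index time: the normalized value is what gets stored as the term.
documents = {1: "BÀR", 2: "bar", 3: "baz"}
indexed = {doc_id: my_normalizer(value) for doc_id, value in documents.items()}

# Search time: the query text goes through the same normalizer.
query_term = my_normalizer("BAR")
matches = sorted(doc_id for doc_id, term in indexed.items() if term == query_term)

print(matches)  # [1, 2]
```

Because the same function runs on both sides, the original casing and accents of either the document or the query no longer affect the match.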
Also, the fact that keywords are converted prior to indexing also means that aggregations return normalized values:
GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}
returns
{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2
        },
        {
          "key": "baz",
          "doc_count": 1
        }
      ]
    }
  }
}
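The bucket keys in this response follow directly from normalizing before indexing. As a hedged illustration (again plain Python, not Elasticsearch, with a simplified `normalize` helper that only approximates the lowercase and asciifolding filters), counting the normalized values of the three documents gives the same keys and counts:

```python
import unicodedata
from collections import Counter

def normalize(value):
    # Simplified approximation of "lowercase" + "asciifolding".
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

values = ["BÀR", "bar", "baz"]

# Count the normalized terms, the way a terms aggregation counts indexed terms.
counts = Counter(normalize(v) for v in values)
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]

print(buckets)
# [{'key': 'bar', 'doc_count': 2}, {'key': 'baz', 'doc_count': 1}]
```

The original spelling BÀR never appears as a key because only the normalized term is indexed.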
Source: Normaliser