Elasticsearch custom analyzer for hyphens, underscores, and numbers

Posted by 柔情痞子 on 2019-12-04 13:34:56

You could change your analysis to use a pattern analyzer that discards the digits and underscores:

{
   "analysis": {
      "analyzer": {
          "word_only": {
              "type": "pattern",
              "pattern": "([^\\p{L}]+)"
          }
       }
    }
}

Using the analyze API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

returns:

"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]
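
To see why only "win" and "ent" survive, the analyzer's behaviour can be approximated outside Elasticsearch. This is a minimal Python sketch, not the Elasticsearch implementation: the function name word_only_tokens is hypothetical, and because Python's re module has no \p{L} class, the separator is written as [\W\d_]+, a close stand-in for "one or more non-letters". It also assumes the pattern analyzer's default lowercasing.

```python
import re

# Approximation of "[^\p{L}]+": a run of characters that are not letters.
# \W covers non-word characters; \d and _ are added because \w would
# otherwise treat digits and underscores as part of a word.
NON_LETTERS = re.compile(r"[\W\d_]+")


def word_only_tokens(text):
    """Split on non-letter runs and lowercase, roughly like word_only."""
    pieces = NON_LETTERS.split(text)
    # Splitting leaves empty strings at the edges (e.g. after a trailing "_1").
    return [piece.lower() for piece in pieces if piece]


if __name__ == "__main__":
    print(word_only_tokens("WIN_8_ENT_1"))
```

The digits and underscores act purely as separators here, so they never reach the index for this field.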

Your mapping would become:

{
    "event": {
        "properties": {
            "ipaddress": {
                 "type": "string"
             },
             "hostname": {
                 "type": "string",
                 "analyzer": "word_only",
                 "fields": {
                     "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                     }
                 }
             }
         }
    }
}

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
       }
   }
}
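
One thing to keep in mind with this query is that each field applies its own analysis to the query text. The sketch below is an illustrative Python model only (hypothetical helper names, letters approximated with [\W\d_]+, scoring ignored) of which hostnames "WIN_1" would match: the hostname field sees just the token "win", while hostname.raw is compared against the whole, unmodified string.

```python
import re

NON_LETTERS = re.compile(r"[\W\d_]+")  # stand-in for "[^\p{L}]+"


def analyze(text):
    """Rough model of the word_only analyzer: letters only, lowercased."""
    return {piece.lower() for piece in NON_LETTERS.split(text) if piece}


def multi_match(query, hostnames):
    """Return hostnames matched via either the analyzed or the raw field."""
    query_tokens = analyze(query)
    hits = []
    for hostname in hostnames:
        analyzed_hit = bool(query_tokens & analyze(hostname))  # "hostname"
        raw_hit = query == hostname                            # "hostname.raw"
        if analyzed_hit or raw_hit:
            hits.append(hostname)
    return hits


if __name__ == "__main__":
    hosts = ["WIN_8_ENT_1", "WIN_1", "LINUX_1"]
    print(multi_match("WIN_1", hosts))
```

In this model the "_1" in the query does not narrow the analyzed match, so every host containing "win" is returned; the raw field only adds an exact hit (which in Elasticsearch would typically affect ranking rather than the set of results).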

Here's the analyzer and queries I ended up with:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "hostname_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "hostname_filter": {
                    "type": "pattern_capture",
                    "preserve_original": 0,
                    "patterns": [
                        "(\\p{Ll}{3,})"
                    ]
                }
            },
            "analyzer": {
                "hostname_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [  "lowercase", "hostname_filter" ]
                }
            }
        }
    }
}
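
As a rough illustration of what this analyzer chain produces, here is a small Python sketch. It is an approximation under stated assumptions, not the Lucene filter itself: hostname_tokens is a hypothetical name, [a-z]{3,} is an ASCII-only stand-in for (\p{Ll}{3,}), and the sketch does not model what the real filter does with a token that contains no match at all.

```python
import re

# ASCII stand-in for "(\p{Ll}{3,})": three or more lowercase letters.
CAPTURE = re.compile(r"[a-z]{3,}")


def hostname_tokens(text):
    """Whitespace tokenizer -> lowercase filter -> capture letter runs."""
    tokens = []
    for chunk in text.split():                   # whitespace tokenizer
        lowered = chunk.lower()                  # lowercase filter
        tokens.extend(CAPTURE.findall(lowered))  # one token per match
    return tokens


if __name__ == "__main__":
    print(hostname_tokens("WIN_8_ENT_1"))
```

The filter order matters: the pattern only matches lowercase letters, so lowercasing has to run before the capture step. The {3,} quantifier also means letter runs shorter than three characters yield no capture.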

Queries:

Find host names starting with:

{
    "query": {
        "prefix": {
            "hostname.raw": "WIN_8"
        }
    }
}
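
Because hostname.raw is not analyzed and a prefix query does not analyze its input, the comparison is effectively a case-sensitive "starts with" on the stored value. A minimal Python model of that behaviour (prefix_matches is a hypothetical helper, and the sample hostnames are made up for illustration):

```python
def prefix_matches(prefix, raw_values):
    """Model a term-level prefix query: plain, case-sensitive startswith."""
    return [value for value in raw_values if value.startswith(prefix)]


if __name__ == "__main__":
    hosts = ["WIN_8_ENT_1", "WIN_7_PRO_2", "win_8_ent_3"]
    print(prefix_matches("WIN_8", hosts))
```

Under this model a lowercase prefix such as "win_8" would not find "WIN_8_ENT_1", so the prefix must be supplied in the same case as the indexed hostname.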

Find host names containing:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN"
       }
   }
}

Thanks to Dan for getting me in the right direction.

When ES 1.4 is released, there will be a new token filter called 'keep types' that lets you keep only tokens of certain types once the string is tokenized (e.g. words only or numbers only).

Check it out here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter

This may be a more convenient solution for your needs in the future.

It looks like you want to apply two different types of search to your hostname field: one for exact matches, and one for a wildcard-style match (in your specific case, perhaps a prefix query).

After trying to implement many different kinds of search using several different analyzers, I've found it is sometimes simpler to add another field for each type of search you want to do. Is there a reason you do not want to add another field, like the following?

{ "ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1", "system": "WIN" }
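
If you go this route, the extra field can be filled in by the client before the document is indexed. The following Python sketch shows one way to do that; with_system_field is a hypothetical helper, and it assumes the "system" value is simply the leading run of ASCII letters in the hostname, which is a guess based on the single WIN_8_ENT_1 example above.

```python
import re

# Assumption: the system name is the hostname's leading run of letters.
LEADING_LETTERS = re.compile(r"[A-Za-z]+")


def with_system_field(doc):
    """Return a copy of doc with "system" derived from doc["hostname"]."""
    enriched = dict(doc)  # leave the caller's dict untouched
    match = LEADING_LETTERS.match(doc["hostname"])
    if match:
        enriched["system"] = match.group(0)
    return enriched


if __name__ == "__main__":
    event = {"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}
    print(with_system_field(event))
```

Hostnames that do not start with a letter are returned without a "system" field in this sketch; a real pipeline would need to decide how to handle them.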

Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).

I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.
