Elasticsearch aggregate on URL hostname

Submitted by 让人想犯罪 on 2019-12-06 05:48:56

Question


I am indexing documents with a field containing a url:

[
    'myUrlField' => 'http://google.com/foo/bar'
]

Now what I'd like to get out of Elasticsearch is an aggregation on the url field.

curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
  "facets": {
    "groupByMyUrlField": {
      "terms": {
        "field": "myUrlField"
      }
    }
  }
}'

This is all well and good, but the default analyzer tokenizes the field so that each part of the url becomes a token, and I get hits for http, google.com, foo, and bar. But I am really only interested in the hostname of the url, the google.com part.

Can I use facets to group by a specific token?

"field": "myUrlField.0"

or something like that?

Querying a "not_analyzed" version of the field is also no good, because I want to group by hostname, not by unique urls.

Would love to be able to do this in elasticsearch and not in my client code. Thanks


Answer 1:


Here is a way to aggregate urls by domain:

First, tokenize the full url as a single token using the keyword tokenizer (which behaves like not_analyzed under the hood), then extract the domain with a regex using a pattern_capture token filter. Finally, discard the original full-url token by setting the preserve_original option to false.

Which leads to:

{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "flags": "CASE_INSENSITIVE",
          "patterns": [
            "https?:\/\/([^/]+)"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "weblink": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}
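The hostname extraction is done entirely by the pattern_capture regex. As a quick client-side sanity check of what that expression captures (an illustration only, not how Elasticsearch applies it), the same regex can be run in Python:

```python
import re

# Same pattern as in the pattern_capture filter above; CASE_INSENSITIVE
# in the filter config corresponds to re.IGNORECASE here.
DOMAIN_RE = re.compile(r"https?://([^/]+)", re.IGNORECASE)

def extract_domain(url):
    """Return the hostname part of a url, or None if it does not match."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

print(extract_domain("http://google.com/foo/bar"))                # google.com
print(extract_domain("HTTPS://en.wikipedia.org/wiki/Wikipedia"))  # en.wikipedia.org
```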

We check how our urls are tokenized:

curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
  "tokens" : [ {
    "token" : "en.wikipedia.org",

This looks good. Now let's aggregate our urls by domain using the aggregations feature (which will deprecate facets in the near future).

curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
  "aggregations": {
    "tokens": {
      "terms": {
        "field": "url"
      }
    }
  }
}'

Output:

"aggregations" : {
    "tokens" : {
      "buckets" : [ {
        "key" : "en.wikipedia.org",
        "doc_count" : 2
      }, {
        "key" : "www.elasticsearch.org",
        "doc_count" : 1
      } ]
    }
  }
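Conceptually, the bucket counts above are just term frequencies over the tokens the analyzer emitted at index time. A client-side sketch of the equivalent computation, using hypothetical sample urls chosen to reproduce those counts:

```python
import re
from collections import Counter

DOMAIN_RE = re.compile(r"https?://([^/]+)", re.IGNORECASE)

# Hypothetical sample documents (not from the original question).
urls = [
    "http://en.wikipedia.org/wiki/Wikipedia",
    "http://en.wikipedia.org/wiki/Elasticsearch",
    "http://www.elasticsearch.org/guide/",
]

# A terms aggregation is essentially a count per emitted token.
buckets = Counter(DOMAIN_RE.search(u).group(1) for u in urls)
for key, doc_count in buckets.most_common():
    print(key, doc_count)
# en.wikipedia.org 2
# www.elasticsearch.org 1
```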

From here you can go further and apply an additional shingle token filter on top of this to match queries such as "en.wikipedia" or "wikipedia.org", if you want partial rather than exact matches when searching for a domain.
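To make the shingle idea concrete: the domain would be split on dots and every run of adjacent labels emitted as a token, which is roughly what a shingle token filter (configured with a "." separator, after splitting the labels) would produce. A client-side sketch of that token set, illustration only, with no claim about the exact filter settings needed:

```python
def domain_shingles(domain):
    """Emit every run of adjacent dot-separated labels, unigrams included."""
    labels = domain.split(".")
    n = len(labels)
    return [
        ".".join(labels[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
    ]

print(domain_shingles("en.wikipedia.org"))
# ['en', 'en.wikipedia', 'en.wikipedia.org', 'wikipedia', 'wikipedia.org', 'org']
```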



Source: https://stackoverflow.com/questions/23867657/elasticsearch-aggregate-on-url-hostname
