Elasticsearch aggregate on URL hostname

Submitted by 让人想犯罪 on 2019-12-06 05:48:56

Question


I am indexing documents with a field containing a url:

[
    'myUrlField' => 'http://google.com/foo/bar'
]

Now what I'd like to get out of Elasticsearch is an aggregation on the url field.

curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{
  "facets": {
    "groupByMyUrlField": {
      "terms": {
        "field": "myUrlField"
      }
    }
  }
}'

This is all well and good, but the default analyzer tokenizes the field so that each part of the url becomes a token, and I get hits for http, google.com, foo, and bar. But I am really only interested in the hostname of the url, the google.com part.

Can I use facets to group by a specific token?

"field": "myUrlField.0"

or something like that?

Querying a "not_analyzed" version of the field is also no good, because I want to group by hostname, not by unique urls.

Would love to be able to do this in elasticsearch and not in my client code. Thanks


Answer 1:


Here is a way to aggregate urls by domain:

First, tokenize the full url as a single token using the keyword tokenizer (which behaves like not_analyzed under the hood), then extract the domain with a regex using a pattern_capture token filter. Finally, discard the original full-url token by setting the preserve_original option to false.

Which leads to:

{
  "settings": {
    "analysis": {
      "filter": {
        "capture_domain_filter": {
          "type": "pattern_capture",
          "preserve_original": false,
          "flags": "CASE_INSENSITIVE",
          "patterns": [
            "https?:\/\/([^/]+)"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_domain_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "weblink": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "domain_analyzer"
        }
      }
    }
  }
}
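The hostname extraction is done entirely by the pattern_capture regex. As a quick client-side sanity check of what that expression captures (an illustration only, not how Elasticsearch applies it), the same regex can be run in Python:

```python
import re

# Same pattern as in the pattern_capture filter above; CASE_INSENSITIVE
# in the filter config corresponds to re.IGNORECASE here.
DOMAIN_RE = re.compile(r"https?://([^/]+)", re.IGNORECASE)

def extract_domain(url):
    """Return the hostname part of a url, or None if it does not match."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

print(extract_domain("http://google.com/foo/bar"))                # google.com
print(extract_domain("HTTPS://en.wikipedia.org/wiki/Wikipedia"))  # en.wikipedia.org
```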

We check how our urls are tokenized:

curl -sXGET http://localhost:9200/url_analyzer/_analyze\?analyzer\=domain_analyzer\&pretty -d 'http://en.wikipedia.org/wiki/Wikipedia' | grep token
  "tokens" : [ {
    "token" : "en.wikipedia.org",

This looks good. Now let's aggregate our urls by domain using the aggregations feature (which will deprecate facets in the near future).

curl -XGET "http://localhost:9200/url_analyzer/_search?pretty" -d'
{
  "aggregations": {
    "tokens": {
      "terms": {
        "field": "url"
      }
    }
  }
}'

Output:

"aggregations" : {
    "tokens" : {
      "buckets" : [ {
        "key" : "en.wikipedia.org",
        "doc_count" : 2
      }, {
        "key" : "www.elasticsearch.org",
        "doc_count" : 1
      } ]
    }
  }
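Conceptually, the bucket counts above are just term frequencies over the tokens the analyzer emitted at index time. A client-side sketch of the equivalent computation, using hypothetical sample urls chosen to reproduce those counts:

```python
import re
from collections import Counter

DOMAIN_RE = re.compile(r"https?://([^/]+)", re.IGNORECASE)

# Hypothetical sample documents (not from the original question).
urls = [
    "http://en.wikipedia.org/wiki/Wikipedia",
    "http://en.wikipedia.org/wiki/Elasticsearch",
    "http://www.elasticsearch.org/guide/",
]

# A terms aggregation is essentially a count per emitted token.
buckets = Counter(DOMAIN_RE.search(u).group(1) for u in urls)
for key, doc_count in buckets.most_common():
    print(key, doc_count)
# en.wikipedia.org 2
# www.elasticsearch.org 1
```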

From here you can go further and apply an additional shingle token filter on top of this to match queries such as "en.wikipedia" or "wikipedia.org", if you want partial rather than exact matches when searching for a domain.
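To make the shingle idea concrete: the domain would be split on dots and every run of adjacent labels emitted as a token, which is roughly what a shingle token filter (configured with a "." separator, after splitting the labels) would produce. A client-side sketch of that token set, illustration only, with no claim about the exact filter settings needed:

```python
def domain_shingles(domain):
    """Emit every run of adjacent dot-separated labels, unigrams included."""
    labels = domain.split(".")
    n = len(labels)
    return [
        ".".join(labels[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
    ]

print(domain_shingles("en.wikipedia.org"))
# ['en', 'en.wikipedia', 'en.wikipedia.org', 'wikipedia', 'wikipedia.org', 'org']
```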



Source: https://stackoverflow.com/questions/23867657/elasticsearch-aggregate-on-url-hostname
