I have some text in Elasticsearch containing URLs in various formats (http://www, www.). What I want to do is search for all texts containing, e.g., google.com.
For the current search I use something like this query:
query = {
  "query": {
    "bool": {
      "must": [
        { "range": { "cdate": { "gt": dfrom, "lte": dto } } },
        { "query_string": {
            "default_operator": "AND",
            "default_field": "text",
            "analyze_wildcard": "true",
            "query": searchString
        } }
      ]
    }
  }
}
But a query for google.com never returns any results, while searching for, e.g., the term "test" (without the quotes) works fine. I want to keep using query_string because I'd like to use boolean operators, but I really need to be able to search for substrings, not only for whole words.
Thank you!
It is true indeed that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, and thus google.com will not be found.
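To see why, here is a minimal sketch (not the real analyzer, just a rough regex approximation of how the standard analyzer lowercases text and splits on punctuation while keeping dotted runs like hostnames together):

```python
import re

def standard_like_tokens(text):
    # Rough approximation of the standard analyzer: lowercase the input and
    # split on punctuation, but keep dotted runs (like hostnames) as one token.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

tokens = standard_like_tokens("blabla bla http://www.google.com blabla")
print(tokens)  # ['blabla', 'bla', 'http', 'www.google.com', 'blabla']
```

The index ends up containing the term www.google.com but never the term google.com, so an exact match on google.com finds nothing.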
So the standard analyzer alone will not help here; we need a token filter that will correctly transform URL tokens. Another option, if your text field contained only URLs, would have been the UAX URL email tokenizer, but since the field can contain any other text (i.e. user comments), it won't work.
Fortunately, there's a new plugin around called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) )
First, you need to install the plugin:
bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip
Then, we can start playing. We need to create the proper analyzer for your text field:
curl -XPUT localhost:9200/test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [ "url_host" ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'
With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer (note that the request goes against the test index we just created):

curl -XGET 'localhost:9200/test/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'
We'll get the following tokens:
{
  "tokens" : [
    { "token" : "blabla",         "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "bla",            "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
    { "token" : "www.google.com", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
    { "token" : "google.com",     "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
    { "token" : "com",            "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
    { "token" : "blabla",         "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 }
  ]
}
As you can see, the http://www.google.com part is tokenized into www.google.com, google.com (i.e. what you expected) and com.
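The behavior of the url_host filter with "passthrough": true can be sketched like this (a simplified Python illustration, not the plugin's actual code): non-URL tokens pass through unchanged, and a URL token is expanded into its host plus every shorter domain suffix:

```python
from urllib.parse import urlparse

def url_host_filter(tokens):
    # Sketch of the analysis-url "host" filter with passthrough enabled:
    # tokens that yield a dotted hostname are expanded into all domain
    # suffixes; everything else is emitted as-is.
    out = []
    for tok in tokens:
        parsed = urlparse(tok if "://" in tok else "http://" + tok)
        host = parsed.hostname or ""
        if "." in host:
            parts = host.split(".")
            for i in range(len(parts)):
                out.append(".".join(parts[i:]))
        else:
            out.append(tok)  # passthrough for non-URL tokens
    return out

tokens = url_host_filter(["blabla", "bla", "http://www.google.com", "blabla"])
print(tokens)
# ['blabla', 'bla', 'www.google.com', 'google.com', 'com', 'blabla']
```

This matches the token stream returned by the _analyze call above: each suffix becomes a searchable term of its own.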
So now, if your searchString is google.com, you'll be able to find all the documents whose text field contains google.com (or www.google.com).
Remember that full-text search is always about exact matches of terms in the inverted index, unless you perform a wildcard search.
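That last point can be shown with a hypothetical mini inverted index (term to set of document ids) built from the analyzed tokens above; the document ids and second document are made up for illustration:

```python
# Build a toy inverted index: term -> set of doc ids.
docs = {
    1: ["blabla", "bla", "www.google.com", "google.com", "com", "blabla"],
    2: ["test", "something", "else"],
}
index = {}
for doc_id, tokens in docs.items():
    for tok in tokens:
        index.setdefault(tok, set()).add(doc_id)

# An exact term lookup now succeeds, because "google.com" was
# indexed as its own term by the url_host analyzer.
print(index.get("google.com", set()))  # {1}
print(index.get("test", set()))        # {2}
```

Without the url_host analyzer, the only indexed term would have been www.google.com, and the lookup for google.com would return an empty set.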