Question
I'm currently running haystack with an elasticsearch backend, and I'm building an autocomplete for city names. The problem is that SearchQuerySet gives me different results than the same query executed directly in elasticsearch. The haystack results look wrong to me, while the elasticsearch results are the ones I expect.
I'm using: Django 1.5.4, django-haystack 2.1.0, pyelasticsearch 0.6.1, elasticsearch 0.90.3
Using the following example data:
- Midfield
- Midland City
- Midway
- Minor
- Minturn
- Miami Beach
Using either
SearchQuerySet().models(Geoname).filter(name_auto='mid')
or
SearchQuerySet().models(Geoname).autocomplete(name_auto='mid')
The result always contains all 6 names, including Min* and Mia*. However, querying elasticsearch directly returns the right data:
"query": {
    "filtered": {
        "query": {
            "match_all": {}
        },
        "filter": {
            "term": {"name_auto": "mid"}
        }
    }
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "haystack",
                "_type": "modelresult",
                "_id": "csi.geoname.4075977",
                "_score": 1,
                "_source": {
                    "name_auto": "Midfield"
                }
            },
            {
                "_index": "haystack",
                "_type": "modelresult",
                "_id": "csi.geoname.4075984",
                "_score": 1,
                "_source": {
                    "name_auto": "Midland City"
                }
            },
            {
                "_index": "haystack",
                "_type": "modelresult",
                "_id": "csi.geoname.4075989",
                "_score": 1,
                "_source": {
                    "name_auto": "Midway"
                }
            }
        ]
    }
}
The behavior is the same with other examples. My guess is that, through haystack, the query string is being split and analyzed into all possible "min_gram" groups of characters, and that's why it returns wrong results.
I'm not sure whether I'm doing or understanding something wrong, or whether this is how haystack is supposed to work, but I need the haystack results to match the elasticsearch results.
So, how can I fix the issue or make it work?
My summarized objects look as follows:
Model:
class Geoname(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)
Index:
class GeonameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = indexes.EdgeNgramField(model_attr='name')

    def get_model(self):
        return Geoname
Mapping:
modelresult: {
    _boost: {
        name: "boost",
        null_value: 1
    },
    properties: {
        django_ct: {
            type: "string"
        },
        django_id: {
            type: "string"
        },
        name_auto: {
            type: "string",
            store: true,
            term_vector: "with_positions_offsets",
            analyzer: "edgengram_analyzer"
        }
    }
}
Thank you.
Answer 1:
After a deep look into the code I found that the search generated by haystack was:
{
    "query": {
        "filtered": {
            "filter": {
                "fquery": {
                    "query": {
                        "query_string": {
                            "query": "django_ct:(csi.geoname)"
                        }
                    },
                    "_cache": false
                }
            },
            "query": {
                "query_string": {
                    "query": "name_auto:(mid)",
                    "default_operator": "or",
                    "default_field": "text",
                    "auto_generate_phrase_queries": true,
                    "analyze_wildcard": true
                }
            }
        }
    },
    "from": 0,
    "size": 6
}
Running this query in elasticsearch gave me the same 6 objects that haystack was showing. But if I added the following to the "query_string"
    "analyzer": "standard"
it worked as desired. So the idea was to be able to set up a different search analyzer for the field.
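To make the difference concrete, here is a minimal pure-Python sketch of why analyzing the query with the edge n-gram analyzer matches everything. It is only a simplified simulation, not Elasticsearch's real analysis chain: the helper names (`lowercase_tokenize`, `edge_ngrams`, `search`) are hypothetical, and the gram sizes are taken from the `haystack_edgengram` filter (min_gram 2, max_gram 15).

```python
import re

MIN_GRAM, MAX_GRAM = 2, 15  # values of the haystack_edgengram filter

NAMES = ["Midfield", "Midland City", "Midway", "Minor", "Minturn", "Miami Beach"]


def lowercase_tokenize(text):
    """Rough stand-in for the `lowercase` tokenizer: split on non-letters, lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def edge_ngrams(text):
    """Return the front edge n-grams of every token in `text`."""
    grams = set()
    for token in lowercase_tokenize(text):
        for size in range(MIN_GRAM, min(MAX_GRAM, len(token)) + 1):
            grams.add(token[:size])
    return grams


# Index time: every name is stored as its set of edge n-grams.
INDEX = {name: edge_ngrams(name) for name in NAMES}


def search(query_terms):
    """OR semantics: a name matches if it contains any of the query terms."""
    terms = set(query_terms)
    return [name for name in NAMES if INDEX[name] & terms]


# Query analyzed with the same edge n-gram analyzer: "mid" becomes {"mi", "mid"},
# and "mi" alone is enough to match every name.
same_analyzer = search(edge_ngrams("mid"))

# Query analyzed without n-grams: only the whole term "mid" is looked up.
plain_analyzer = search(lowercase_tokenize("mid"))

print(same_analyzer)
print(plain_analyzer)
```

With the example data, the first search returns all six names and the second returns only the three Mid* names, which mirrors the difference between the haystack query and the query with "analyzer": "standard".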
Based on the link in @user954994's answer and the explanation in this post, what I finally did to make it work was:
- I created a custom elasticsearch backend, adding a new custom analyzer based on the standard one.
- I added a custom EdgeNgramField that makes it possible to set one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).
So, my new settings are:
ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                },
                "suggest_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "asciifolding"
                    ]
                },
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}
My new custom build_schema method looks as follows:
def build_schema(self, fields):
    content_field_name, mapping = super(ConfigurableElasticBackend,
                                        self).build_schema(fields)

    for field_name, field_class in fields.items():
        field_mapping = mapping[field_class.index_fieldname]

        index_analyzer = getattr(field_class, 'index_analyzer', None)
        search_analyzer = getattr(field_class, 'search_analyzer', None)
        field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

        if field_mapping['type'] == 'string' and field_class.indexed:
            if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                field_mapping['analyzer'] = field_analyzer

        if index_analyzer and search_analyzer:
            field_mapping['index_analyzer'] = index_analyzer
            field_mapping['search_analyzer'] = search_analyzer
            del field_mapping['analyzer']

        mapping.update({field_class.index_fieldname: field_mapping})
    return (content_field_name, mapping)
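The effect of that override on a single field can be tried in isolation, without haystack or a running cluster. The sketch below is only an illustration under that assumption: `apply_split_analyzers` is a hypothetical helper, not part of haystack, and the input dict is the original name_auto mapping from the question.

```python
def apply_split_analyzers(field_mapping, index_analyzer=None, search_analyzer=None):
    """Return a copy of the mapping with separate index/search analyzers.

    Mirrors the last step of the custom build_schema: the single "analyzer"
    key is replaced only when both analyzers are given.
    """
    updated = dict(field_mapping)
    if index_analyzer and search_analyzer:
        updated["index_analyzer"] = index_analyzer
        updated["search_analyzer"] = search_analyzer
        updated.pop("analyzer", None)
    return updated


# Mapping that haystack generates by default for an EdgeNgramField.
before = {
    "type": "string",
    "store": True,
    "term_vector": "with_positions_offsets",
    "analyzer": "edgengram_analyzer",
}

after = apply_split_analyzers(before, "edgengram_analyzer", "suggest_analyzer")
print(after)
```

The single analyzer key is gone from the result, so queries on the field are no longer analyzed into edge n-grams.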
After rebuilding the index, my mapping looks like this:
modelresult: {
    _boost: {
        name: "boost",
        null_value: 1
    },
    properties: {
        django_ct: {
            type: "string"
        },
        django_id: {
            type: "string"
        },
        name_auto: {
            type: "string",
            store: true,
            term_vector: "with_positions_offsets",
            index_analyzer: "edgengram_analyzer",
            search_analyzer: "suggest_analyzer"
        }
    }
}
Now everything is working as expected!
UPDATE:
Below you'll find the code to clarify this part:
- I created a custom elasticsearch backend, adding a new custom analyzer based on the standard one.
- I added a custom EdgeNgramField that makes it possible to set one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).
In my app's search_backends.py:
from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine
from haystack.fields import EdgeNgramField as BaseEdgeNgramField


# Custom Backend
class CustomElasticBackend(ElasticsearchSearchBackend):
    DEFAULT_ANALYZER = None

    def __init__(self, connection_alias, **connection_options):
        super(CustomElasticBackend, self).__init__(
            connection_alias, **connection_options)
        # Allow the index settings and default analyzer to come from Django settings.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        self.DEFAULT_ANALYZER = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', "snowball")
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)

    def build_schema(self, fields):
        content_field_name, mapping = super(CustomElasticBackend,
                                            self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            index_analyzer = getattr(field_class, 'index_analyzer', None)
            search_analyzer = getattr(field_class, 'search_analyzer', None)
            field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = field_analyzer

            # Replace the single analyzer with separate index and search analyzers.
            if index_analyzer and search_analyzer:
                field_mapping['index_analyzer'] = index_analyzer
                field_mapping['search_analyzer'] = search_analyzer
                del field_mapping['analyzer']

            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)


class CustomElasticSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticBackend


# Custom field
class CustomFieldMixin(object):

    def __init__(self, **kwargs):
        # Pop the analyzer options before the base field sees the kwargs.
        self.analyzer = kwargs.pop('analyzer', None)
        self.index_analyzer = kwargs.pop('index_analyzer', None)
        self.search_analyzer = kwargs.pop('search_analyzer', None)
        super(CustomFieldMixin, self).__init__(**kwargs)


class CustomEdgeNgramField(CustomFieldMixin, BaseEdgeNgramField):
    pass
My index definition goes something like:
class MyIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = CustomEdgeNgramField(model_attr='name', index_analyzer="edgengram_analyzer", search_analyzer="suggest_analyzer")
And finally, the settings of course use the custom backend in the haystack connection definition:
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'my_app.search_backends.CustomElasticSearchEngine',
        'URL': 'http://localhost:9200',
        'INDEX_NAME': 'index'
    },
}
Answer 2:
Well, I had a similar problem, and my strategy was to write a custom backend.
The complete instructions can be found at:
http://www.wellfireinteractive.com/blog/custom-haystack-elasticsearch-backend/
It works for me!
Hope this helps.
Source: https://stackoverflow.com/questions/20430449/django-haystack-edgengramfield-given-different-results-than-elasticsearch