Question
Given I have specified the html_strip char filter in my custom analyzer
When I index a document with HTML content
Then I expect the HTML to be stripped out of the indexed content
And on retrieval the doc returned from the index should not contain HTML
ACTUAL: The indexed doc contained HTML. The retrieved doc contained HTML.
I have tried specifying the analyzer as index_analyzer, as one would expect, and out of desperation a few others: search_analyzer and analyzer. None seem to have any effect on the doc being indexed or retrieved.
Test doc indexing against the html_strip analyzed fields:
REQUEST: Example POST document with HTML content
POST /html_poc_v2/html_poc_type/02
{
"description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
}
Expected: indexed data to have been passed through the HTML analyzer. Actual: data is indexed with HTML.
RESPONSE
{
"_index": "html_poc_v2", "_type": "html_poc_type", "_id": "02", ...
"_source": {
"description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
}
}
Settings and Doc Mapping
PUT /html_poc_v2
{
"settings": {
"analysis": {
"analyzer": {
"my_html_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
]
}
}
},
"mappings": {
"html_poc_type": {
"properties": {
"body": {
"type": "string",
"analyzer": "my_html_analyzer"
},
"description": {
"type": "string",
"analyzer": "my_html_analyzer"
},
"title": {
"type": "string",
"search_analyser": "my_html_analyzer"
},
"urlTitle": {
"type": "string"
}
}
}
}
}
}
Test to prove the custom analyzer works:
REQUEST
GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
{<p>Some déjà vu <a href="http://somedomain.com>">website</a>}
Response
{
"tokens": [
{
"token": "Some",… "position": 1
},
{
"token": "déjà",… "position": 2
},
{
"token": "vu",… "position": 3
},
{
"token": "website",… "position": 4
}
]
}
Under the hood
Going under the hood with an inline script further suggests that my HTML analyzer must have been skipped:
REQUEST
GET /html_poc_v2/html_poc_type/_search?pretty=true
{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "title"
}
}
}
}
RESPONSE
{ …
"hits": { ..
"hits": [
{
"_index": "html_poc_v2",
"_type": "html_poc_type",
…
"fields": {
"terms": [
[
"a",
"agrave",
"d",
"eacute",
"href",
"http",
"j",
"p",
"some",
"somedomain.com",
"title",
"vu",
"website"
]
]
}
}
]
}
}
Similar to this question: Why HTML tag is searchable even if it was filtered in elastic search
I have also read this doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
ES version : 1.7.2
Please help.
Answer 1:
It looks like you expect the _source field in the response to return the analyzed document. This is incorrect: the analyzer only affects the terms written to the index, not _source.
From the documentation:
The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
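To make the distinction concrete, here is a minimal toy model in Python. It is not Elasticsearch code: `ToyIndex` and `toy_analyze` are hypothetical names, and the regex-based stripping is only a rough stand-in for the html_strip char filter plus a word tokenizer. It sketches the idea that the stored source is kept verbatim while only the analyzed terms go into the searchable index.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]*>")


def toy_analyze(text):
    """Rough stand-in for html_strip + a word tokenizer (not Lucene's logic)."""
    stripped = unescape(TAG_RE.sub(" ", text))
    return re.findall(r"\w+", stripped)


class ToyIndex:
    """Keeps the original document and the analyzed terms separately."""

    def __init__(self):
        self.sources = {}   # doc_id -> original document, untouched
        self.postings = {}  # (field, term) -> set of doc_ids

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # plays the role of _source
        for field, value in doc.items():
            for term in toy_analyze(value):
                self.postings.setdefault((field, term), set()).add(doc_id)

    def get(self, doc_id):
        # Fetching returns what was sent at index time, HTML included.
        return self.sources[doc_id]

    def search(self, field, term):
        # Searching consults only the analyzed terms.
        return sorted(self.postings.get((field, term), set()))


idx = ToyIndex()
idx.index("02", {"body": 'Body <p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>'})
```

In this model `idx.get("02")` still contains the `<p>` and `<a>` tags, while `idx.search("body", "p")` finds nothing and `idx.search("body", "website")` finds the document, which mirrors the behavior described in the quoted documentation.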
Ideally, in a case like this where you want to format the source data for presentation purposes, that should be done on the client side.
That said, one way to achieve it for this use case is with script fields and the keyword tokenizer, as follows:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_html_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
]
},
"parsed_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [
"html_strip"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"body": {
"type": "string",
"analyzer": "my_html_analyzer",
"fields": {
"parsed": {
"type": "string",
"analyzer": "parsed_analyzer"
}
}
}
}
}
}
}
PUT test/test/1
{
"body" : "Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}
GET test/_search
{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "body.parsed"
}
}
}
}
Result:
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1,
"fields": {
"terms": [
"Title \n Some déjà vu website this is inline \n "
]
}
}
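The result above has a single element because the keyword tokenizer emits the whole (already stripped) input as one term, whereas a word-splitting tokenizer produces one term per word. A small Python sketch of that difference follows. The `toy_*` functions are hypothetical illustrations, and the regex stripping is an assumption that does not reproduce the real html_strip filter's whitespace and newline handling.

```python
import re
from html import unescape


def toy_html_strip(text):
    # Assumption: crude regex stand-in for the html_strip char filter.
    return unescape(re.sub(r"<[^>]*>", " ", text))


def toy_standard_tokenize(text):
    # One term per word, loosely like a word-splitting tokenizer.
    return re.findall(r"\w+", text)


def toy_keyword_tokenize(text):
    # The entire input becomes a single term.
    return [text]


html = "Title <p>Some <a href='http://somedomain.com'>website</a></p>"
stripped = toy_html_strip(html)

word_terms = toy_standard_tokenize(stripped)   # several terms
whole_terms = toy_keyword_tokenize(stripped)   # exactly one term
```

Because the char filter runs before the tokenizer, both term lists are free of markup; only the granularity differs.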
Note: I believe the above is a bad idea, since stripping the HTML tags can easily be achieved on the client side, where you have much more control over formatting than with a workaround like this. More importantly, it may be more performant to do it on the client side.
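As a sketch of the client-side approach, the following Python uses only the standard library's `html.parser` to strip tags and decode entities before a document is sent for indexing (or after it is fetched). The helper names `strip_html` and `clean_document` are hypothetical, and the whitespace collapsing is one formatting choice among many, not a requirement.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(value):
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any buffered trailing text
    return parser.text()


def clean_document(doc, html_fields):
    """Return a copy of doc with HTML removed from the named string fields."""
    return {
        key: strip_html(val) if key in html_fields and isinstance(val, str) else val
        for key, val in doc.items()
    }


doc = {"body": "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a></p> "}
cleaned = clean_document(doc, {"body"})
```

The cleaned dictionary can then be indexed as usual, so that both _source and the indexed terms are free of markup without relying on the analyzer.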
Source: https://stackoverflow.com/questions/37351900/elasticsearch-strip-html-tags-before-indexing-docs-with-html-strip-filter-not