How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?


I'm new to Elasticsearch and I read here https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecat

1 Answer

野的像风

2020-12-02 12:25
You need to make sure you have created your ingest pipeline with:
```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
```
Then index the document with a PUT (not a POST) to your index, passing the pipeline you've created:
```
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```
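In your example, it should be something like the following. The attachment processor reads a JSON document, so the PDF cannot be sent as a raw `application/pdf` body; here `test.json` is a hypothetical file containing `{"data": "<base64 of test.pdf>"}` and `my_type` is a placeholder type name:
```
curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d @/cygdrive/c/test/test.json
```
Remember that the PDF content must be base64 encoded before it goes into the `data` field.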
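The base64 step can be sketched in Python as follows. This is only a minimal sketch: the helper names `build_attachment_body` and `index_pdf` are hypothetical, and the URL assumes a local node on `localhost:9200` with the placeholder index, type and id used above.

```python
import base64
import json
import urllib.request


def build_attachment_body(pdf_bytes, field="data"):
    """Return the JSON request body with the file content base64 encoded.

    `field` must match the "field" setting of the attachment processor.
    """
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({field: encoded})


def index_pdf(path, url="http://localhost:9200/my_index/my_type/my_id?pipeline=attachment"):
    """Read a PDF from disk and PUT it through the ingest pipeline.

    The URL is an assumption (local node, placeholder names); adjust it
    to your own cluster and index.
    """
    with open(path, "rb") as handle:
        body = build_attachment_body(handle.read())
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `index_pdf("/cygdrive/c/test/test.pdf")` would then send the encoded file; `build_attachment_body` can be checked on its own without a running cluster.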

Hope it will help you.

Edit 1 Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed.
```
./bin/elasticsearch-plugin install ingest-attachment
```
Edit 3

Before you create your ingest processor (attachment), create your index and its mapping with the fields you will use. Make sure the mapping includes the data field (the same name as the "field" setting in your attachment processor), so that ingest can process it and fill it with your PDF content.

I set the indexed_chars option to -1 in the attachment processor. The default limit is 100000 extracted characters, and -1 removes the limit, so large PDF files are indexed in full.
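The pipeline body can also be built programmatically, which makes the `indexed_chars` choice explicit. This is a minimal sketch; `build_attachment_pipeline` is a hypothetical helper, and the resulting JSON is what would be sent to `PUT _ingest/pipeline/attachment`.

```python
import json


def build_attachment_pipeline(field="data", indexed_chars=-1):
    """Build the ingest pipeline definition used in this answer.

    indexed_chars=-1 means no limit on extracted characters; a positive
    value caps how many characters are extracted per document.
    """
    return {
        "description": "Extract attachment information",
        "processors": [
            {"attachment": {"field": field, "indexed_chars": indexed_chars}}
        ],
    }


# Print the JSON body for the PUT _ingest/pipeline/attachment request.
print(json.dumps(build_attachment_pipeline(), indent=2))
```

Passing a positive value such as `indexed_chars=500000` instead keeps a cap while still raising it above the default.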

Edit 4

The mapping should be something like this:
```
PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}
```
In this case, I use the brazilian analyzer, but you can remove it or use your own.