I am trying to re-index my Elasticsearch setup. I am currently looking at the Elasticsearch documentation and an example using the Python API, but I'm a little bit confused about how it all works.
Hi, you can use the scroll API to go through all the documents in the most efficient way. The scroll_id identifies a search session stored on the server for your specific scroll request, so you need to provide the scroll_id with each request to obtain the next batch of items.
The bulk API is for indexing documents more efficiently. When copying an index you need both, but they are not otherwise related.
I do have some Java code that might help you to get a better idea about how it works.
public void reIndex() {
    logger.info("Start creating a new index based on the old index.");
    SearchResponse searchResponse = client.prepareSearch(MUSIC_INDEX)
            .setQuery(matchAllQuery())
            .setSearchType(SearchType.SCAN)
            .setScroll(createScrollTimeoutValue())
            .setSize(SCROLL_SIZE)
            .execute().actionGet();

    BulkProcessor bulkProcessor = BulkProcessor.builder(client, createLoggingBulkProcessorListener())
            .setBulkActions(BULK_ACTIONS_THRESHOLD)
            .setConcurrentRequests(BULK_CONCURRENT_REQUESTS)
            .setFlushInterval(createFlushIntervalTime())
            .build();

    while (true) {
        searchResponse = client.prepareSearchScroll(searchResponse.getScrollId())
                .setScroll(createScrollTimeoutValue())
                .execute().actionGet();
        if (searchResponse.getHits().getHits().length == 0) {
            logger.info("Closing the bulk processor");
            bulkProcessor.close();
            break; // Break condition: no hits are returned
        }
        for (SearchHit hit : searchResponse.getHits()) {
            IndexRequest request = new IndexRequest(MUSIC_INDEX_NEW, hit.type(), hit.id());
            request.source(hit.sourceRef());
            bulkProcessor.add(request);
        }
    }
}
For anyone who runs into this problem, you can use the following helper from the Python client to reindex:
https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex
It saves you from having to scroll and search to get all the data and from calling the bulk API yourself to put the data into the new index.
The best way to reindex is to use Elasticsearch's built-in Reindex API, as it is well supported and resilient to known issues.
The Elasticsearch Reindex API uses scroll and bulk indexing in batches, and allows for scripted transformation of data. In Python, a similar routine could be developed:
#!/usr/local/bin/python
from elasticsearch import Elasticsearch
from elasticsearch import helpers

src = Elasticsearch(['localhost:9202'])
dst = Elasticsearch(['localhost:9200'])

body = {"query": {"match_all": {}}}
source_index = 'src-index'
target_index = 'dst-index'
scroll_time = '60s'
batch_size = 500  # size must be an integer, not a string

def transform(hits):
    # Point each hit at the target index before bulk-indexing it.
    for h in hits:
        h['_index'] = target_index
        yield h

# The initial search opens the scroll session and returns the first batch.
rs = src.search(index=source_index,
                scroll=scroll_time,
                size=batch_size,
                body=body)
helpers.bulk(dst, transform(rs['hits']['hits']), chunk_size=batch_size)

# Keep pulling batches from the scroll session until it is exhausted.
while True:
    scroll_id = rs['_scroll_id']
    rs = src.scroll(scroll_id=scroll_id, scroll=scroll_time)
    if len(rs['hits']['hits']) > 0:
        helpers.bulk(dst, transform(rs['hits']['hits']), chunk_size=batch_size)
    else:
        break
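The built-in Reindex API recommended above replaces this whole scroll-and-bulk loop with a single server-side call. A minimal sketch using elasticsearch-py's `reindex` client method (host and index names are placeholders; the cluster must support the `_reindex` endpoint, available since Elasticsearch 2.3):

```python
def build_reindex_body(source_index, dest_index, query=None):
    """Request body for the _reindex endpoint; an optional query limits the source docs."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    if query is not None:
        body["source"]["query"] = query
    return body

if __name__ == "__main__":
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host
    # The cluster performs the scroll + bulk loop itself, server-side.
    es.reindex(body=build_reindex_body("src-index", "dst-index"),
               wait_for_completion=True)
```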
Here is an example of reindexing to another Elasticsearch node using elasticsearch-py:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es_src = Elasticsearch(["host"])
es_des = Elasticsearch(["host"])
helpers.reindex(es_src, 'src_index_name', 'des_index_name', target_client=es_des)
You can also reindex the result of a query to a different index. Here is how to do it:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es_src = Elasticsearch(["host"])
es_des = Elasticsearch(["host"])
# Only documents matching the query are copied to the destination index.
body = {"query": {"term": {"year": "2004"}}}
helpers.reindex(es_src, 'src_index_name', 'des_index_name', target_client=es_des, query=body)