Elasticsearch scrolling using the Python client

泄露秘密 posted on 2019-12-05 22:31:53

Using Python requests:

import requests
import json

elastic_url = 'http://localhost:9200/my_index/_search?scroll=1m'
scroll_api_url = 'http://localhost:9200/_search/scroll'
headers = {'Content-Type': 'application/json'}

payload = {
    "size": 100,
    "sort": ["_doc"],
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

r1 = requests.request(
    "POST",
    elastic_url,
    data=json.dumps(payload),
    headers=headers
)

# first batch data
try:
    res_json = r1.json()
    data = res_json['hits']['hits']
    _scroll_id = res_json['_scroll_id']
except KeyError:
    data = []
    _scroll_id = None
    print('Error: Elastic Search: %s' % str(r1.json()))
while data:
    print(data)
    # scroll to get next batch data
    scroll_payload = json.dumps({
        'scroll': '1m',
        'scroll_id': _scroll_id
    })
    scroll_res = requests.request(
        "POST", scroll_api_url,
        data=scroll_payload,
        headers=headers
    )
    try:
        res_json = scroll_res.json()
        data = res_json['hits']['hits']
        _scroll_id = res_json['_scroll_id']
    except KeyError:
        data = []
        _scroll_id = None
        err_msg = 'Error: Elastic Search Scroll: %s'
        print(err_msg % str(scroll_res.json()))

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll

In fact, the code in the question has a bug in it. To use the scroll feature correctly, you are supposed to pass the new scroll_id returned by each call into the next call to scroll(), not reuse the first one:

Important

The initial search request and each subsequent scroll request returns a new scroll_id — only the most recent scroll_id should be used.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

It's working because Elasticsearch does not always change the scroll_id between calls; for smaller result sets it can keep returning the same scroll_id that was originally returned for some time. This discussion from last year is between two other users seeing the same behavior, with the same scroll_id being returned for a while:

http://elasticsearch-users.115913.n3.nabble.com/Distributing-query-results-using-scrolling-td4036726.html

So while your code works for a smaller result set, it is not correct: you need to capture the scroll_id returned by each new call to scroll() and use that for the next call.
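A minimal sketch of that loop with the Python client might look like the following. The function name `iter_scroll_hits` and its parameters are hypothetical, and it assumes a client object whose `search()` accepts `index`, `scroll`, `size` and `body` and whose `scroll()` accepts `scroll_id` and `scroll`, as in the 7.x-style `elasticsearch` client; other client versions may need different arguments.

```python
def iter_scroll_hits(client, index, query, scroll="1m", size=100):
    """Yield every hit for `query`, one batch at a time.

    `client` is assumed to be an Elasticsearch client instance
    (hypothetical helper, not part of the library).
    """
    response = client.search(
        index=index,
        scroll=scroll,
        size=size,
        body={"query": query, "sort": ["_doc"]},
    )
    scroll_id = response["_scroll_id"]

    while True:
        hits = response["hits"]["hits"]
        if not hits:
            # An empty batch means the scroll is exhausted.
            break
        for hit in hits:
            yield hit
        # Request the next batch with the most recent scroll_id ...
        response = client.scroll(scroll_id=scroll_id, scroll=scroll)
        # ... and replace the stored id with the one just returned,
        # even if it happens to be identical to the previous one.
        scroll_id = response["_scroll_id"]
```

It could be used roughly as `for hit in iter_scroll_hits(Elasticsearch("http://localhost:9200"), "my_index", {"match": {"title": "elasticsearch"}}): ...`, where the URL and index name are the ones from the requests example above. The `elasticsearch.helpers.scan` helper that ships with the client wraps the same pattern.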
