Efficient way to retrieve all _ids in ElasticSearch

Backend · unresolved · 11 answers · 1793 views
轮回少年 · asked 2021-01-31 01:31

What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indices has around 20,000 documents.

11 answers
  • 2021-01-31 01:46

    Another option

    curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
    

    will return _index, _type, _id and _score.
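
    If you need the ids as plain text on the command line, piping the JSON through jq also works. This is just a sketch: it assumes jq is installed and that size is raised above the default of 10 hits.

    curl -s 'http://localhost:9200/index/type/_search?size=10000&fields=' \
        | jq -r '.hits.hits[]._id'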

  • 2021-01-31 01:47

    You can also do it in Python, which gives you a proper list:

    import elasticsearch
    es = elasticsearch.Elasticsearch()
    
    res = es.search(
        index=your_index, 
        body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
    
    ids = [d['_id'] for d in res['hits']['hits']]
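
    Note that a single search request is capped by index.max_result_window (10,000 hits by default on newer Elasticsearch versions), so a one-shot query with a large size like this only works on small indices; for anything bigger, use the scroll/scan approaches in the other answers.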
    
  • 2021-01-31 01:50

    It is better to use scroll and scan to retrieve the result list, so Elasticsearch doesn't have to rank and sort the results.

    With the elasticsearch-dsl Python library this can be accomplished by:

    from elasticsearch import Elasticsearch
    from elasticsearch_dsl import Search
    
    es = Elasticsearch()
    s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
    
    s = s.fields([])  # only get ids, otherwise `fields` takes a list of field names
    ids = [h.meta.id for h in s.scan()]
    

    Console log:

    GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
    GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
    GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
    GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
    GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
    ...
    

    Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, etc., which you can adjust); scan disables sorting. The scan helper function returns a Python generator which can be safely iterated through.
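
    In newer elasticsearch-dsl releases Search.fields() has been removed (and doc_type is gone along with mapping types), so a rough equivalent, sketched here without claiming it matches every version, is to suppress the source instead:

    from elasticsearch import Elasticsearch
    from elasticsearch_dsl import Search

    es = Elasticsearch()
    s = Search(using=es, index=ES_INDEX).source(False)  # fetch no _source, only hit metadata
    ids = [h.meta.id for h in s.scan()]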

  • 2021-01-31 01:53

    Elaborating on the two answers by @Robert-Lujo and @Aleck-Landgraf (someone with the permissions is welcome to move this to a comment): if you do not want to print the ids but collect them into a list from the returned generator, here is what I use:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(hosts=[YOUR_ES_HOST])
    # match_all scan over the index, like the other answers so far
    scan = helpers.scan(es, query={"query": {"match_all": {}}}, scroll='1m', index=INDEX_NAME)

    IDs = [doc['_id'] for doc in scan]
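
    Since only the ids are needed, you can also ask Elasticsearch not to ship the document bodies at all by disabling _source in the query body; a small variation of the sketch above, keeping the same placeholder names:

    scan = helpers.scan(es,
                        query={"query": {"match_all": {}}, "_source": False},
                        scroll='1m',
                        index=INDEX_NAME)
    IDs = [doc['_id'] for doc in scan]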
    
  • 2021-01-31 01:59

    I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.

    The helpers module can be used with sliced scroll, which allows multi-threaded execution. In my case I also have a high-cardinality field (acquired_at) to slice on. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.

    Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.

    import gzip
    from concurrent import futures

    from elasticsearch import helpers

    # note below I have es, index, and cluster_name variables already set

    max_workers = 14
    scroll_slice_ids = list(range(0, max_workers))

    def get_doc_ids(scroll_slice_id):
        count = 0
        with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
            # "max" must equal the total number of slices; each slice "id" must be < "max"
            query = {"sort": ["_doc"],
                     "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)},
                     "_source": False}
            scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
            for doc in scan:
                count += 1
                results_file.write(doc['_id'] + '\n')
                results_file.flush()

        return count

    if __name__ == '__main__':
        print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
    

    If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
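
    Because gzip streams can be concatenated as-is, the per-slice dumps can later be merged into a single file, e.g. cat /tmp/doc_ids_*.txt.gz > all_doc_ids.txt.gz (the filenames simply follow the pattern used above).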

  • 2021-01-31 02:01

    Edit: please also read @Aleck Landgraf's answer.

    Do you just want the Elasticsearch-internal _id field, or an id field from within your documents?

    For the former, try

    curl http://localhost:9200/index/type/_search?pretty=true -d '
    { 
        "query" : { 
            "match_all" : {} 
        },
        "stored_fields": []
    }
    '
    

    Note (2017 update): the post originally used "fields": [], but the parameter has since been renamed; stored_fields is the new name.
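
    Also note that Elasticsearch 6.0 and later require an explicit content type on requests with a body, and mapping types were removed in 7.x, so on a current cluster the request looks roughly like this (index name is the placeholder from above):

    curl -H 'Content-Type: application/json' 'http://localhost:9200/index/_search?pretty=true' -d '
    {
        "query" : {
            "match_all" : {}
        },
        "stored_fields": []
    }
    '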

    The result will contain only the "metadata" of your documents:

    {
      "took" : 7,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 4,
        "max_score" : 1.0,
        "hits" : [ {
          "_index" : "index",
          "_type" : "type",
          "_id" : "36",
          "_score" : 1.0
        }, {
          "_index" : "index",
          "_type" : "type",
          "_id" : "38",
          "_score" : 1.0
        }, {
          "_index" : "index",
          "_type" : "type",
          "_id" : "39",
          "_score" : 1.0
        }, {
          "_index" : "index",
          "_type" : "type",
          "_id" : "34",
          "_score" : 1.0
        } ]
      }
    }
    

    For the latter, if you want to include a field from your document, simply add it to the fields array:

    curl http://localhost:9200/index/type/_search?pretty=true -d '
    { 
        "query" : { 
            "match_all" : {} 
        },
        "fields": ["document_field_to_be_returned"]
    }
    '
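
    If the field is not a stored field, the same effect can be had with _source filtering; a sketch using the placeholder field name from above:

    curl -H 'Content-Type: application/json' 'http://localhost:9200/index/type/_search?pretty=true' -d '
    {
        "query" : {
            "match_all" : {}
        },
        "_source": ["document_field_to_be_returned"]
    }
    '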
    