Efficient way to retrieve all _ids in ElasticSearch

轮回少年 2021-01-31 01:31

What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indices has around 20,000 documents.

11 Answers
  •  天涯浪人
    2021-01-31 01:59

    I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.

    The helpers module of the Python client can be used with sliced scroll, which allows multi-threaded execution. In my case, I also have a high-cardinality field (acquired_at) to slice on. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.

    Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.
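
    The size estimate mentioned above can be sketched roughly as follows. This is only an illustration: the function name estimate_dump_bytes and the sample ids are made up, and the result is the uncompressed size of a one-id-per-line dump, so the gzip files should come out smaller.

```python
def estimate_dump_bytes(sample_ids, total_docs):
    """Estimate the uncompressed size of a one-id-per-line dump.

    sample_ids: a small sample of real doc ids (strings).
    total_docs: the number of documents expected in the index.
    """
    if not sample_ids:
        raise ValueError("need at least one sample id")
    # +1 accounts for the newline written after each id
    total = sum(len(doc_id.encode("utf-8")) + 1 for doc_id in sample_ids)
    avg_bytes = total / len(sample_ids)
    return int(avg_bytes * total_docs)


# Hypothetical 20-character ids, used only to show the call
sample = ["AVx1Zb0kq3fD9m2LpQ7e", "AVx1Zb0kq3fD9m2LpQ7f"]
print(estimate_dump_bytes(sample, 300_000_000))
```

    A larger sample gives a better average if your ids vary in length.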

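    # Note: es, index, and cluster_name are assumed to be set already.
    import gzip
    from concurrent import futures

    from elasticsearch import helpers

    max_workers = 14
    scroll_slice_ids = list(range(max_workers))

    def get_doc_ids(scroll_slice_id):
        """Scroll one slice and write its doc ids to a gzip file; return the count."""
        count = 0
        with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
            query = {
                "sort": ["_doc"],  # cheapest sort order for scrolling
                # "max" must equal the number of slices actually read (ids 0..max-1),
                # otherwise the documents in the unread slice are silently skipped
                "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)},
                "_source": False,  # only the _id is needed
            }
            scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
            for doc in scan:
                count += 1
                results_file.write(doc['_id'] + '\n')
                results_file.flush()  # lets you watch progress while the dump runs

        return count

    if __name__ == '__main__':
        print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            # list() forces the lazy map so every slice finishes and errors surface here
            doc_counts = list(executor.map(get_doc_ids, scroll_slice_ids))
        print('dumped %i doc ids' % sum(doc_counts))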

    If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
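
    If you would rather count from Python once the dump has finished, a minimal sketch follows. The helper name count_ids is hypothetical, and it assumes the files are complete: reading a gzip file that is still being written can raise EOFError.

```python
import glob
import gzip


def count_ids(pattern):
    """Return {path: number of lines} for every gzip file matching pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # 'rt' decompresses on the fly and yields one id per line
        with gzip.open(path, "rt") as f:
            counts[path] = sum(1 for _ in f)
    return counts
```

    Calling count_ids('/tmp/doc_ids_*.txt.gz') and summing the values gives a total to compare against the index's document count.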
