What is the fastest way to get all _ids of a certain index from Elasticsearch? Is it possible by using a simple query? One of my indices has around 20,000 documents.
I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers module can be used with sliced scroll, which allows multi-threaded execution. In my case, I also have a high-cardinality field (acquired_at) to slice on. You'll see I set max_workers to 14, but you may want to vary this depending on your machine.
Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.
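As a rough way to do that estimate, here is a minimal sketch. The function name estimate_dump_size and the sample ids are made up for illustration; it assumes one id per line, as in the dump files, and guesses the gzip ratio by compressing the sample itself.

```python
import gzip


def estimate_dump_size(sample_ids, total_docs):
    """Return (uncompressed_bytes, compressed_bytes) estimates for total_docs ids.

    sample_ids is a hypothetical list of ids pulled from your own index.
    """
    if not sample_ids:
        raise ValueError("need at least one sample id")
    # Each id is written as one line, so the trailing newline counts too.
    payload = "".join(doc_id + "\n" for doc_id in sample_ids).encode("utf-8")
    bytes_per_id = len(payload) / len(sample_ids)
    # Compression ratio observed on the sample; gzip header overhead makes
    # this pessimistic for very small samples.
    ratio = len(gzip.compress(payload)) / len(payload)
    uncompressed = int(bytes_per_id * total_docs)
    return uncompressed, int(uncompressed * ratio)


if __name__ == "__main__":
    # Made-up ids purely for illustration.
    sample = ["AVx3kQ9zT1mB7cD2eF4g", "AVx3kQ9zT1mB7cD2eF4h", "AVx3kQ9zT1mB7cD2eF4i"]
    raw, packed = estimate_dump_size(sample, 300_000_000)
    print("uncompressed: %.1f GB, gzip estimate: %.1f GB" % (raw / 1e9, packed / 1e9))
```

A sample of a few thousand real ids should give a more realistic ratio than a handful, since the fixed gzip overhead stops dominating.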
# note: es, index, and cluster_name are assumed to be set already
import gzip
from concurrent import futures

from elasticsearch import helpers

max_workers = 14
scroll_slice_ids = list(range(max_workers))


def get_doc_ids(scroll_slice_id):
    """Scroll one slice and write its doc ids to a gzipped text file."""
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" is the total number of slices; slice ids run from 0 to max - 1
        query = {"sort": ["_doc"], "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)}, "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write(doc['_id'] + '\n')
            # flush so the file can be inspected while the dump is running
            results_file.flush()
    return count


if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() consumes the lazy map so worker exceptions surface here
        doc_counts = list(executor.map(get_doc_ids, scroll_slice_ids))
    print('dumped %i doc ids' % sum(doc_counts))
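
To make the slicing itself easier to check, here is a small sketch that only builds the per-slice query bodies without touching the cluster. The helper name build_slice_queries is hypothetical. It relies on how sliced scroll is defined: slice ids run from 0 to max - 1, and max is the total number of slices, so every slice must be scrolled or some documents are never returned.

```python
def build_slice_queries(num_slices, field=None):
    """Build one scroll query body per slice (hypothetical helper).

    num_slices is the total number of slices; field is an optional
    field to slice on (the default slicing uses _id when omitted).
    """
    if num_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    queries = []
    for slice_id in range(num_slices):
        slice_spec = {"id": slice_id, "max": num_slices}
        if field is not None:
            slice_spec["field"] = field
        # _doc order is the cheapest sort for a full scan; skip _source
        # because only the ids are needed.
        queries.append({"sort": ["_doc"], "slice": slice_spec, "_source": False})
    return queries


if __name__ == "__main__":
    for body in build_slice_queries(14, field="acquired_at"):
        print(body["slice"])
```

Each body can then be passed as the query argument to helpers.scan in its own worker.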
If you want to follow along with how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
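
If pigz isn't installed, the same count can be done from Python with the standard library. This is a minimal sketch; the function name count_doc_ids is made up, and the glob pattern assumes the /tmp/doc_ids_N.txt.gz naming used above.

```python
import glob
import gzip


def count_doc_ids(pattern):
    """Return {path: number of ids} for every gzipped dump file matching pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # One id per line, so counting lines counts ids.
        with gzip.open(path, "rt") as dump_file:
            counts[path] = sum(1 for _ in dump_file)
    return counts


if __name__ == "__main__":
    counts = count_doc_ids("/tmp/doc_ids_*.txt.gz")
    for path, n in counts.items():
        print("%s: %i" % (path, n))
    print("total: %i" % sum(counts.values()))
```

Reading a file that is still being written may fail at the end of the stream, so this is best run once the dump has finished.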