How to use the Bulk API to store keywords in ES using Python

时光说笑 2020-11-27 12:12

I have to store some messages in Elasticsearch integrated with my Python program. Currently, this is how I try to store a message:

d={\"message\":\"this is message\"}
         


        
4 answers
  • 2020-11-27 12:21

    (The other approaches mentioned in this thread build a Python list for the ES update, which is not a good solution, especially when you need to add millions of records to ES.)

    A better approach is to use Python generators -- process gigabytes of data without running out of memory or compromising much on speed.

    Below is an example snippet from a practical use case: adding data from an nginx log file to ES for analysis.

    from elasticsearch import Elasticsearch, helpers

    def decode_nginx_log(_nginx_fd):
        for each_line in _nginx_fd:
            # Filter out the below from each log line
            remote_addr = ...
            timestamp   = ...
            ...
    
            # Index for elasticsearch. Typically timestamp.
            idx = ...
    
            es_fields_keys = ('remote_addr', 'timestamp', 'url', 'status')
            es_fields_vals = (remote_addr, timestamp, url, status)
    
            # We return a dict holding values from each line
            es_nginx_d = dict(zip(es_fields_keys, es_fields_vals))
    
            # Return the row on each iteration
            yield idx, es_nginx_d   # <- Note the usage of 'yield'
    
    def es_add_bulk(nginx_file):
        # The nginx file can be gzip or just text. Open it appropriately
        # and assign the resulting file object to _nginx_fd.
        ...
    
        es = Elasticsearch(hosts = [{'host': 'localhost', 'port': 9200}])
    
        # NOTE the (...) round brackets. This is for a generator.
        k = ({
                "_index": "nginx",
                "_type" : "logs",
                "_id"   : idx,
                "_source": es_nginx_d,
             } for idx, es_nginx_d in decode_nginx_log(_nginx_fd))
    
        helpers.bulk(es, k)
    
    # Now, just run it.
    es_add_bulk('./nginx.1.log.gz')
    

    This skeleton demonstrates the usage of generators. You can use it even on a machine with limited resources, and you can keep expanding on it to tailor it to your needs.
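
    To tie this back to the question, here is a minimal, self-contained sketch of the same generator pattern applied to simple message dicts. The file name, index name and connection settings are assumptions for illustration; depending on your ES version you may also need a "_type" field, as in the snippet above:

    from elasticsearch import Elasticsearch, helpers

    def generate_messages(path):
        # Read one message per line and yield one bulk action per message.
        with open(path) as fd:
            for i, line in enumerate(fd):
                yield {
                    "_index": "messages",                  # assumed index name
                    "_id": i,
                    "_source": {"message": line.strip()},
                }

    es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])
    helpers.bulk(es, generate_messages('./messages.txt'))  # assumed file name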

    See the Python Elasticsearch client reference for more details.

  • 2020-11-27 12:30
    from datetime import datetime
    
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers
    
    es = Elasticsearch()
    
    actions = [
      {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j,
        "_source": {
            "any":"data" + str(j),
            "timestamp": datetime.now()}
      }
      for j in range(0, 10)
    ]
    
    helpers.bulk(es, actions)
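
    A small addition (not in the original answer): helpers.bulk() returns a tuple with the number of successfully executed actions and a list of errors, which makes a quick sanity check easy:

    success, errors = helpers.bulk(es, actions)
    print("indexed %d documents, %d errors" % (success, len(errors)))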
    
  • 2020-11-27 12:30

    There are two options which I can think of at the moment:

    1. Define index name and document type with each entity:

    es_client = Elasticsearch()
    
    body = []
    for entry in entries:
        body.append({'index': {'_index': index, '_type': 'doc', '_id': entry['id']}})
        body.append(entry)
    
    response = es_client.bulk(body=body)
    

    2. Provide the default index and document type with the method:

    es_client = Elasticsearch()
    
    body = []
    for entry in entries:
        body.append({'index': {'_id': entry['id']}})
        body.append(entry)
    
    response = es_client.bulk(index='my_index', doc_type='doc', body=body)
    

    Works with:

    ES version: 6.4.0

    ES python lib: 6.3.1
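
    One caveat (my addition, not part of the original answer): unlike helpers.bulk(), the raw Elasticsearch.bulk() call does not raise on per-document failures, so it is worth inspecting the response, for example:

    response = es_client.bulk(index='my_index', doc_type='doc', body=body)
    if response['errors']:
        failed = [item for item in response['items'] if item['index'].get('error')]
        print('%d documents failed to index' % len(failed))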

  • 2020-11-27 12:33

    Although @justinachen's code helped me get started with py-elasticsearch, after looking at the source code let me suggest a simple improvement:

    from datetime import datetime

    from elasticsearch import Elasticsearch
    from elasticsearch import helpers

    es = Elasticsearch()
    j = 0
    actions = []
    while j <= 10:
        action = {
            "_index": "tickets-index",
            "_type": "tickets",
            "_id": j,
            "_source": {
                "any":"data" + str(j),
                "timestamp": datetime.now()
                }
            }
        actions.append(action)
        j += 1
    
    helpers.bulk(es, actions)
    

    helpers.bulk() already does the segmentation for you, and by segmentation I mean the chunks sent to the server each time. If you want to reduce the number of documents per chunk, do: helpers.bulk(es, actions, chunk_size=100)

    Some handy info to get started:

    helpers.bulk() is just a wrapper around helpers.streaming_bulk, but the former accepts a list, which makes it handy.

    helpers.streaming_bulk is based on Elasticsearch.bulk(), so you do not need to worry about which one to choose.

    So in most cases, helpers.bulk() should be all you need.
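
    If you ever want the streaming behaviour directly, here is a rough sketch (reusing the es and actions from above); helpers.streaming_bulk() is a generator that yields an (ok, item) tuple per document:

    for ok, item in helpers.streaming_bulk(es, actions, chunk_size=100, raise_on_error=False):
        if not ok:
            print("failed to index a document:", item)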
