I have a simple Python script for indexing a CSV file containing 1 million rows:
import csv
from pyes import *
reader = csv.reader(open('data.csv', 'rb'))
You can adjust the bulk size when you create the ES instance. Something like this:
conn = ES('127.0.0.1:9200', timeout=20.0, bulk_size=100)
The default bulk size is 400; that is, pyes sends the bulk contents automatically once 400 documents have been queued. If you want to send the bulk before bulk_size is reached (e.g. before exiting), you can call conn.flush_bulk(forced=True).
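A minimal sketch of how that could fit the CSV loop (the index name, type name, and field names here are placeholders, since the original script is cut off):

import csv
from pyes import ES

conn = ES('127.0.0.1:9200', timeout=20.0, bulk_size=1000)
reader = csv.reader(open('data.csv', 'rb'))

for i, row in enumerate(reader):
    # hypothetical mapping from a CSV row to a document
    doc = {'first_name': row[0], 'last_name': row[1]}
    # bulk=True queues the document; pyes flushes automatically once bulk_size documents are queued
    conn.index(doc, 'namesdb', 'person', i, bulk=True)

# send whatever is still queued before exiting
conn.flush_bulk(forced=True)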
I'm not sure that refreshing the index manually at every Nth document would be the best bet. Elasticsearch does it automatically by default, once every second. What you can do is increase that interval. Something like this:
curl -XPUT localhost:9200/namesdb/_settings -d '{
  "index" : {
    "refresh_interval" : "3s"
  }
}'
Or you can refresh manually, like Dragan suggested, but in that case it might make sense to disable Elasticsearch's auto-refresh by setting the interval to "-1". And you don't need to refresh every X documents: you can refresh once, after you've finished inserting all of them.
More details here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html
Please note that refreshing is quite expensive, and in my experience you're better off with either:
- letting Elasticsearch do the refreshes in the background, or
- disabling refresh altogether and re-enabling it after you've finished inserting the whole batch of documents.
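For example, the disable/re-enable sequence might look roughly like this with the elasticsearch-py client (the same client used in the answer further down); a sketch, assuming the namesdb index from the curl example above:

from elasticsearch import Elasticsearch

es = Elasticsearch('127.0.0.1:9200')

# turn off automatic refreshes for the duration of the bulk load
es.indices.put_settings(index='namesdb', body={'index': {'refresh_interval': '-1'}})

# ... bulk-index all documents here ...

# restore a normal interval and force a single refresh at the end
es.indices.put_settings(index='namesdb', body={'index': {'refresh_interval': '1s'}})
es.indices.refresh(index='namesdb')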
On every Nth count, run:
es.refresh()
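In pyes terms that might look roughly like this (a sketch; the 10,000 interval and the document fields are placeholders, and conn is the pyes connection from the bulk_size answer above):

for i, row in enumerate(reader):
    conn.index({'name': row[0]}, 'namesdb', 'person', i, bulk=True)
    if i and i % 10000 == 0:
        # force a refresh every 10,000 documents
        conn.refresh()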
For future visitors, elasticsearch-py supports bulk operations in a single call. Note that the _op_type field in each doc determines which operation occurs (it defaults to index if not present).
E.g.
import elasticsearch as ES
import elasticsearch.helpers as ESH

es = ES.Elasticsearch()
docs = [doc1, doc2, doc3]
# stats_only=True returns the number of successful and failed actions
n_success, n_fail = ESH.bulk(es, docs, index='test_index', doc_type='test_doc',
                             stats_only=True)
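For instance, action dicts that spell out _op_type explicitly might look like this (a sketch; the ids, field values, and the delete example are made up):

docs = [
    # indexes a new document (same as the default when _op_type is omitted)
    {'_op_type': 'index', '_index': 'test_index', '_type': 'test_doc',
     '_id': 1, 'name': 'Alice'},
    # deletes an existing document by id
    {'_op_type': 'delete', '_index': 'test_index', '_type': 'test_doc', '_id': 2},
]
n_success, n_fail = ESH.bulk(es, docs, stats_only=True)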