Question
We have a DataFrame of almost 100,000 records that I want to upsert into a MongoDB collection.
My sample code is shown below.
To keep it simple, the code below generates the data in a for loop and appends it to lstValues.
In the actual application, we receive these data from external CSV files, which we load into a pandas DataFrame.
We receive almost 98,000 records from these external CSV files. Also, our original MongoDB collection already contains almost 1,00,00,00 records and it keeps growing.
Below I have used just a few fields such as StudId, Name, Grade, Address, Phone and Std, but in the real application we have almost 200 such fields.
You can see I am using the bulk_write function to batch-update the collection, and I am using a batch size of 1,000 records. Still, the code below takes almost 20 minutes or more to upsert these records, while an external application does the same thing in about 4 minutes. The goal here is to prove Python's capability to perform this type of batch operation with MongoDB. Am I doing something wrong in the code below, or is this the best Python can do with such a large dataset?
Please advise: how can I improve the performance of the code below, or is there an alternative way to achieve this within Python?
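For reference, the CSV-to-DataFrame step in the real application might look roughly like the sketch below; the file name and the to_dict('records') conversion are illustrative assumptions, not our actual loading code.

import pandas as pd

# Hypothetical ingestion step: load an external CSV into a DataFrame and
# convert each row into a plain dict so it can be wrapped in an upsert
# operation. 'students.csv' is a placeholder file name.
df = pd.read_csv('students.csv')     # ~98,000 rows, ~200 columns in practice
lstValues = df.to_dict('records')    # one dict per row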
from pymongo import MongoClient, ReplaceOne, InsertOne, DeleteOne
import pandas as pd
import time
import uuid

# Generate sample data; in the real application this comes from CSV files
# loaded into a pandas DataFrame.
lstValues = []
for i in range(100000):
    template = {'StudId': str(uuid.uuid1()), 'Name': 'xyz' + str(i), 'Grade': 'A',
                'Address': 'abc', 'Phone': '0123', 'Std': 'M1'}
    lstValues.append(template)

bulklist = []
db = MongoClient(['server1:27017', 'server2:27018'], replicaset='rs_development',
                 username='appadmin', password='abcxyz', authSource='admin',
                 authMechanism='SCRAM-SHA-1')['TestDB']

starttime = time.time()
for m in lstValues:
    # Upsert keyed on the nested Grade/Name fields; flush every 1,000 operations.
    # Note: any operations left over after the loop would still need a final
    # bulk_write (here the total is an exact multiple of 1,000, so none remain).
    bulklist.append(ReplaceOne(
        {"STUDENT.Grade": m['Grade'], "STUDENT.Name": m['Name']},
        {'STUDENT': m},
        upsert=True
    ))
    if len(bulklist) == 1000:
        db.AnalyticsTestBRS.bulk_write(bulklist, ordered=False)
        bulklist = []

print("Time taken mongo upsert : {0} seconds".format(time.time() - starttime))
Source: https://stackoverflow.com/questions/59931259/pymongo-bulk-write-perform-very-slow