问题
I am comparing performance of the two dbs, plus csv - data is 1 million row by 5 column float, bulk insert into sqlite/mongodb/csv, done in python.
import csv
import sqlite3
import pymongo
N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
for i in range(N):
writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
with pymongo.MongoClient() as client:
start_w = time()
client['warmup']['warmup'].insert_many(docs)
start_w = time()
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
I am still new to this, but is it expected that mongodb could be 4x slower sqlite, and 2x slower vs csv, in similar scenarios? It is based on mongodb v4.4 with WiredTiger engine, and python3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, like the above example, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as @D. SM suggested. It helps, but overall it is still the slowest of the pack. What I meant is, total Wall time 23.9s, (warmup 14.2 + real insert 9.6). What's interesting is that CPU times total 18.1s, meaning 23.9-18.1 = 5.8s was spent inside .insert_many()
method waiting for TCP/IO? That sounds a lot.
In any case, even if I use warmup and disregard the IO wait time, the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls! Apparently the csv writer does much better job in buffering/caching. Did I get something seriously wrong here?
Another question somewhat related: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly, start from batch one, for each million insert, the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
回答1:
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes, in particular there can be only 1000 commands per bulk write. If you are sending 1 million inserts you could be looking at 2000 splits per bulk write which all involve data copies. Test inserting 1000 documents at a time vs other batch sizes.
Working test:
import csv
import sqlite3
import pymongo, random, time
N, M = 1000000, 5
docs = [{'_id':1,'b':2,'c':3,'d':4,'e':5}]*N
i=1
for i in range(len(docs)):
docs[i]=dict(docs[i])
docs[i]['_id'] = i
data=[tuple(doc.values())for doc in docs]
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
start = time.time()
for i in range(N):
writer.writerow(data[i])
end = time.time()
print('%f' %( end-start))
con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' %( end-start))
with pymongo.MongoClient() as client:
client['warmup']['warmup'].delete_many({})
client['test']['test'].delete_many({})
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time.time()
coll.insert_many(docs)
end = time.time()
print('%f' %( end-start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
- The client has to serialize the data to send to the server
- The server has to deserialize that data
- The server then has to parse the request and figure out what to do
- The server needs to write the data in a scalable/concurrent way (sqlite simply errors with concurrent write errors from what I remember of it)
- The server needs to compose a response back to the client, serialize that response, write it to the network
- Client needs to read the response, deserialize, check it for success
5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
Compared to what - an in-process database that does not do any network i/o?
the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to file the writes are buffered by python's standard library before they are even sent to the kernel - you have to use flush()
after each line to actually produce a million writes. In a database the writes are similarly performed on a page by page basis and not for individual documents.
来源:https://stackoverflow.com/questions/65605360/mongodb-4x-slower-than-sqlite-2x-slower-than-csv