Question
I need to read the whole collection from MongoDB (the collection name is "test") in Python. I tried:
self.__connection__ = Connection('localhost',27017)
dbh = self.__connection__['test_db']
collection = dbh['test']
How can I read through the collection in chunks of 1000 (to avoid memory overflow, because the collection can be very large)?
Answer 1:
I agree with Remon, but you mention batches of 1000, which his answer doesn't really cover. You can set a batch size on the cursor:
cursor.batch_size(1000)
You can also skip records, e.g.:
cursor.skip(4000)
Is this what you're looking for? This is effectively a pagination pattern. However, if you're just trying to avoid memory exhaustion then you don't really need to set batch size or skip.
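For reference, here is a minimal PyMongo sketch of the batch_size / skip pattern described above; the process call is just a placeholder for whatever you do with each document:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['test_db']['test']

# Ask the driver to fetch 1000 documents per server round trip;
# you still iterate them one at a time.
cursor = collection.find().batch_size(1000)
for doc in cursor:
    process(doc)  # placeholder for your own handling

# Pagination-style access, if you really want fixed pages:
page = collection.find().skip(4000).limit(1000)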
Answer 2:
Use cursors. Cursors have a "batchSize" variable that controls how many documents are actually sent to the client per batch after doing a query. You don't have to touch this setting, though, since the default is fine and the complexity of invoking "getmore" commands is hidden from you in most drivers. I'm not familiar with pymongo, but it works like this:
cursor = db.col.find() // Get everything!
while (cursor.hasNext()) {
    /* This will use the documents already fetched, and if it runs out of
       documents in its local batch it will fetch another X of them from
       the server (where X is batchSize). */
    document = cursor.next();
    // Do your magic here
}
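In PyMongo the same idea is just a plain for loop over the cursor; the driver issues the getMore commands behind the scenes. A minimal sketch (db is the database handle, do_magic a placeholder):

cursor = db.col.find()  # Get everything!
for document in cursor:
    # PyMongo refills its local buffer from the server automatically
    # (batch_size documents at a time) whenever it runs out.
    do_magic(document)  # placeholder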
Answer 3:
Inspired by @Rafael Valero's answer, fixing the last-chunk bug in his code and making it more general, I created a generator function to iterate through a Mongo collection with a query and projection:
def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
    chunks = range(start_from, collection.find(query).count(), int(chunksize))
    num_chunks = len(chunks)
    for i in range(1, num_chunks + 1):
        if i < num_chunks:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks[i]]
        else:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks.stop]
So, for example, you first create an iterator like this:
mess_chunk_iter = iterate_by_chunks(db_local.conversation_messages, 200, 0, query={}, projection=projection)
and then iterate it by chunks:
chunk_n = 0
total_docs = 0
for docs in mess_chunk_iter:
    chunk_n = chunk_n + 1
    chunk_len = 0
    for d in docs:
        chunk_len = chunk_len + 1
        total_docs = total_docs + 1
    print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
print("total docs iterated: ", total_docs)
chunk #: 1, chunk_len: 400
chunk #: 2, chunk_len: 400
chunk #: 3, chunk_len: 400
chunk #: 4, chunk_len: 400
chunk #: 5, chunk_len: 400
chunk #: 6, chunk_len: 400
chunk #: 7, chunk_len: 281
total docs iterated: 2681
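One caveat for newer driver versions: the generator above relies on Cursor.count(), which was removed in PyMongo 4.x. On those versions the chunk boundaries can be computed with Collection.count_documents() instead, for example:

# PyMongo 4.x: Cursor.count() is gone, use count_documents() on the collection
total = collection.count_documents(query)
chunks = range(start_from, total, int(chunksize))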
Answer 4:
Here is a generic solution to iterate over any iterator or generator by batch:
def _as_batch(cursor, batch_size=50):
    # Iterate over something (pymongo cursor, generator, ...) by batch.
    # Note: the last batch may contain less than batch_size elements.
    batch = []
    try:
        while True:
            for _ in range(batch_size):
                batch.append(next(cursor))
            yield batch
            batch = []
    except StopIteration:
        if len(batch):
            yield batch
This will work as long as the cursor defines a __next__ method (i.e. we can use next(cursor)). Thus, we can use it on a raw cursor or on transformed records.
Examples
Simple usage:
for batch in _as_batch(db['coll_name'].find()):
    # do stuff with the batch (a list of up to batch_size documents)
More complex usage (useful for bulk updates for example):
def update_func(doc):
    # dummy transform function
    doc['y'] = doc['x'] + 1
    return doc

query = (update_func(doc) for doc in db['coll_name'].find())
for batch in _as_batch(query):
    # do stuff
Reimplementation of the count() function:
sum(map(len, _as_batch(db['coll_name'].find())))
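As a side note, on Python 3.12+ the standard library offers itertools.batched, which does the same batching over any iterator, so the helper could also be replaced by a thin wrapper (a sketch assuming Python 3.12):

from itertools import batched  # Python 3.12+

for batch in batched(db['coll_name'].find(), 50):
    # batch is a tuple of up to 50 documents
    ...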
Answer 5:
To create the initial connection (here in Python 2, using PyMongo):
host = 'localhost'
port = 27017
db_name = 'test_db'
collection_name = 'test'
To connect using MongoClient:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient(host=host, port=port)

# Make a query to the specific DB and Collection
dbh = client[db_name]
collection = dbh[collection_name]
From here on, the actual answer: I want to read in chunks (in this case, of size 1000).
chunksize = 1000
For example, we can decide how many chunks of size chunksize we want.
# Some variables to create the chunks
skips_variable = range(0, dbh[collection_name].find(query).count(), int(chunksize))
if len(skips_variable) <= 1:
    skips_variable = [0, len(skips_variable)]
Then we can retrieve each chunk.
for i in range(1, len(skips_variable)):
    # Expand the cursor and retrieve data
    data_from_chunk = dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]]
Where query in this case is query = {}.
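If the point of each chunk is to build a DataFrame (as mentioned below), each cursor slice can be materialized like this. This is just an illustrative sketch; the pandas usage is my assumption, not part of the original answer:

import pandas as pd

for i in range(1, len(skips_variable)):
    cursor_chunk = dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]]
    df_chunk = pd.DataFrame(list(cursor_chunk))  # one DataFrame per chunk
    # process df_chunk, then let it go out of scope to keep memory bounded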
Elsewhere I use similar ideas to create dataframes from MongoDB, and something similar to write to MongoDB in chunks.
I hope it helps.
Source: https://stackoverflow.com/questions/9786736/how-to-read-through-collection-in-chunks-by-1000