Background:
I have a Python module set up to grab JSON objects from a streaming API and store them (bulk insert of 25 at a time) in MongoDB using p
Got rid of the StringIO library. As the WRITEFUNCTION callback handle_data, in this case, gets invoked for every line, just load the JSON directly. Sometimes, however, there could be two JSON objects contained in data. I am sorry, I can't post the curl command that I use as it contains our credentials. But, as I said, this is a general issue applicable to any streaming API.
    def handle_data(self, buf):
        try:
            self.tweet = json.loads(buf)
        except Exception as json_ex:
            self.data_list = buf.split('\r\n')
            for data in self.data_list:
                self.tweet_list.append(json.loads(data))
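The "two JSON objects in one buffer" case can also be handled without going through an exception first. The following is only a minimal sketch, assuming (as the split above does) that objects are delimited by '\r\n'; parse_chunk is a hypothetical helper name, not part of the original module.

```python
import json


def parse_chunk(buf):
    # Hypothetical helper: decode every JSON object found in one callback buffer.
    # Assumes objects are separated by CRLF, as in the split shown above.
    objects = []
    for line in buf.split('\r\n'):
        line = line.strip()
        if not line:
            # Skip blank keep-alive lines and the empty string after a trailing delimiter.
            continue
        objects.append(json.loads(line))
    return objects
```

A buffer holding one object and a buffer holding several are then treated the same way, and empty lines no longer reach json.loads. An object that is cut in half by a chunk boundary would still raise ValueError here, so this sketch does not cover that case.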
Originally there was a bug in your code.
    if self.chunk_count % 50 == 0:
        self.raw_tweets.insert(self.tweet_list)
        self.chunk_count = 0
You reset chunk_count but you don't reset tweet_list. So the second time through you try to insert 100 items (50 new ones plus the 50 that were already sent to the DB the time before). You've fixed this, but you still see a difference in performance.
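One way to keep the counter and the list from drifting apart is to use the length of the list as the counter and to swap in a fresh list on every flush. This is just an illustrative sketch: BatchBuffer and insert_fn are hypothetical names, with insert_fn standing in for whatever performs the bulk insert (for example self.raw_tweets.insert).

```python
class BatchBuffer(object):
    """Collects documents and hands them off in fixed-size batches."""

    def __init__(self, insert_fn, batch_size=50):
        self.insert_fn = insert_fn    # hypothetical stand-in for the bulk insert call
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc):
        self.pending.append(doc)
        # The list length is the only counter, so there is nothing else to reset.
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # Swap in an empty list before inserting, so no document is sent twice.
            batch, self.pending = self.pending, []
            self.insert_fn(batch)
```

Calling flush() once more when the stream ends sends whatever is left over. Note that in this sketch a batch is dropped if insert_fn raises, since the list has already been replaced by then.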
The whole batch size thing turns out to be a red herring. I tried taking a large file of JSON and loading it via Python vs. loading it via mongoimport, and Python was always faster (even in safe mode - see below).
Taking a closer look at your code, I realized the problem is that the streaming API is actually handing you data in chunks. You are expected to just take those chunks and put them into the database (that's what mongoimport is doing). The extra work your Python is doing to split up the stream, add it to a list and then periodically send batches to Mongo is probably the difference between what I see and what you see.
Try this snippet for your handle_data():
    # Needs: import json, and from StringIO import StringIO (io.StringIO on Python 3)
    def handle_data(self, data):
        try:
            string_buffer = StringIO(data)
            tweets = json.load(string_buffer)
        except Exception as ex:
            print("Exception occurred: %s" % str(ex))
            return  # nothing was parsed, so there is nothing to insert
        try:
            self.raw_tweets.insert(tweets)
        except Exception as ex:
            print("Exception occurred: %s" % str(ex))
One thing to note is that your Python inserts are not running in "safe mode" - you should change that by adding the argument safe=True to your insert statement. You will then get an exception on any insert that fails, and your try/except will print the error, exposing the problem.
It doesn't cost much in performance either - I'm currently running a test and after about five minutes, the sizes of the two collections are 14120 and 14113.