Optimization: Dumping JSON from a Streaming API to Mongo

猫巷女王i 2021-02-15 17:33

Background: I have a Python module set up to grab JSON objects from a streaming API and store them (bulk insert of 25 at a time) in MongoDB using p

2 answers
  • 2021-02-15 17:47

    I got rid of the StringIO library. Because the WRITEFUNCTION callback, handle_data in this case, is invoked for every line, the JSON can be loaded directly. Sometimes, however, the buffer contains two JSON objects, so the fallback splits it on '\r\n'. I'm sorry that I can't post the curl command I use, as it contains our credentials. As I said, though, this is a general issue that applies to any streaming API.

    
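    # Requires `import json` at module level.
    def handle_data(self, buf):
        try:
            # Common case: the buffer holds exactly one JSON object.
            self.tweet = json.loads(buf)
        except ValueError:
            # Several objects arrived together; parse each non-empty line.
            self.data_list = buf.split('\r\n')
            for data in self.data_list:
                if data.strip():
                    self.tweet_list.append(json.loads(data))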
    
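As a rough illustration of the splitting step above, the parsing can be pulled into a standalone helper that treats every buffer the same way, whether it holds one object or several. This is only a sketch: the name parse_chunk is hypothetical, and it assumes objects are separated by '\r\n' and never split across two callbacks.

```python
import json


def parse_chunk(buf):
    # Hypothetical helper: return every JSON object found in one buffer.
    # Assumes CR LF separates objects, as in the snippet above.
    docs = []
    for line in buf.split('\r\n'):
        line = line.strip()
        if not line:
            continue  # skip blank lines left by trailing separators
        docs.append(json.loads(line))
    return docs


print(parse_chunk('{"id": 1}\r\n{"id": 2}\r\n'))
```

Skipping blank lines matters because a buffer that ends in '\r\n' produces an empty final element after the split, and json.loads raises on an empty string.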
  • 2021-02-15 17:57

    Originally there was a bug in your code.

                    if self.chunk_count % 50 == 0:
                        self.raw_tweets.insert(self.tweet_list)
                        self.chunk_count = 0
    

    You reset chunk_count but not tweet_list, so the second time through you try to insert 100 items (50 new ones plus the 50 that were already sent to the DB the time before). You've since fixed this, but you still see a difference in performance.
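To make the fix concrete, here is a minimal sketch of a batcher that clears its list after every bulk insert. The TweetBatcher and FakeCollection classes are hypothetical; FakeCollection only stands in for the real collection so the example runs without a database, and the batch size of 50 is taken from the snippet above.

```python
class FakeCollection(object):
    """Stand-in for a MongoDB collection; records each bulk insert."""

    def __init__(self):
        self.batches = []

    def insert(self, docs):
        self.batches.append(list(docs))


class TweetBatcher(object):
    """Hypothetical helper that buffers documents and inserts in batches."""

    def __init__(self, collection, batch_size=50):
        self.collection = collection
        self.batch_size = batch_size
        self.tweet_list = []

    def add(self, tweet):
        self.tweet_list.append(tweet)
        if len(self.tweet_list) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.tweet_list:
            self.collection.insert(self.tweet_list)
            self.tweet_list = []  # reset, so nothing is sent twice


coll = FakeCollection()
batcher = TweetBatcher(coll, batch_size=50)
for i in range(120):
    batcher.add({"id": i})
batcher.flush()  # send the final partial batch
print([len(b) for b in coll.batches])  # prints [50, 50, 20]
```

Using the list length as the trigger removes the separate counter, so there is only one piece of state to reset.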

    The whole batch-size question turns out to be a red herring. I tried loading a large JSON file via Python and via mongoimport, and Python was always faster (even in safe mode; see below).

    Taking a closer look at your code, I realized the real issue is that the streaming API is actually handing you data in chunks. You are expected to just take those chunks and put them into the database (that's what mongoimport is doing). The extra work your Python code does to split up the stream, add it to a list, and then periodically send batches to Mongo is probably the difference between what I see and what you see.

    Try this snippet for your handle_data()

    def handle_data(self, data):
        # Assumes `import json` and a StringIO import at module level.
        try:
            tweets = json.load(StringIO(data))
        except Exception as ex:
            print("Exception occurred: %s" % str(ex))
            return  # nothing was parsed, so skip the insert
        try:
            self.raw_tweets.insert(tweets)
        except Exception as ex:
            print("Exception occurred: %s" % str(ex))
    

    One thing to note is that your Python inserts are not running in "safe mode". You should change that by adding the argument safe=True to your insert call. You will then get an exception on any insert that fails, and your try/except will print the error, exposing the problem.

    It doesn't cost much in performance either: I'm currently running a test, and after about five minutes the sizes of the two collections are 14120 and 14113.
