How to read a large JSON file in pandas?

悲哀的现实 2021-02-08 10:31

My code is: data_review = pd.read_json('review.json'). The review data looks as follows:

{
    // string, 22 character unique review id
         


        
2 Answers
  •  栀梦 (original poster)
    2021-02-08 11:21

    Perhaps the file you are reading contains multiple JSON objects rather than a single JSON object or array, which is what json.load(json_file) and pd.read_json('review.json') expect. With their default arguments, these methods read a file that holds exactly one JSON document.

    From what I have seen of the Yelp dataset, your file probably contains something like:

    {"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
    {"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
    ....    
    ....
    
    and so on.
    

    Hence, it is important to realize that this is not a single JSON document but multiple JSON objects in one file, one per line.
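To illustrate the difference, here is a minimal sketch using only the standard library. The two-record `sample` string is a made-up stand-in for the real file, not actual Yelp data. Parsing the whole text as one document fails, while parsing it line by line works:

```python
import json

# Hypothetical two-record sample in the same one-object-per-line layout.
sample = (
    '{"review_id": "a1", "stars": 5}\n'
    '{"review_id": "b2", "stars": 3}\n'
)

# Parsing the whole text as a single JSON document fails,
# because a second object follows the first one.
try:
    json.loads(sample)
    whole_text_ok = True
except json.JSONDecodeError:
    whole_text_ok = False

# Parsing each non-empty line on its own succeeds.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_text_ok)
print(records)
```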

    To read this data into a pandas DataFrame, the following solution should work:

    import json
    import pandas as pd
    
    with open('review.json', encoding='utf-8') as json_file:
        # Each line holds one JSON object, so parse the file line by line.
        # This step may take at least 8-10 minutes for 4-5 million rows.
        data = [json.loads(line) for line in json_file]
    
    # Build the DataFrame from the list of parsed dicts.
    data_review = pd.DataFrame(data)
    

    Assuming the data is fairly large, I think your machine will take a considerable amount of time to load it into a DataFrame.
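As an alternative that is not part of the answer above, pandas can read this one-object-per-line layout directly with `pd.read_json(..., lines=True)`, and `chunksize` lets you process the file in pieces instead of holding it all in memory. The sketch below writes a small made-up file (`review_sample.json`, hypothetical) so it can run on its own. The chunk size and the `stars >= 4` filter are illustrative assumptions only:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample file standing in for review.json.
rows = [
    {"review_id": "a1", "stars": 5, "text": "good"},
    {"review_id": "b2", "stars": 3, "text": "ok"},
    {"review_id": "c3", "stars": 4, "text": "fine"},
]
path = os.path.join(tempfile.mkdtemp(), "review_sample.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# lines=True tells pandas to treat each line as a separate JSON object.
data_review = pd.read_json(path, lines=True)

# With chunksize, read_json returns an iterator of DataFrames,
# so only one chunk has to be in memory at a time.
kept = []
for chunk in pd.read_json(path, lines=True, chunksize=2):
    kept.append(chunk[chunk["stars"] >= 4])  # keep only the rows you need
filtered = pd.concat(kept, ignore_index=True)

print(len(data_review))
print(filtered["review_id"].tolist())
```

For a real multi-gigabyte file, a much larger chunk size (for example, tens of thousands of rows) would be more reasonable. Filtering or selecting columns inside the loop is what keeps memory use down.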
