Loading Large Twitter JSON Data (7GB+) into Python


Question


I've set up a public stream via AWS to collect tweets and now want to do some preliminary analysis. All my data was stored in an S3 bucket (in 5MB files).

I downloaded everything and merged all the files into one. Each tweet is stored as a standard JSON object as per Twitter specifications.

Basically, the consolidated file contains multiple JSON objects. I added opening and closing square brackets ( [] ) to make it look like a list of dictionaries for when it gets read into Python. So the structure is kinda like this (I'm not sure if I can just post Twitter data here):

[{"created_at":"Mon Sep 19 23:58:50 +000 2016", "id":<num>, "id_str":"<num>","text":"<tweet message>", etc.}, 
{same as above},
{same as above}]

After deleting the very first tweet, I put everything into www.jsonlint.com and confirmed that it is a valid JSON data structure.

Now, I'm trying to load this data into Python and hoping to do some basic counts of different terms in tweets (e.g. how many times is @HillaryClinton mentioned in the text of a tweet, etc.).
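To make the goal concrete, here is a rough sketch of the counting I have in mind, assuming I can somehow get the tweets as an iterable of dictionaries (the "tweets" variable is exactly the part I'm stuck on):

from collections import Counter

mention_counts = Counter()
for tweet in tweets:  # 'tweets' would be any iterable of tweet dicts
    for word in tweet.get('text', '').split():
        if word.startswith('@'):
            mention_counts[word] += 1

print(mention_counts['@HillaryClinton'])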

Previously with smaller datasets, I was able to get away with code like this:

import json
import csv
import io

data_json = open('fulldata.txt', 'r', encoding='utf-8')
data_python = json.load(data_json)  # parses the entire file into memory at once

I then wrote the data for respective fields into a CSV file and performed my analyses that way. This worked for a 2GB file.

Now that I have a 7GB file, I am noticing that if I use this method, Python throws an error on the "json.load(data_json)" line saying "OSError: [Errno 22] Invalid argument".

I'm not sure why this is happening, but I anticipate that it might be because it's trying to load the entire file into memory at once. Is this correct?

So I tried to use ijson, which apparently lets you parse through the JSON file incrementally. I tried to write the following code:

import ijson
f = open('fulldata.txt', 'r', encoding='utf-8')
content = ijson.items(f, 'item')
for item in content:
    <do stuff here>

With this implementation, I get an error on the line "for item in content" saying "ijson.backends.python.UnexpectedSymbol: unexpected symbol '\u201c' at 1".
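For reference, this is roughly how I expected the streaming loop to behave. ('\u201c' is a left curly double quote, so I suspect some straight quotes in my file were converted to smart quotes somewhere along the way; I also believe ijson prefers a binary file object.)

import ijson

# ijson yields one array element (one tweet dict) at a time, so the
# whole 7GB file never has to fit in memory at once.
with open('fulldata.txt', 'rb') as f:
    for tweet in ijson.items(f, 'item'):
        print(tweet['text'])  # tweet is a plain dict here
        break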

I also tried to go through the data file line by line, treating it as JSON Lines format. So, assuming each line was a JSON object, I wrote:

raw_tweets = []
with open('full_data.txt', 'r', encoding='utf-8') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))  # note: dumps, not loads -- this turns out to matter below
print(len(raw_tweets))  # this worked: got ~2 million entries, as expected!

But here, each entry in the list was a string and not a dictionary, which made it really hard to parse the data I needed out of it. Is there a way to modify this last code to make it work as I need? But even then, wouldn't loading the whole dataset into a list still be a problem for future analyses, given memory constraints?
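For what it's worth, this is the streaming variant I think I should be aiming for: a sketch assuming one JSON object per line, with brackets and trailing commas tolerated inline (json.loads instead of json.dumps being the key difference):

import json

mention_count = 0
with open('full_data.txt', 'r', encoding='utf-8') as full_file:
    for line in full_file:
        line = line.strip().rstrip(',')  # tolerate a trailing comma
        if not line or line in ('[', ']'):
            continue
        tweet = json.loads(line)  # a dict, not a string: loads, not dumps
        if '@HillaryClinton' in tweet.get('text', ''):
            mention_count += 1
print(mention_count)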

I'm a little stuck on the best way to proceed with this. I really want to do it in Python because I'm trying to learn how to use Python tools for these kinds of analyses.

Does anyone have any experience with this? Am I being really stupid or misunderstanding something really basic?

EDIT:

So, I first went to www.jsonlint.com and pasted my entire dataset and found that after removing the first tweet, it was in valid JSON format. So for now I just excluded that one file.

I basically have a dataset in the format mentioned above ([{json1}, {json2}], where each {...} represents a tweet).

Now that I have confirmed that it is valid JSON, my goal is to get it into Python with each JSON object represented as a dictionary (so I can easily manipulate those fields). Can someone correct my thought process here if it's inefficient?

To do so, I did:

raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))
# This successfully wrote each line of my file into a list. Confirmed by checking the length, as described previously.
# Now I want to write this out to a CSV file.
csv_out = io.open("parsed_data.csv", mode='w', encoding='ISO-8859-1')
fields = u'created_at,text,screen_name,followers,friends,rt,fav'
csv_out.write(fields)  # write the column headers
csv_out.write(u'\n')
# Now, iterate through the list. Get each JSON object as a dictionary and pull out the relevant information.
for tweet in raw_tweets:
    # each "tweet" is '{json#},\n'
    current_tweet = json.loads(tweet)  # right now each element is just a string in {} form, not a dictionary; converting with json.loads() should give me a dictionary, right?
    row = [current_tweet.get('created_at'),
           '"' + current_tweet.get('text').replace('"', '""') + '"',
           current_tweet.get('user').get('screen_name')]  # ...and I continue this for all relevant headers

Problem is, that last line where I call current_tweet.get isn't working, because it keeps saying AttributeError: 'str' object has no attribute 'get', so I'm not sure why json.loads() isn't giving me a dictionary...
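Here is a minimal reproduction of what seems to be happening: the round-trip through json.dumps() when I built raw_tweets means json.loads() just hands back the original string instead of parsing it:

import json

line = '{"text": "hello"}\n'        # one raw line from the file

wrapped = json.dumps(line)          # serializes the line into a JSON *string*
print(type(json.loads(wrapped)))    # <class 'str'>  <- my problem above

parsed = json.loads(line)           # parse the raw line directly instead
print(type(parsed))                 # <class 'dict'>
print(parsed.get('text'))           # hello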

EDIT#2

A user recommended I remove the [ and ] and also the trailing commas so that each line has valid JSON. That way I could just json.loads() each line. I removed the brackets as suggested. For the commas, I did this:

raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        no_comma = line[:-2]  # printed this to confirm the trailing comma was removed
        raw_tweets.append(json.loads(no_comma))

This is giving an error saying ValueError: Expecting ':' delimiter: line 1 column 2305 (char 2304).

To debug this, I printed the first line (i.e. I just said print(no_comma)) and noticed that what Python printed actually contained multiple tweets... When I open the file in an editor like UltraEdit, each tweet appears as a distinct line, so I assumed that each JSON object was separated by a newline character. But here, when I print the results while iterating line by line, I see that it's pulling in multiple tweets at once.

Should I be iterating differently? Is my method of removing the commas appropriate or should I be pre-processing the file separately?
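One alternative I'm considering is to stop relying on newlines entirely and pull objects out of a rolling buffer with json.JSONDecoder.raw_decode() (a sketch; the chunk size and the punctuation stripping are my guesses):

import json

decoder = json.JSONDecoder()

def iter_tweets(path, chunk_size=1 << 20):
    buf = ''
    with open(path, 'r', encoding='ISO-8859-1') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            while True:
                buf = buf.lstrip('[], \n\r\t')  # skip array punctuation between objects
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # object is incomplete; read more data
                yield obj
                buf = buf[end:]

for tweet in iter_tweets('fulldata.txt'):
    print(tweet.get('id_str'))
    break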

I'm pretty sure that my JSON is formatted poorly but I'm not sure why and how to go about fixing it. Here is a sample of my JSON data. If this isn't allowed, I'll remove it...

https://ufile.io/47b1


Answer 1:


I'm a VERY new user, but I might be able to offer a partial solution. I believe your formatting is off. You can't just import it as JSON without it being in JSON format. You should be able to fix this if you can get the tweets into a DataFrame (or separate DataFrames) and then use the DataFrame.to_json method. You WILL need pandas if it's not already installed.

Pandas - http://pandas.pydata.org/pandas-docs/stable/10min.html

Dataframe - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
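As a rough sketch of that route (not tested on your data): once the file is one JSON object per line, pandas can also read it back in manageable chunks, which sidesteps the memory problem. The lines and chunksize arguments assume a reasonably recent pandas, and the file name is just a placeholder:

import pandas as pd

# read the line-delimited file 10,000 tweets at a time
chunks = pd.read_json('one_json_per_line.txt', lines=True, chunksize=10000)
for chunk in chunks:
    # each chunk is a DataFrame of up to 10,000 tweets
    print(chunk['text'].str.contains('@HillaryClinton').sum())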




Answer 2:


Instead of storing the entire file as one big JSON array, put one JSON object per line for large datasets!

To fix the formatting, you should

  1. Remove the [ at the start of the file
  2. Remove the ] at the end of the file
  3. Remove the comma at the end of each line
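A one-pass cleanup sketch along those lines (this assumes each tweet really does sit on its own line in the original dump):

import json

with open('fulldata.txt', 'r', encoding='utf-8') as infile, \
     open('one_json_per_line.txt', 'w', encoding='utf-8') as outfile:
    for line in infile:
        line = line.strip().rstrip(',')  # drop whitespace and the trailing comma
        if not line or line in ('[', ']'):
            continue
        json.loads(line)  # raises ValueError if the line is not a complete JSON object
        outfile.write(line + '\n')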

Then you can read the file as so:

import json

with open('one_json_per_line.txt', 'r') as infile:
    for line in infile:
        data_row = json.loads(line)  # one tweet per iteration, never the whole file

I would suggest using a different storage format if possible; SQLite comes to mind.
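For example, a rough sketch of the SQLite route (the table layout and the choice of fields are just an illustration):

import json
import sqlite3

conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets '
             '(id TEXT PRIMARY KEY, created_at TEXT, screen_name TEXT, text TEXT)')

with open('one_json_per_line.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        t = json.loads(line)
        conn.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)',
                     (t['id_str'], t['created_at'],
                      t['user']['screen_name'], t['text']))
conn.commit()

# load once, query many times without re-reading 7GB:
count, = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE text LIKE '%@HillaryClinton%'").fetchone()
print(count)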



Source: https://stackoverflow.com/questions/39781716/loading-large-twitter-json-data-7gb-into-python
