Question
I am trying to clean just one column in some long, wide data sets. The data has 18 columns and more than 10k rows spread across roughly 100 CSV files, and I only want to clean a single column.
Input (only a few fields from the long list):
userLocation, userTimezone, Coordinates,
India, Hawaii, {u'type': u'Point', u'coordinates': [73.8567, 18.5203]}
California, USA
, New Delhi,
Ft. Sam Houston,Mountain Time (US & Canada),{u'type': u'Point', u'coordinates': [86.99643, 23.68088]}
Kathmandu,Nepal, Kathmandu, {u'type': u'Point', u'coordinates': [85.3248024, 27.69765658]}
Full input file: Dropbox link
Code:
import pandas as pd

data = pd.read_csv('input.csv')
# Lists of column names (not DataFrames):
df = ['tweetID', 'tweetText', 'tweetRetweetCt', 'tweetFavoriteCt',
      'tweetSource', 'tweetCreated', 'userID', 'userScreen',
      'userName', 'userCreateDt', 'userDesc', 'userFollowerCt',
      'userFriendsCt', 'userLocation', 'userTimezone', 'Coordinates',
      'GeoEnabled', 'Language']
df0 = ['Coordinates']
The other columns should be written to the output unchanged. After this, how should I proceed?
Desired output:
userLocation, userTimezone, Coordinate_one, Coordinate_two,
India, Hawaii, 73.8567, 18.5203
California, USA
, New Delhi,
Ft. Sam Houston,Mountain Time (US & Canada),86.99643, 23.68088
Kathmandu,Nepal, Kathmandu, 85.3248024, 27.69765658
The simplest possible suggestion, or a pointer to an example, would be very helpful.
Answer 1:
There are many things wrong here.
- The file is not a simple CSV and is not being parsed appropriately by your assumed data = pd.read_csv('input.csv').
- The 'Coordinates' field appears to be a JSON-like string (in fact a Python dict repr; note the u'' prefixes).
- There are NaNs in that same field.
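To see what that field actually contains, here is a quick check on a single cell (the sample value is copied from the question; ast.literal_eval is a safer stand-in for the eval used below):

import ast

cell = "{u'type': u'Point', u'coordinates': [73.8567, 18.5203]}"
parsed = ast.literal_eval(cell)  # parses the dict repr; json.loads would choke on it
print(parsed['coordinates'])     # [73.8567, 18.5203]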
This is what I've done so far. You'll want to do some work of your own to parse this file more appropriately.
import pandas as pd

df1 = pd.read_csv('./Turkey_28.csv')

# Pull out just the column we care about, keyed by tweetID
coords = df1[['tweetID', 'Coordinates']].set_index('tweetID')['Coordinates']
# The cells are Python dict reprs, so eval() parses them
# (ast.literal_eval would be a safer choice); drop the NaNs first
coords = coords.dropna().apply(lambda x: eval(x))
coords = coords[coords.apply(type) == dict]

def get_coords(x):
    return pd.Series(x['coordinates'], index=['Coordinate_one', 'Coordinate_two'])

coords = coords.apply(get_coords)
# Rejoin with the remaining columns; rows without coordinates are dropped here
df2 = pd.concat([coords, df1.set_index('tweetID').reindex(coords.index)], axis=1)

print(df2.head(2).T)
tweetID 714602054988275712
Coordinate_one 23.2745
Coordinate_two 56.6165
tweetText I'm at MK Appartaments in Dobele https://t.co/...
tweetRetweetCt 0
tweetFavoriteCt 0
tweetSource Foursquare
tweetCreated 2016-03-28 23:56:21
userID 782541481
userScreen MartinsKnops
userName Martins Knops
userCreateDt 2012-08-26 14:24:29
userDesc I See Them Try But They Can't Do What I Do. Be...
userFollowerCt 137
userFriendsCt 164
userLocation DOB Till I Die
userTimezone Casablanca
Coordinates {u'type': u'Point', u'coordinates': [23.274462...
GeoEnabled True
Language en
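Note that the reindex above drops the rows whose Coordinates is NaN, whereas the desired output in the question keeps them. If you need to keep those rows, a variant that parses in place might look like this; it is only a sketch building on df1 from above, and split_coords and output.csv are names I've made up:

import ast
import numpy as np
import pandas as pd

def split_coords(x):
    # Return the two coordinates if the cell parses to a dict, else NaNs
    if isinstance(x, str):
        try:
            d = ast.literal_eval(x)
            if isinstance(d, dict):
                return pd.Series(d['coordinates'],
                                 index=['Coordinate_one', 'Coordinate_two'])
        except (ValueError, SyntaxError):
            pass
    return pd.Series([np.nan, np.nan],
                     index=['Coordinate_one', 'Coordinate_two'])

df1[['Coordinate_one', 'Coordinate_two']] = df1['Coordinates'].apply(split_coords)
df1.drop('Coordinates', axis=1).to_csv('output.csv', index=False)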
Answer 2:
10K rows doesn't look at all like Big Data. How many columns do you have?
I don't understand your code; it is broken. But here is an easy example manipulation:
import pandas as pd

df = pd.read_csv('input.csv')
df['tweetID'] = df['tweetID'] + 1  # add 1
df.to_csv('output.csv', index=False)
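Since the question mentions roughly 100 CSV files, the same pattern extends to a simple loop. This is only a sketch; the data/ input directory and cleaned/ output directory are assumptions:

import glob
import os
import pandas as pd

os.makedirs('cleaned', exist_ok=True)
for path in glob.glob('data/*.csv'):
    df = pd.read_csv(path)
    df['tweetID'] = df['tweetID'] + 1  # your per-column cleaning goes here
    df.to_csv(os.path.join('cleaned', os.path.basename(path)), index=False)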
If your data doesn't fit into memory, you might consider using Dask.
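A minimal Dask sketch, assuming the files share one schema (the paths are again assumptions); dask.dataframe reads the glob lazily and processes partitions in parallel instead of loading everything at once:

import dask.dataframe as dd

ddf = dd.read_csv('data/*.csv')
ddf['tweetID'] = ddf['tweetID'] + 1       # same per-column cleaning, evaluated lazily
ddf.to_csv('cleaned-*.csv', index=False)  # '*' becomes the partition number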
Source: https://stackoverflow.com/questions/37255647/clean-one-column-from-long-and-big-data-set