I have data saved in a postgreSQL
database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dat
To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval)
. Afterwards, the solution below can be used to convert the dict to different columns.
Using a small example, you can use .apply(pd.Series)
:
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
In [3]: df
Out[3]:
a b
0 1 {u'c': 1}
1 2 {u'd': 3}
2 3 {u'c': 5, u'd': 6}
In [4]: df['b'].apply(pd.Series)
Out[4]:
c d
0 1.0 NaN
1 NaN 3.0
2 5.0 6.0
To combine it with the rest of the dataframe, you can concat
the other columns with the above result:
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Using your code, this also works if I leave out the iloc
part:
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Merlin's answer is better and super easy, but we don't need a lambda function. The evaluation of dictionary can be safely ignored by either of the following two ways as illustrated below:
Way 1: Two steps
# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)
df_pol_ps:
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
# step 2: concat columns `a, b, c` and drop/remove the `Pollutants`
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Way 2: The above two steps can be combined in one go:
df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Try this: The data returned from SQL has to converted into a Dict.
or could it be "Pollutant Levels"
is now Pollutants'
StationID Pollutants
0 8809 {"a":"46","b":"3","c":"12"}
1 8810 {"a":"36","b":"5","c":"8"}
2 8811 {"b":"2","c":"7"}
3 8812 {"c":"11"}
4 8813 {"a":"82","c":"15"}
df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
>>> df
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
speed comparison for a large dataset of 10 million rows
>>> df = pd.concat([df]*100000).reset_index(drop=True)
>>> df = pd.concat([df]*20).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)
def json_normalise_drop(df):
return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)
def tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)
def vlues_tolist_drop(df):
return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)
def pop_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))
def pop_values_tolist(df):
return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit vlues_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
+---------------------+-----------+
| apply_drop | 53min 20s |
| json_normalise_drop | 54.9 s |
| tolist_drop | 6.62 s |
| vlues_tolist_drop | 6.63 s |
| pop_tolist | 5.99 s |
| pop_values_tolist | 5.94 s |
+---------------------+-----------+
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
is the fastest
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)
You can use join
with pop
+ tolist
. Performance is comparable to concat
with drop
+ tolist
, but some may find this syntax cleaner:
res = df.join(pd.DataFrame(df.pop('b').tolist()))
Benchmarking with other methods:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
def joris1(df):
return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)
def joris2(df):
return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
def jpp(df):
return df.join(pd.DataFrame(df.pop('b').tolist()))
df = pd.concat([df]*1000, ignore_index=True)
%timeit joris1(df.copy()) # 1.33 s per loop
%timeit joris2(df.copy()) # 7.42 ms per loop
%timeit jpp(df.copy()) # 7.68 ms per loop