Question
I am new to coding, and I understand that this is a very basic question.
I have a dataframe as:
df
Unnamed: 0 time home_team away_team full_time_result both_teams_to_score double_chance
-- ------------ ------------------- ------------- -------------- ---------------------------------- ------------------------- ------------------------------------
0 0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}
2 2 2021-01-12 20:15:00 Wolverhampton Everton {'1': 2450, 'X': 3200, '2': 3000} {'yes': 1950, 'no': 1800} {'1X': 1360, '12': 1360, '2X': 1530}
3 3 2021-01-13 18:00:00 Man City Brighton {'1': 1180, 'X': 6500, '2': 14000} {'yes': 2040, 'no': 1700} {'1X': 1040, '12': 1110, '2X': 4500}
4 4 2021-01-13 20:15:00 Aston Villa Tottenham {'1': 2620, 'X': 3500, '2': 2500} {'yes': 1570, 'no': 2250} {'1X': 1500, '12': 1280, '2X': 1440}
5 5 2021-01-14 20:00:00 Arsenal Crystal Palace {'1': 1500, 'X': 4000, '2': 6500} {'yes': 1950, 'no': 1800} {'1X': 1110, '12': 1220, '2X': 2500}
6 6 2021-01-15 20:00:00 Fulham Chelsea {'1': 5750, 'X': 4330, '2': 1530} {'yes': 1800, 'no': 1950} {'1X': 2370, '12': 1200, '2X': 1140}
7 7 2021-01-16 12:30:00 Wolverhampton West Brom {'1': 1440, 'X': 4200, '2': 7500} {'yes': 2250, 'no': 1570} {'1X': 1100, '12': 1220, '2X': 2620}
8 8 2021-01-16 15:00:00 Leeds Brighton {'1': 2000, 'X': 3600, '2': 3600} {'yes': 1530, 'no': 2370} {'1X': 1280, '12': 1280, '2X': 1720}
I would like to expand each dictionary column into separate columns. For example, full_time_result would be split into full_time_result_1, full_time_result_X and full_time_result_2, and likewise for both_teams_to_score and double_chance, as below:
Unnamed: 0 time home_team away_team full_time_result_1 full_time_result_X full_time_result_2 both_teams_to_score_yes both_teams_to_score_no double_chance_1X
-- ------------ ------------------- ------------- -------------- ---------------------------------- ------------------------- ------------------------------------
I am following this example given here but I am unable to get it to work. Here is my code:
import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r'C:\Users\Harshad\Desktop\re.csv')
df['full_time_result'] = df['full_time_result'].apply(pd.Series)
print(tabulate(df, headers='keys'))
Unnamed: 0 time home_team away_team full_time_result both_teams_to_score double_chance
-- ------------ ------------------- ------------- -------------- ---------------------------------- ------------------------- ------------------------------------
0 0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}
2 2 2021-01-12 20:15:00 Wolverhampton Everton {'1': 2450, 'X': 3200, '2': 3000} {'yes': 1950, 'no': 1800} {'1X': 1360, '12': 1360, '2X': 1530}
3 3 2021-01-13 18:00:00 Man City Brighton {'1': 1180, 'X': 6500, '2': 14000} {'yes': 2040, 'no': 1700} {'1X': 1040, '12': 1110, '2X': 4500}
4 4 2021-01-13 20:15:00 Aston Villa Tottenham {'1': 2620, 'X': 3500, '2': 2500} {'yes': 1570, 'no': 2250} {'1X': 1500, '12': 1280, '2X': 1440}
Help would be greatly appreciated.
Answer 1:
- Verify the columns are dict type, and not str type.
  - If the columns are str type, convert them with ast.literal_eval.
- Use pandas.json_normalize() to normalize each column of dicts.
- Use a list-comprehension to rename the columns.
- Use pandas.concat() with axis=1 to combine the dataframes.
import pandas as pd
from ast import literal_eval
# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)
# display(df.head(2))
time home_team away_team full_time_result both_teams_to_score double_chance
0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}
# convert time to datetime
df.time = pd.to_datetime(df.time)
# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
<class 'str'>
# convert columns from str to dict only if the columns are str type
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there)
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
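Since the question loads the data from a CSV file, the str-to-dict conversion could also be done while reading the file, using the converters parameter of pandas.read_csv(). This is only a minimal sketch: the inline csv_text is a made-up, one-row stand-in for the re.csv file from the question, and the names csv_text, dict_cols and df_csv are chosen here for illustration.

```python
import io
from ast import literal_eval

import pandas as pd

# hypothetical one-row stand-in for the CSV file from the question
csv_text = '''time,home_team,away_team,full_time_result,both_teams_to_score,double_chance
2021-01-12 18:00:00,Sheff Utd,Newcastle,"{'1': 2400, 'X': 3200, '2': 3100}","{'yes': 2000, 'no': 1750}","{'1X': 1360, '12': 1360, '2X': 1530}"
'''

# the columns assumed to hold dict-like strings
dict_cols = ['full_time_result', 'both_teams_to_score', 'double_chance']

# converters calls literal_eval on every cell of the named columns while parsing
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    converters={c: literal_eval for c in dict_cols},
    parse_dates=['time'],
)

# the cells are now dicts instead of strings
print(type(df_csv.loc[0, 'full_time_result']))
```

With the columns parsed at read time, the separate literal_eval step is not needed before normalizing.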
# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]
btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]
dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]
# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)
display(df_normalized)
time home_team away_team full_time_result_1 full_time_result_X full_time_result_2 both_teams_to_score_yes both_teams_to_score_no double_chance_1X double_chance_12 double_chance_2X
0 2021-01-12 18:00:00 Sheff Utd Newcastle 2400 3200 3100 2000 1750 1360 1360 1530
1 2021-01-12 20:15:00 Burnley Man Utd 7000 4500 1440 1900 1900 2620 1180 1100
2 2021-01-12 20:15:00 Wolverhampton Everton 2450 3200 3000 1950 1800 1360 1360 1530
3 2021-01-13 18:00:00 Man City Brighton 1180 6500 14000 2040 1700 1040 1110 4500
4 2021-01-13 20:15:00 Aston Villa Tottenham 2620 3500 2500 1570 2250 1500 1280 1440
5 2021-01-14 20:00:00 Arsenal Crystal Palace 1500 4000 6500 1950 1800 1110 1220 2500
6 2021-01-15 20:00:00 Fulham Chelsea 5750 4330 1530 1800 1950 2370 1200 1140
7 2021-01-16 12:30:00 Wolverhampton West Brom 1440 4200 7500 2250 1570 1100 1220 2620
8 2021-01-16 15:00:00 Leeds Brighton 2000 3600 3600 1530 2370 1280 1280 1720
Consolidated Code
# convert the columns to dict type if they are str type
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there)
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
# normalize all columns
df_list = list()
for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)
# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
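The consolidated code relies on column positions (everything from index 3 onward). As a rough sketch of the same idea keyed on column names instead, the steps could be wrapped in a helper. The function name normalize_dict_columns and the small sample frame are assumptions made for this example, not part of the original answer, and DataFrame.add_prefix() is swapped in for the list-comprehension rename.

```python
from ast import literal_eval

import pandas as pd


def normalize_dict_columns(df, cols):
    """Expand each column of dicts (or dict-like strings) in cols into prefixed columns."""
    # keep the remaining columns; reset the index so it lines up with json_normalize output
    parts = [df.drop(columns=cols).reset_index(drop=True)]
    for col in cols:
        # parse only the cells that are still strings
        values = [literal_eval(v) if isinstance(v, str) else v for v in df[col]]
        # one column per dict key, prefixed with the source column name
        parts.append(pd.json_normalize(values).add_prefix(f'{col}_'))
    return pd.concat(parts, axis=1)


# small sample: one column of dict-like strings, one column of real dicts
sample = pd.DataFrame({
    'home_team': ['Sheff Utd', 'Burnley'],
    'away_team': ['Newcastle', 'Man Utd'],
    'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}"],
    'both_teams_to_score': [{'yes': 2000, 'no': 1750}, {'yes': 1900, 'no': 1900}],
})

result = normalize_dict_columns(sample, ['full_time_result', 'both_teams_to_score'])
print(result.columns.tolist())
```

The index is reset because pandas.json_normalize() returns a fresh 0..n-1 index, so concatenating with axis=1 against a filtered or re-indexed dataframe would otherwise misalign the rows.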
Source: https://stackoverflow.com/questions/65588159/how-to-normalize-multiple-columns-of-dicts-in-a-pandas-dataframe