How to normalize multiple columns of dicts in a pandas dataframe

Question

I am new to coding, so I realize this may be a very basic question.

I have a dataframe like this:

df

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}
 5             5  2021-01-14 20:00:00  Arsenal        Crystal Palace  {'1': 1500, 'X': 4000, '2': 6500}   {'yes': 1950, 'no': 1800}  {'1X': 1110, '12': 1220, '2X': 2500}
 6             6  2021-01-15 20:00:00  Fulham         Chelsea         {'1': 5750, 'X': 4330, '2': 1530}   {'yes': 1800, 'no': 1950}  {'1X': 2370, '12': 1200, '2X': 1140}
 7             7  2021-01-16 12:30:00  Wolverhampton  West Brom       {'1': 1440, 'X': 4200, '2': 7500}   {'yes': 2250, 'no': 1570}  {'1X': 1100, '12': 1220, '2X': 2620}
 8             8  2021-01-16 15:00:00  Leeds          Brighton        {'1': 2000, 'X': 3600, '2': 3600}   {'yes': 1530, 'no': 2370}  {'1X': 1280, '12': 1280, '2X': 1720}

I would like to split each column of dictionaries into separate columns. For example, the full_time_result column would become full_time_result_1, full_time_result_X and full_time_result_2, and likewise for both_teams_to_score and double_chance, as below:

      Unnamed: 0  time  home_team  away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X
--  ------------  ----  ---------  ---------  ------------------  ------------------  ------------------  -----------------------  ----------------------  ----------------

I am following this example, but I cannot get it to work. Here is my code, followed by the output it prints:

import pandas as pd
from tabulate import tabulate
df = pd.read_csv(r'C:\Users\Harshad\Desktop\re.csv')
df['full_time_result'] = df['full_time_result'].apply(pd.Series)
print(tabulate(df, headers='keys'))

      Unnamed: 0  time                 home_team      away_team       full_time_result                    both_teams_to_score        double_chance
--  ------------  -------------------  -------------  --------------  ----------------------------------  -------------------------  ------------------------------------
 0             0  2021-01-12 18:00:00  Sheff Utd      Newcastle       {'1': 2400, 'X': 3200, '2': 3100}   {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
 1             1  2021-01-12 20:15:00  Burnley        Man Utd         {'1': 7000, 'X': 4500, '2': 1440}   {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}
 2             2  2021-01-12 20:15:00  Wolverhampton  Everton         {'1': 2450, 'X': 3200, '2': 3000}   {'yes': 1950, 'no': 1800}  {'1X': 1360, '12': 1360, '2X': 1530}
 3             3  2021-01-13 18:00:00  Man City       Brighton        {'1': 1180, 'X': 6500, '2': 14000}  {'yes': 2040, 'no': 1700}  {'1X': 1040, '12': 1110, '2X': 4500}
 4             4  2021-01-13 20:15:00  Aston Villa    Tottenham       {'1': 2620, 'X': 3500, '2': 2500}   {'yes': 1570, 'no': 2250}  {'1X': 1500, '12': 1280, '2X': 1440}

Help would be greatly appreciated.

Answer 1:

- Verify that the columns are dict type, not str type.
  - If the columns are str type, convert them with ast.literal_eval.
- Use pandas.json_normalize() to normalize each column of dicts.
- Use a list comprehension to rename the columns.
- Use pandas.concat() with axis=1 to combine the dataframes.
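
The first step matters because a dict written to a CSV file comes back from pd.read_csv as a plain string, which is presumably why the .apply(pd.Series) attempt in the question left the column unchanged. The following is a minimal sketch of that round trip; the one-row frame and the in-memory buffer are illustrative assumptions, not the asker's actual file.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical one-row frame holding a real dict (not the asker's data file).
original = pd.DataFrame({"full_time_result": [{"1": 2400, "X": 3200, "2": 3100}]})
print(type(original.loc[0, "full_time_result"]))

# Round-trip through CSV text: the dict is written as its string representation.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

value = reloaded.loc[0, "full_time_result"]
print(type(value))

# literal_eval safely parses the string back into a dict.
parsed = literal_eval(value)
print(type(parsed))
```

Checking type() on a single cell, as done further down with df.iloc[0, 3], is usually enough to tell which case applies.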

import pandas as pd
from ast import literal_eval

# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)

# display(df.head(2))
                  time  home_team  away_team                   full_time_result        both_teams_to_score                         double_chance
0  2021-01-12 18:00:00  Sheff Utd  Newcastle  {'1': 2400, 'X': 3200, '2': 3100}  {'yes': 2000, 'no': 1750}  {'1X': 1360, '12': 1360, '2X': 1530}
1  2021-01-12 20:15:00    Burnley    Man Utd  {'1': 7000, 'X': 4500, '2': 1440}  {'yes': 1900, 'no': 1900}  {'1X': 2620, '12': 1180, '2X': 1100}

# convert time to datetime
df.time = pd.to_datetime(df.time)

# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
str

# convert columns from str to dict only if the columns are str type
# (in pandas >= 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map)
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]

btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]

dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]

# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)

# display(df_normalized)

                 time      home_team       away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X  double_chance_12  double_chance_2X
0 2021-01-12 18:00:00      Sheff Utd       Newcastle                2400                3200                3100                     2000                    1750              1360              1360              1530
1 2021-01-12 20:15:00        Burnley         Man Utd                7000                4500                1440                     1900                    1900              2620              1180              1100
2 2021-01-12 20:15:00  Wolverhampton         Everton                2450                3200                3000                     1950                    1800              1360              1360              1530
3 2021-01-13 18:00:00       Man City        Brighton                1180                6500               14000                     2040                    1700              1040              1110              4500
4 2021-01-13 20:15:00    Aston Villa       Tottenham                2620                3500                2500                     1570                    2250              1500              1280              1440
5 2021-01-14 20:00:00        Arsenal  Crystal Palace                1500                4000                6500                     1950                    1800              1110              1220              2500
6 2021-01-15 20:00:00         Fulham         Chelsea                5750                4330                1530                     1800                    1950              2370              1200              1140
7 2021-01-16 12:30:00  Wolverhampton       West Brom                1440                4200                7500                     2250                    1570              1100              1220              2620
8 2021-01-16 15:00:00          Leeds        Brighton                2000                3600                3600                     1530                    2370              1280              1280              1720

Consolidated Code

# convert the columns to dict type if they are str type
# (in pandas >= 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map)
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize all columns
df_list = []

for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)

# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
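
If the same steps are needed for several dataframes, they could be wrapped in a helper. The sketch below is one possible way to do that, not part of the original answer: the function name normalize_dict_columns, its parameters, and the two-row sample frame are all hypothetical. It selects the dict columns by name rather than by position, parses only the cells that are still strings, and resets the index so the pieces line up with the RangeIndex that pd.json_normalize returns.

```python
from ast import literal_eval

import pandas as pd


def normalize_dict_columns(df, dict_cols):
    """Expand each column of dicts (or dict-like strings) into prefixed columns.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Keep the untouched columns; reset the index to match json_normalize output.
    parts = [df.drop(columns=dict_cols).reset_index(drop=True)]
    for col in dict_cols:
        # Parse only the values that are still strings; leave real dicts alone.
        values = [literal_eval(v) if isinstance(v, str) else v for v in df[col]]
        expanded = pd.json_normalize(values)
        expanded.columns = [f"{col}_{c}" for c in expanded.columns]
        parts.append(expanded)
    return pd.concat(parts, axis=1)


# Small sample mixing a str cell and a dict cell to exercise both branches.
sample = pd.DataFrame({
    "home_team": ["Sheff Utd", "Burnley"],
    "both_teams_to_score": ["{'yes': 2000, 'no': 1750}", {"yes": 1900, "no": 1900}],
})
result = normalize_dict_columns(sample, ["both_teams_to_score"])
print(result)
```

For the dataframe in this answer, the call would be normalize_dict_columns(df, ['full_time_result', 'both_teams_to_score', 'double_chance']).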

Source: https://stackoverflow.com/questions/65588159/how-to-normalize-multiple-columns-of-dicts-in-a-pandas-dataframe

Tags

python

pandas

formatting