问题
I have a Google Form which I am using to collect survey data (for this question I'll be using an example form) which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the form and import it into pandas I get this:
Timestamp What sweets do you like?
0 23/11/2013 13:22:30 Chocolate, Toffee, Popcorn
1 23/11/2013 13:22:34 Chocolate
2 23/11/2013 13:22:39 Toffee, Popcorn, Fruit
3 23/11/2013 13:22:45 Fudge, Toffee
4 23/11/2013 13:22:48 Popcorn
I'd like to do statistics on the results of the question (how many people liked Chocolate, what proportion of people liked Toffee etc). The problem is, that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Chocolate, Toffee, Popcorn, Fudge and Fruit, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure whether it would really help (doing the aggregations that I want to do might be harder in that way).
回答1:
Read in as a fixed width table, droping the first column
In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)
In [31]: df
Out[31]:
Timestamp What sweets do you like0
0 23/11/2013 13:22:34 Chocolate
1 23/11/2013 13:22:39 Toffee, Popcorn, Fruit
2 23/11/2013 13:22:45 Fudge, Toffee
3 23/11/2013 13:22:48 Popcorn
Make the timestamp into a proper datetime64 dtype (not necessary for this exercise), but almost always what you want.
In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
New column names
In [33]: df.columns = ['date','sweets']
In [34]: df
Out[34]:
date sweets
0 2013-11-23 13:22:34 Chocolate
1 2013-11-23 13:22:39 Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45 Fudge, Toffee
3 2013-11-23 13:22:48 Popcorn
In [35]: df.dtypes
Out[35]:
date datetime64[ns]
sweets object
dtype: object
Split the sweet column from a string into a list
In [37]: df['sweets'].str.split(',\s*')
Out[37]:
0 [Chocolate]
1 [Toffee, Popcorn, Fruit]
2 [Fudge, Toffee]
3 [Popcorn]
Name: sweets, dtype: object
The key step, this creates a dummy matrix for where the values exist
In [38]: df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x))
Out[38]:
Chocolate Fruit Fudge Popcorn Toffee
0 1 NaN NaN NaN NaN
1 NaN 1 NaN 1 1
2 NaN NaN 1 NaN 1
3 NaN NaN NaN 1 NaN
Final result where we fill the nans to 0, then astype to bool to make True/False. Then concatate it to the original frame
In [40]: pd.concat([df,df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]:
date sweets Chocolate Fruit Fudge Popcorn Toffee
0 2013-11-23 13:22:34 Chocolate True False False False False
1 2013-11-23 13:22:39 Toffee, Popcorn, Fruit False True False True True
2 2013-11-23 13:22:45 Fudge, Toffee False False True False True
3 2013-11-23 13:22:48 Popcorn False False False True False
回答2:
How about something like this:
#Create some data
import pandas as pd
import numpy as np
Foods = ['Chocolate, Toffee, Popcorn', 'Chocolate', 'Toffee, Popcorn, Fruit', 'Fudge, Toffee', 'Popcorn']
Dates = ['23/11/2013 13:22:30', '23/11/2013 13:22:34', '23/11/2013 13:22:39', '23/11/2013 13:22:45', '23/11/2013 13:22:48']
DF = pd.DataFrame(Foods, index = Dates, columns = ['Sweets'])
#create unique list of foods
UniqueFoods = ['Chocolate', 'Toffee', 'Popcorn', 'Fruit']
# Create new data frame withy columns for each food type, with indenitcal index
DFTransformed = pd.DataFrame({Food: 0 for Food in UniqueFoods}, index = DF.index)
#iterate through your data and modify the second data frame according to your wishes
for row in DF.index:
for Food in UniqueFoods:
if Food in DF['Sweets'][row]:
DFTransformed[Food][row] = 1
DFTransformed
来源:https://stackoverflow.com/questions/20162926/process-multiple-answer-questionnaire-from-google-forms-results-with-pandas