Process multiple-answer questionnaire (from Google Forms) results with pandas

问题

I have a Google Form which I am using to collect survey data (for this question I'll be using an example form) which has questions which can have multiple answers, selected using a set of checkboxes.

When I get the data from the form and import it into pandas I get this:

             Timestamp    What sweets do you like?
0  23/11/2013 13:22:30  Chocolate, Toffee, Popcorn
1  23/11/2013 13:22:34                   Chocolate
2  23/11/2013 13:22:39      Toffee, Popcorn, Fruit
3  23/11/2013 13:22:45               Fudge, Toffee
4  23/11/2013 13:22:48                     Popcorn

I'd like to do statistics on the results of the question (how many people liked Chocolate, what proportion of people liked Toffee etc). The problem is, that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.

Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Chocolate, Toffee, Popcorn, Fudge and Fruit, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure whether it would really help (doing the aggregations that I want to do might be harder in that way).

回答1:

Read in as a fixed width table, droping the first column

In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)

In [31]: df
Out[31]: 
             Timestamp What sweets do you like0
0  23/11/2013 13:22:34                Chocolate
1  23/11/2013 13:22:39   Toffee, Popcorn, Fruit
2  23/11/2013 13:22:45            Fudge, Toffee
3  23/11/2013 13:22:48                  Popcorn

Make the timestamp into a proper datetime64 dtype (not necessary for this exercise), but almost always what you want.

In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])

New column names

In [33]: df.columns = ['date','sweets']

In [34]: df
Out[34]: 
                 date                  sweets
0 2013-11-23 13:22:34               Chocolate
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45           Fudge, Toffee
3 2013-11-23 13:22:48                 Popcorn

In [35]: df.dtypes
Out[35]: 
date      datetime64[ns]
sweets            object
dtype: object

Split the sweet column from a string into a list

In [37]: df['sweets'].str.split(',\s*')
Out[37]: 
0                 [Chocolate]
1    [Toffee, Popcorn, Fruit]
2             [Fudge, Toffee]
3                   [Popcorn]
Name: sweets, dtype: object

The key step, this creates a dummy matrix for where the values exist

In [38]: df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x))
Out[38]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1    NaN    NaN      NaN     NaN
1        NaN      1    NaN        1       1
2        NaN    NaN      1      NaN       1
3        NaN    NaN    NaN        1     NaN

Final result where we fill the nans to 0, then astype to bool to make True/False. Then concatate it to the original frame

In [40]: pd.concat([df,df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]: 
                 date                  sweets Chocolate  Fruit  Fudge Popcorn Toffee
0 2013-11-23 13:22:34               Chocolate      True  False  False   False  False
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit     False   True  False    True   True
2 2013-11-23 13:22:45           Fudge, Toffee     False  False   True   False   True
3 2013-11-23 13:22:48                 Popcorn     False  False  False    True  False

回答2:

How about something like this:

#Create some data
import pandas as pd
import numpy as np
Foods = ['Chocolate, Toffee, Popcorn', 'Chocolate', 'Toffee, Popcorn, Fruit', 'Fudge,     Toffee', 'Popcorn']
Dates = ['23/11/2013 13:22:30', '23/11/2013 13:22:34', '23/11/2013 13:22:39', '23/11/2013 13:22:45', '23/11/2013 13:22:48']
DF = pd.DataFrame(Foods, index = Dates, columns = ['Sweets'])

#create unique list of foods
UniqueFoods = ['Chocolate', 'Toffee', 'Popcorn', 'Fruit']

# Create new data frame withy columns for each food type, with indenitcal index
DFTransformed = pd.DataFrame({Food: 0 for Food in UniqueFoods}, index = DF.index)

 #iterate through your data and modify the second data frame according to your wishes
for row in DF.index:
    for Food in UniqueFoods:
        if Food in DF['Sweets'][row]:
            DFTransformed[Food][row] = 1
DFTransformed

来源：https://stackoverflow.com/questions/20162926/process-multiple-answer-questionnaire-from-google-forms-results-with-pandas

标签

python

pandas

google-form