Questions about pandas: expanding multivalued column, inverting and grouping

无人及你 2021-01-06 18:32

I was looking into pandas to do some simple calculations for NLP and text mining, but I couldn't quite grasp how to do them.

Suppose I have the following data frame,

2 answers
  • 2021-01-06 18:59

    This method should generalize fairly well:

    In [100]: df
    Out[100]:
      gender          name firstname    shingles
    0      M      John Doe      John  [Joh, ohn]
    1      F  Mary Poppins      Mary  [Mar, ary]
    2      F      Jane Doe      Jane  [Jan, ane]
    3      M   John Cusack      John  [Joh, ohn]
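
    To follow along, a frame like the one above can be rebuilt by hand. This is only a sketch: the helper name `make_shingles` is made up for illustration, and the shingle length of 3 is taken from the output shown above.

```python
import pandas as pd


def make_shingles(word, n=3):
    """Return every overlapping substring of length n (hypothetical helper)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "name": ["John Doe", "Mary Poppins", "Jane Doe", "John Cusack"],
})
# Take the first whitespace-separated token as the first name.
df["firstname"] = df["name"].str.split().str[0]
# One list of 3-character shingles per row.
df["shingles"] = df["firstname"].apply(make_shingles)
print(df)
```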
    

    First create an "expanded" series where every entry is a shingle. Here, the index of the series is a MultiIndex whose first level is the shingle position and whose second level is the index of the original DataFrame:

    In [103]: s = df.shingles.apply(lambda x: pandas.Series(x)).unstack(); s
    Out[103]:
    0  0    Joh
       1    Mar
       2    Jan
       3    Joh
    1  0    ohn
       1    ary
       2    ane
       3    ohn
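
    It may help to look at the two halves of that expression separately. The sketch below uses a standalone series with the same shingles (the variable names `wide` and `shingles` are just for illustration): `apply` produces one column per shingle position, and `unstack` then moves those columns into the outer index level.

```python
import pandas as pd

shingles = pd.Series(
    [["Joh", "ohn"], ["Mar", "ary"], ["Jan", "ane"], ["Joh", "ohn"]],
    name="shingles",
)

# Step 1: each list becomes a row; columns 0 and 1 are the shingle positions.
wide = shingles.apply(lambda x: pd.Series(x))

# Step 2: pivot the columns into the index, giving a Series keyed by
# (shingle position, original row label).
s = wide.unstack()
print(s)
```

    If the lists had different lengths, the shorter rows would be padded with NaN in `wide`, so a `dropna()` on `s` would probably be needed in that case.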
    

    Next, join the created series back onto the original dataframe. First reset the index, dropping the shingle-position level, so that the series carries the original index with one entry per shingle. Joining it to the original dataframe produces:

    In [106]: df2 = df.join(pandas.DataFrame(s.reset_index(level=0, drop=True))); df2
    Out[106]:
      gender          name firstname    shingles    0
    0      M      John Doe      John  [Joh, ohn]  Joh
    0      M      John Doe      John  [Joh, ohn]  ohn
    1      F  Mary Poppins      Mary  [Mar, ary]  Mar
    1      F  Mary Poppins      Mary  [Mar, ary]  ary
    2      F      Jane Doe      Jane  [Jan, ane]  Jan
    2      F      Jane Doe      Jane  [Jan, ane]  ane
    3      M   John Cusack      John  [Joh, ohn]  Joh
    3      M   John Cusack      John  [Joh, ohn]  ohn
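
    On newer pandas (0.25 and later), `DataFrame.explode` gets to a similar result in one call. This is a different technique from the unstack-and-join approach above, shown only as a sketch; note that it replaces the list column with the individual shingles instead of keeping both, and the rename to `shingle` is my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "firstname": ["John", "Mary", "Jane", "John"],
    "shingles": [["Joh", "ohn"], ["Mar", "ary"], ["Jan", "ane"], ["Joh", "ohn"]],
})

# explode() repeats each row once per list element and keeps the original index.
df2 = df.explode("shingles").rename(columns={"shingles": "shingle"})
print(df2)
```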
    

    Finally, group by the shingle column (named 0 here), count the values of gender within each group, unstack the returned series, and fill the NaNs with zeroes:

    In [124]: df2.groupby(0, sort=False)['gender'].value_counts().unstack().fillna(0)
    Out[124]:
         F  M
    0
    Joh  0  2
    ohn  0  2
    Mar  1  0
    ary  1  0
    Jan  1  0
    ane  1  0
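
    The same kind of count table can also be built with `pandas.crosstab`, which fills missing combinations with 0 on its own. This is an alternative sketch rather than the method above; it assumes the expanded column has been given the name `shingle` instead of `0`, and it sorts the rows instead of keeping first-seen order.

```python
import pandas as pd

# Expanded frame: one row per (person, shingle) pair.
df2 = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "F", "F", "M", "M"],
    "shingle": ["Joh", "ohn", "Mar", "ary", "Jan", "ane", "Joh", "ohn"],
})

# Rows are shingles, columns are genders, cells are occurrence counts.
counts = pd.crosstab(df2["shingle"], df2["gender"])
print(counts)
```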
    
  • 2021-01-06 19:08

    It might be easier to create the expanded version at the time you create shingles. This question shows how you can use groupby to do this sort of expansion. Here's an example of what you can do after creating the "first name" column:

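    import pandas

    def shingles(table, n=3):
        # Each group holds a single person, so read the values from its first row.
        # (.irow() has been removed from pandas; .iloc[0] is the replacement.)
        word = table['first name'].iloc[0]
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        # Scalar values are broadcast against the list, giving one row per shingle.
        cols = {col: table[col].iloc[0] for col in table.columns}
        cols['shingle'] = grams
        return pandas.DataFrame(cols)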
    
    >>> df.groupby('name', group_keys=False).apply(shingles)
      first name gender          name shingle
    0       Jane      F      Jane Doe     Jan
    1       Jane      F      Jane Doe     ane
    0       John      M   John Cusack     Joh
    1       John      M   John Cusack     ohn
    0       John      M      John Doe     Joh
    1       John      M      John Doe     ohn
    0       Mary      F  Mary Poppins     Mar
    1       Mary      F  Mary Poppins     ary
    

    (I grouped by name here rather than by first name in case first names repeat; this assumes that each full name is unique.)

    From there you should be able to group and count whatever you like.
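
    For instance, the "inverting" part of the question could look roughly like this: a lookup from each shingle to the names that contain it. The frame below is typed in by hand to mirror the expanded output above, and `names_by_shingle` is just an illustrative name.

```python
import pandas as pd

# Hand-built copy of the expanded frame (one row per name and shingle).
expanded = pd.DataFrame({
    "name": ["Jane Doe", "Jane Doe", "John Cusack", "John Cusack",
             "John Doe", "John Doe", "Mary Poppins", "Mary Poppins"],
    "gender": ["F", "F", "M", "M", "M", "M", "F", "F"],
    "shingle": ["Jan", "ane", "Joh", "ohn", "Joh", "ohn", "Mar", "ary"],
})

# Invert the table: for every shingle, collect the names it occurs in.
names_by_shingle = expanded.groupby("shingle")["name"].apply(list)
print(names_by_shingle)
```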
