How to spread a column in a Pandas data frame

野性不改 2020-11-27 08:22

I have the following pandas data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame({
               'fc': [100,100,112,1.3,14,125],

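The snippet in the question is cut off in this copy. Judging by the tables printed in the answers, the input was presumably a long-format frame like the one below. Only the `fc` values come from the question itself; the `sample_id` and `gene_symbol` columns are an assumption inferred from the answers' output.

```python
import pandas as pd

# Assumed reconstruction of the question's input. The 'fc' values are from
# the question; the other two columns are inferred from the answers' output.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

# Long format: one row per (gene_symbol, sample_id) pair.
print(df)
```

The goal is then to turn each distinct `sample_id` into its own column, with one row per `gene_symbol`.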
2 Answers
  • 2020-11-27 08:31

    Use pivot or unstack:

    #df = df[['gene_symbol', 'sample_id', 'fc']]
    df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')
    print (df)
    sample_id       S1     S2
    gene_symbol              
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0
    

    # alternative: start again from the original long df
    df = df.set_index(['gene_symbol','sample_id'])['fc'].unstack(fill_value=0)
    print (df)
    sample_id       S1     S2
    gene_symbol              
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  125.0
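As a self-contained sketch of the two calls above: both start from the same long frame, so here the results get their own names (`wide_pivot` and `wide_unstack`, chosen for illustration) instead of overwriting `df`. The input data is the six-row frame assumed from the printed output.

```python
import pandas as pd

# Six-row long frame, assumed from the tables printed above.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

# Variant 1: pivot names the row key, the column key and the value column.
wide_pivot = df.pivot(index='gene_symbol', columns='sample_id', values='fc')

# Variant 2: move both keys into the index, then unstack the inner level.
# fill_value only matters when some (gene_symbol, sample_id) pair is missing.
wide_unstack = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```

The difference shows up with incomplete data: `pivot` leaves `NaN` for a missing pair, while `unstack(fill_value=0)` writes 0 there.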
    

    But if there are duplicates, you need pivot_table or aggregation with groupby; mean can be changed to sum, median, ...:

    df = pd.DataFrame({
                   'fc': [100,100,112,1.3,14,125, 100],
                   'sample_id': ['S1','S1','S1','S2','S2','S2', 'S2'],
                   'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
                   })
    print (df)
          fc gene_symbol sample_id
    0  100.0           a        S1
    1  100.0           b        S1
    2  112.0           c        S1
    3    1.3           a        S2
    4   14.0           b        S2
    5  125.0           c        S2 <- same c, S2, different fc
    6  100.0           c        S2 <- same c, S2, different fc
    
    df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')
    

    ValueError: Index contains duplicate entries, cannot reshape
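Before choosing an aggregation it can help to see which rows collide. A minimal sketch using `DataFrame.duplicated` on the two key columns; the names `keys` and `dupes` are just for illustration.

```python
import pandas as pd

# Seven-row frame from above: (c, S2) appears twice.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

keys = ['gene_symbol', 'sample_id']

# keep=False marks every row of a duplicated key pair, not only the repeats.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# pivot refuses to guess which of the colliding values to keep.
try:
    df.pivot(index='gene_symbol', columns='sample_id', values='fc')
except ValueError as err:
    print('pivot failed:', err)
```

If `dupes` is empty, plain `pivot` is safe; otherwise an aggregation has to decide how the colliding values are combined.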

    df = df.pivot_table(index='gene_symbol',columns='sample_id',values='fc', aggfunc='mean')
    print (df)
    sample_id       S1     S2
    gene_symbol              
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5
    

    # alternative to pivot_table: start again from the long df with duplicates
    df = df.groupby(['gene_symbol','sample_id'])['fc'].mean().unstack(fill_value=0)
    print (df)
    sample_id       S1     S2
    gene_symbol              
    a            100.0    1.3
    b            100.0   14.0
    c            112.0  112.5
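To illustrate swapping the aggregation, here is a sketch with `sum` through `pivot_table` and `max` through `groupby`, on the same seven-row frame. The result names `by_sum` and `by_max` are made up for this example.

```python
import pandas as pd

# Seven-row frame from above: (c, S2) has the values 125 and 100.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# Sum the colliding values: (c, S2) becomes 125 + 100.
by_sum = df.pivot_table(index='gene_symbol', columns='sample_id',
                        values='fc', aggfunc='sum')

# Keep the largest colliding value: (c, S2) becomes 125.
by_max = df.groupby(['gene_symbol', 'sample_id'])['fc'].max().unstack(fill_value=0)

print(by_sum)
print(by_max)
```

Only the duplicated cell changes between aggregations; cells backed by a single row keep their original value.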
    

    EDIT:

    To clean up the result, set the columns name to None and call reset_index:

    df.columns.name = None
    df = df.reset_index()
    print (df)
      gene_symbol     S1     S2
    0           a  100.0    1.3
    1           b  100.0   14.0
    2           c  112.0  112.5
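The reshape and the cleanup can be wrapped together. A minimal sketch, assuming a helper named `spread` (a hypothetical name, not a pandas function) that builds on `pivot_table` so duplicates are handled too:

```python
import pandas as pd

def spread(long_df, key, column, value, aggfunc='mean'):
    """Hypothetical helper: long to wide, returned with a flat header."""
    wide = long_df.pivot_table(index=key, columns=column,
                               values=value, aggfunc=aggfunc)
    wide.columns.name = None   # drop the 'sample_id' label above the header
    return wide.reset_index()  # turn the key back into an ordinary column

# Seven-row frame from above, including the duplicated (c, S2) pair.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

flat = spread(df, 'gene_symbol', 'sample_id', 'fc')
print(flat)
```

Passing `aggfunc='sum'` or `aggfunc='median'` changes how duplicated pairs are combined without touching the cleanup steps.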
    
  • 2020-11-27 08:55

    You can also use the pd.crosstab() function:

    In [82]: pd.crosstab(index=df.gene_symbol, columns=df.sample_id, 
                         values=df.fc, aggfunc='mean') \
        ...:   .rename_axis(None, axis=1) \
        ...:   .reset_index()
        ...:
    Out[82]:
      gene_symbol     S1     S2
    0           a  100.0    1.3
    1           b  100.0   14.0
    2           c  112.0  125.0
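The output above matches the six-row frame without duplicates. As a sketch of how `crosstab` behaves on the seven-row frame with the duplicated (c, S2) pair: with `values` and `aggfunc` given it aggregates the same way `pivot_table` does. The names `ct` and `pt` are illustrative.

```python
import pandas as pd

# Seven-row frame with the duplicated (c, S2) pair.
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# crosstab takes the columns themselves (Series), not column names.
ct = pd.crosstab(index=df['gene_symbol'], columns=df['sample_id'],
                 values=df['fc'], aggfunc='mean')

# The same aggregation through pivot_table, for comparison.
pt = df.pivot_table(index='gene_symbol', columns='sample_id',
                    values='fc', aggfunc='mean')

print(ct)
print((ct.values == pt.values).all())
```

Without `values` and `aggfunc`, `crosstab` counts occurrences instead, which is not what this question asks for.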
    