Fast way to split column into multiple rows in Pandas

南笙 2021-02-02 00:07

I have the following data frame:

import pandas as pd
df = pd.DataFrame({ 'gene':["foo",
                            "bar // lal",
                                  


        
3 answers
  •  北恋 (original poster)
    2021-02-02 00:40

    TBH, I think we need a fast built-in way of normalizing elements like this, although I've been out of the loop for a bit, so for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:

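    import pandas as pd

    def create(n):
        # Build the 4-row sample frame and repeat it n times.
        df = pd.DataFrame({ 'gene':["foo",
                                    "bar // lal",
                                    "qux",
                                    "woz"],
                            'cell1':[5,9,1,7], 'cell2':[12,90,13,87]})
        df = df[["gene","cell1","cell2"]]
        df = pd.concat([df]*n)
        df = df.reset_index(drop=True)
        return df
    
    def orig(df):
        # Slow: builds one pd.Series per row before stacking.
        s = df["gene"].str.split(' // ').apply(pd.Series).stack()
        s.index = s.index.droplevel(-1)
        s.name = "Genes"
        del df["gene"]  # destructive: mutates the frame passed in
        return df.join(s)
    
    def faster(df):
        # Split into columns in one vectorized call, then stack into one long Series.
        s = df["gene"].str.split(' // ', expand=True).stack()
        i = s.index.get_level_values(0)  # original row label for each piece
        df2 = df.loc[i].copy()           # repeat each row once per piece
        df2["gene"] = s.values
        return df2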
    

    which gives me

    >>> df = create(1)
    >>> df
             gene  cell1  cell2
    0         foo      5     12
    1  bar // lal      9     90
    2         qux      1     13
    3         woz      7     87
    >>> %time orig(df.copy())
    CPU times: user 12 ms, sys: 0 ns, total: 12 ms
    Wall time: 10.2 ms
       cell1  cell2 Genes
    0      5     12   foo
    1      9     90   bar
    1      9     90   lal
    2      1     13   qux
    3      7     87   woz
    >>> %time faster(df.copy())
    CPU times: user 16 ms, sys: 0 ns, total: 16 ms
    Wall time: 12.4 ms
      gene  cell1  cell2
    0  foo      5     12
    1  bar      9     90
    1  lal      9     90
    2  qux      1     13
    3  woz      7     87
    

    for comparable speeds at low sizes, and

    >>> df = create(10000)
    >>> %timeit z = orig(df.copy())
    1 loops, best of 3: 14.2 s per loop
    >>> %timeit z = faster(df.copy())
    1 loops, best of 3: 231 ms per loop
    

    roughly a 60-fold speedup (14.2 s vs. 231 ms) in the larger case. Note that the only reason I'm using df.copy() here is that orig is destructive: it deletes the gene column from the frame it is given.
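
    On the "fast built-in way" wondered about above: pandas 0.25 and later provide DataFrame.explode, which turns each element of a list-valued cell into its own row. The following is only a minimal sketch, assuming pandas 0.25 or newer. The name with_explode is a hypothetical helper, not part of pandas or of the code above, and it has not been timed against the numbers shown here.

```python
import pandas as pd

# Same sample data as the 4-row frame built by create(1).
df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

def with_explode(df, column="gene", sep=" // "):
    """Hypothetical helper: split `column` on `sep`, one row per piece."""
    out = df.copy()                            # leave the caller's frame untouched
    out[column] = out[column].str.split(sep)   # each cell becomes a list of pieces
    return out.explode(column)                 # one row per list element

result = with_explode(df)
print(result)
```

    Unlike orig, this does not modify the input frame. The original row labels are repeated for split rows, as in the output of faster; passing ignore_index=True to explode (available from pandas 1.1) gives a fresh 0..n-1 index instead.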
