Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 Answers
  • 2020-11-21 05:10

    After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

    If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.

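    import pandas as pd

    # Split var1 into one column per piece, keep var2 as the index, then stack the pieces into rows
    b = pd.DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
    b = b.reset_index()[[0, 'var2']]  # the stacked var1 values are currently labeled 0
    b.columns = ['var1', 'var2']  # rename column 0 to var1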
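
    For anyone who wants to run this end to end, here is a minimal sketch. It assumes the two-row frame a that other answers in this thread use; the intermediate name wide is only for illustration.

```python
import pandas as pd

# The example frame used in other answers in this thread.
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# One column per piece, with var2 carried along as the index.
wide = pd.DataFrame(a.var1.str.split(",").tolist(), index=a.var2)

# stack() turns the piece columns into rows; the values end up in column 0.
b = wide.stack().reset_index()[[0, "var2"]]
b.columns = ["var1", "var2"]

print(b["var1"].tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
print(b["var2"].tolist())  # [1, 1, 1, 2, 2, 2]
```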
    
  • 2020-11-21 05:11

    TL;DR

    import pandas as pd
    import numpy as np
    
    def explode_str(df, col, sep):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
    
    def explode_list(df, col):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.len())
        return df.iloc[i].assign(**{col: np.concatenate(s)})
    

    Demonstration

    explode_str(a, 'var1', ',')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
    

    Let's create a new dataframe d that has lists

    d = a.assign(var1=lambda d: d.var1.str.split(','))
    
    explode_list(d, 'var1')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
    

    General Comments

    I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.

    FAQ

    Why don't I use loc?

    Because the index may not be unique and using loc will return every row that matches a queried index.
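
    A small sketch of the difference, using a hypothetical frame whose index label 0 appears twice:

```python
import pandas as pd

# Hypothetical frame with a duplicated index label (0 appears twice).
df = pd.DataFrame(
    {"var1": ["a,b", "c", "d,e"], "var2": [1, 2, 3]},
    index=[0, 0, 1],
)

# Label-based lookup returns every row whose label is 0.
by_label = df.loc[[0]]

# Position-based lookup returns exactly the first row.
by_position = df.iloc[[0]]

print(len(by_label))     # 2
print(len(by_position))  # 1
```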

    Why don't you use the values attribute and slice that?

    When calling values, if the entirety of the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise, Pandas has to cobble together a new array, and that array must have a single uniform dtype, which often means object. By using iloc instead of slicing the values attribute, I avoid having to deal with that.
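
    The dtype fallback is easy to see on the thread's example frame. This is only an illustration of the dtype behaviour, not a statement about when a view or a copy is returned:

```python
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# String and integer columns cannot share one numeric array,
# so the combined array falls back to the object dtype.
mixed = a.values
print(mixed.dtype)  # object

# A frame holding only the integer column keeps an integer dtype.
ints = a[["var2"]].values
print(ints.dtype.kind)  # i
```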

    Why do you use assign?

    When I use assign using the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.
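
    A quick sketch of that behaviour; the column name var1_upper is made up for the comparison:

```python
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# Overwriting an existing column leaves the column order unchanged.
same_name = a.assign(var1=a["var1"].str.upper())
print(list(same_name.columns))  # ['var1', 'var2']

# A new (hypothetical) name is appended at the end instead.
new_name = a.assign(var1_upper=a["var1"].str.upper())
print(list(new_name.columns))  # ['var1', 'var2', 'var1_upper']
```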

    Why are the index values repeated?

    By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element of the list or string.
    This can be reset with reset_index(drop=True)
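
    For example, with the explode_str function from the TL;DR (repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    # Same function as in the TL;DR of this answer.
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

exploded = explode_str(a, "var1", ",")
print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 1]

# Replace the repeated labels with a fresh 0..n-1 index.
renumbered = exploded.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```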


    For Strings

    I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument assuming that if I were to split, the length of the resulting list would be one more than the number of separators.

    I then use that sep to join the strings then split.

    def explode_str(df, col, sep):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
        return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
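
    To make the steps concrete, here is a sketch of the intermediate values on the thread's example frame. The last part is an addition that is not in the answer: Series.str.count interprets its argument as a regular expression, so a separator such as "|" would need escaping before being counted.

```python
import re

import numpy as np
import pandas as pd

a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})
s = a["var1"]

# Two commas per string means three pieces per row.
pieces_per_row = (s.str.count(",") + 1).tolist()
print(pieces_per_row)  # [3, 3]

# Joining on the separator and splitting once yields all pieces in row order.
flat = ",".join(s).split(",")
print(flat)  # ['a', 'b', 'c', 'd', 'e', 'f']

# Row positions repeated once per piece.
positions = np.arange(len(s)).repeat(pieces_per_row)
print(positions.tolist())  # [0, 0, 0, 1, 1, 1]

# Assumption for illustration: a pipe-separated Series. The pattern is
# escaped because str.count treats it as a regular expression.
piped = pd.Series(["a|b|c", "d"])
piped_counts = (piped.str.count(re.escape("|")) + 1).tolist()
print(piped_counts)  # [3, 1]
```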
    

    For Lists

    Similar to the string case, except I don't need to count occurrences of sep because the data is already split.

    I use Numpy's concatenate to jam the lists together.

    import pandas as pd
    import numpy as np
    
    def explode_list(df, col):
        s = df[col]
        i = np.arange(len(s)).repeat(s.str.len())
        return df.iloc[i].assign(**{col: np.concatenate(s)})
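
    A sketch of the two intermediate values, using a hypothetical frame with lists of unequal length. It passes a plain list of lists to np.concatenate (via tolist()) rather than the Series itself, which is a small deviation from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose var1 column already holds lists.
d = pd.DataFrame({"var1": [["a", "b", "c"], ["d", "e"]], "var2": [1, 2]})
s = d["var1"]

# .str.len() also works on list elements and gives the repeat counts.
lengths = s.str.len().tolist()
print(lengths)  # [3, 2]

# np.concatenate flattens the lists into one array, in row order.
flat = np.concatenate(s.tolist())
print(flat.tolist())  # ['a', 'b', 'c', 'd', 'e']
```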
    

  • 2020-11-21 05:12

    Similar question as: pandas: How do I split text in a column into multiple rows?

    You could do:

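    >>> import pandas as pd
    >>> a = pd.DataFrame({"var1": "a,b,c d,e,f".split(), "var2": [1, 2]})
    >>> s = a.var1.str.split(",").apply(pd.Series).stack()
    >>> s.index = s.index.droplevel(-1)  # drop the inner level so s aligns with a's index
    >>> s.name = 'var1'  # join() requires a named Series
    >>> del a['var1']
    >>> a.join(s)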
       var2 var1
    0     1    a
    0     1    b
    0     1    c
    1     2    d
    1     2    e
    1     2    f
    
  • 2020-11-21 05:12

    One-liner using split(___, expand=True) and the level and name arguments to reset_index():

    >>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
    >>> b
       var2 var1
    0     1    a
    1     1    b
    2     1    c
    0     2    d
    1     2    e
    2     2    f
    

    If you need b to look exactly like in the question, you can additionally do:

    >>> b = b.reset_index(drop=True)[['var1', 'var2']]
    >>> b
      var1  var2
    0    a     1
    1    b     1
    2    c     1
    3    d     2
    4    e     2
    5    f     2
    
  • 2020-11-21 05:13

    It is possible to split and explode the dataframe without changing its structure.

    Split and expand data of specific columns

    Input:

        var1    var2
    0   a,b,c   1
    1   d,e,f   2
    
    
    
    # Split var1 into lists, then give each list element its own row (DataFrame.explode needs pandas 0.25+)
    df['var1'] = df['var1'].str.split(',')
    df = df.explode('var1')
    

    Out:

      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
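
    The snippet above overwrites df. If the original frame should stay untouched, one possible variant (a sketch, not part of the answer) does the split inside assign() and renumbers the rows afterwards:

```python
import pandas as pd

df = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# assign() returns a new frame, so df itself is not modified.
out = df.assign(var1=df["var1"].str.split(",")).explode("var1")
exploded_index = out.index.tolist()
print(exploded_index)  # [0, 0, 0, 1, 1, 1]

# Renumber the rows if the repeated index is not wanted.
out = out.reset_index(drop=True)
print(out["var1"].tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']
print(out["var2"].tolist())  # [1, 1, 1, 2, 2, 2]
```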
    

    Edit-1

    Split and Expand of rows for Multiple columns

      Filename                                                RGB   RGB_type
    0        A  [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...  [r, g, b]
    1        B  [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...  [r, g, b]
    

    Re-index on the reference column so that each row repeats once per RGB_type entry, then align the column values:

    df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
    df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
    df.reset_index(drop=True).ffill()
    

    Out:

                    Filename    RGB_type    Top 1 colour    Top 1 frequency Top 2 colour    Top 2 frequency
        Filename                            
     A  0       A   r   0   1650    6   39
        1       A   g   0   1691    1   59
        2       A   b   50  1402    49  187
     B  0       B   r   0   1423    16  38
        1       B   g   0   1445    16  46
        2       B   b   0   1419    16  39
    
  • 2020-11-21 05:15

    By combining bits and pieces from the solutions on this page, I ended up with the function below, for anyone who needs something to use right away. The parameters are key (the column that holds the delimiter-separated string) and df (the input dataframe). If your delimiter is not a semicolon ";", replace it in the call to split.

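    def split_df_rows_for_semicolon_separated_key(key, df):
        # Index by every column except `key`, split `key` on ';', and stack the pieces into rows
        other_cols = df.columns.drop(key).tolist()
        out = df.set_index(other_cols)[key].str.split(';', expand=True).stack()
        out = out.reset_index().rename(columns={0: key}).loc[:, df.columns]
        # Drop empty strings left behind by leading, trailing, or doubled separators
        return out[out[key] != '']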
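
    As a usage sketch, here is a hypothetical variant with the separator as a parameter. The name split_df_rows, the sep argument, the extra dropna step, and the sample columns tags and id are all assumptions made for illustration. The dropna is there because expand=True pads shorter rows with missing values, and whether stack() discards them can depend on the pandas version.

```python
import pandas as pd

def split_df_rows(df, key, sep=";"):
    # Hypothetical generalisation: same approach, separator passed in.
    other_cols = df.columns.drop(key).tolist()
    pieces = df.set_index(other_cols)[key].str.split(sep, expand=True)
    out = pieces.stack().reset_index().rename(columns={0: key})
    out = out.loc[:, df.columns]
    # Remove padding left by expand=True (if stack kept it) and empty pieces.
    out = out.dropna(subset=[key])
    return out[out[key] != ""].reset_index(drop=True)

# Made-up sample data: a trailing separator and rows of unequal length.
df = pd.DataFrame({"tags": ["x;y;", "z"], "id": [10, 20]})
result = split_df_rows(df, "tags")

print(result["tags"].tolist())  # ['x', 'y', 'z']
print(result["id"].tolist())    # [10, 10, 20]
```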
    