How to split/expand a string value into several pandas DataFrame rows?

盖世英雄少女心 2020-11-27 22:30

Let's say my DataFrame df is created like this:

df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
                  "genres" :


        
3 Answers
  • 2020-11-27 23:01

    Since pandas 0.25.0 there is a native method for this: DataFrame.explode.

    It moves each element of a list into its own row and repeats the values of the other columns.

    So we first call Series.str.split on the column to turn each string into a list of elements.

    >>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')
    
            title     genres
    0  Robin Hood     Action
    0  Robin Hood  Adventure
    1  Madagaskar     Family
    1  Madagaskar  Animation
    1  Madagaskar     Comedy
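
A self-contained sketch of the same idea follows. The question's DataFrame constructor is cut off above, so the genre strings used here are an assumption inferred from the output. The trailing reset_index(drop=True) is an optional addition, not part of the answer: it replaces the repeated index labels (0, 0, 1, 1, 1) with a fresh 0..4 range.

```python
import pandas as pd

# Assumed sample data: the original snippet is truncated, so these genre
# strings are inferred from the output shown in the answers.
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

exploded = (
    df.assign(genres=df["genres"].str.split(r",\s*"))  # string -> list of genres
      .explode("genres")                                # one row per list element
      .reset_index(drop=True)                           # optional: fresh 0..n-1 index
)
print(exploded)
```

Splitting on the pattern r",\s*" rather than on ', ' should also cope with input such as "Action,Adventure", where no space follows the comma.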
    
  • 2020-11-27 23:11

    You can use np.repeat with numpy.concatenate for flattening.

    splitted = df['genres'].str.split(r',\s*')
    l = splitted.str.len()
    
    df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                        'genres': np.concatenate(splitted.values)}, columns=['title','genres'])
    print(df1)
            title      genres
    0  Robin Hood      Action
    1  Robin Hood   Adventure
    2  Madagaskar      Family
    3  Madagaskar   Animation
    4  Madagaskar      Comedy
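
The repeat-and-concatenate idea can be wrapped in a small function that repeats every other column, not only title. This is a sketch only: split_rows is a hypothetical helper name, the sample data is inferred from the output above, and the code assumes the column being split contains no missing values.

```python
import numpy as np
import pandas as pd


def split_rows(frame, column, pattern=r",\s*"):
    """Hypothetical helper: one output row per piece of frame[column]."""
    pieces = frame[column].str.split(pattern)   # Series of lists
    counts = pieces.str.len().to_numpy()        # pieces per original row
    # Repeat every other column so each piece keeps its row's values.
    data = {
        name: np.repeat(frame[name].to_numpy(), counts)
        for name in frame.columns
        if name != column
    }
    # Flatten the lists into one long array, in row order.
    data[column] = np.concatenate(pieces.to_numpy())
    return pd.DataFrame(data, columns=frame.columns)


# Assumed sample data, inferred from the outputs shown in the answers.
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})
out = split_rows(df, "genres")
print(out)
```

np.repeat and np.concatenate both walk the rows in the same order, which is what keeps each genre aligned with its title.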
    

    Timings:

    df = pd.concat([df]*100000).reset_index(drop=True)
    
    In [95]: %%timeit
        ...: splitted = df['genres'].str.split(r',\s*')
        ...: l = splitted.str.len()
        ...: 
        ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
        ...:                      'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
        ...: 
        ...: 
    1 loop, best of 3: 709 ms per loop
    
    In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
    1 loop, best of 3: 750 ms per loop
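
Outside IPython, a comparable measurement can be sketched with the standard timeit module. The function names and sample data below are assumptions made for illustration, and with_explode is an extra candidate that was not part of the timings above. No results are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data, inferred from the outputs shown in the answers.
base = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})
# 1,000 copies keeps the run short; the timings above used 100,000.
big = pd.concat([base] * 1000, ignore_index=True)


def with_repeat(frame):
    parts = frame["genres"].str.split(r",\s*")
    counts = parts.str.len().to_numpy()
    return pd.DataFrame({
        "title": np.repeat(frame["title"].to_numpy(), counts),
        "genres": np.concatenate(parts.to_numpy()),
    })


def with_stack(frame):
    return (
        frame.set_index("title")["genres"]
        .str.split(r",\s*", expand=True)
        .stack()
        .dropna()  # some pandas versions keep the padding cells when stacking
        .reset_index(name="genres")
        .drop(columns="level_1")
    )


def with_explode(frame):
    return (
        frame.assign(genres=frame["genres"].str.split(r",\s*"))
        .explode("genres")
        .reset_index(drop=True)
    )


for fn in (with_repeat, with_stack, with_explode):
    # Best of three single runs, reported in milliseconds.
    best = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```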
    
  • 2020-11-27 23:15
    In [33]: (df.set_index('title')
                ['genres'].str.split(r',\s*', expand=True)
                .stack()
                .reset_index(name='genre')
                .drop(columns='level_1'))
    Out[33]:
            title      genre
    0  Robin Hood     Action
    1  Robin Hood  Adventure
    2  Madagaskar     Family
    3  Madagaskar  Animation
    4  Madagaskar     Comedy
    

    PS: here you can find a more generic approach.
