Lengthening a DataFrame based on stacking columns within it in Pandas

走了就别回头了 2021-01-23 04:30

I am looking for a function that achieves the following. It is best shown in an example. Consider:

pd.DataFrame([ [1, 2, 3 ], [4, 5, np.nan ]], columns=['x', ...


        
3 Answers
  • 2021-01-23 05:18

    Repeat each value in the first column once for every non-null value in the rest of its row, then build the final DataFrame from those repeated values and the non-null values of the remaining columns. DataFrame.count(axis=1) gives the non-null count per row, and numpy.repeat() repeats each element of an array according to a matching array of counts.

    >>> rest = df.loc[:,'y1':]
    >>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                      'y': rest.values[rest.notna()]})
    

    Demo:

    >>> df
        x   y1   y2   y3   y4
    0   1  2.0  3.0  NaN  6.0
    1   4  5.0  NaN  9.0  3.0
    2  10  NaN  NaN  NaN  NaN
    3   9  NaN  NaN  6.0  NaN
    4   7  6.0  NaN  NaN  NaN
    
    >>> rest = df.loc[:,'y1':]
    >>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                      'y': rest.values[rest.notna()]})
       x    y
    0  1  2.0
    1  1  3.0
    2  1  6.0
    3  4  5.0
    4  4  9.0
    5  4  3.0
    6  9  6.0
    7  7  6.0
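
    For anyone who wants to run the demo end to end, here is a minimal self-contained sketch of the same idea. The frame is copied from the demo above; the variable names counts, mask and long_df are mine, and the mask is converted to a plain NumPy array before indexing, which is a small precaution rather than something the original snippet does.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the demo above.
df = pd.DataFrame(
    {
        "x": [1, 4, 10, 9, 7],
        "y1": [2, 5, np.nan, np.nan, 6],
        "y2": [3, np.nan, np.nan, np.nan, np.nan],
        "y3": [np.nan, 9, np.nan, 6, np.nan],
        "y4": [6, 3, np.nan, np.nan, np.nan],
    }
)

rest = df.loc[:, "y1":]                 # every column from y1 to the end
counts = rest.count(axis=1).to_numpy()  # non-null values per row
mask = rest.notna().to_numpy()          # plain boolean array, same shape as rest

long_df = pd.DataFrame(
    {
        # each x repeated once per non-null y; a row with no y values disappears
        "x": np.repeat(df["x"].to_numpy(), counts),
        # boolean indexing flattens row by row, so values line up with x
        "y": rest.to_numpy()[mask],
    }
)
print(long_df)
```

    The row with x = 10 has no non-null y values, so its repeat count is 0 and it does not appear in the result, as in the demo output.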
    
  • 2021-01-23 05:20

    Here's one based on NumPy, as you were looking for performance -

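    import numpy as np
    import pandas as pd

    def gather_columns(df):
        # Boolean mask selecting the columns whose names start with 'y'
        col_mask = [i.startswith('y') for i in df.columns]
        ally_vals = df.iloc[:,col_mask].values
        y_valid_mask = ~np.isnan(ally_vals)  # True where a y value is present
    
        # Repeat each x once per valid y value in its row
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        y_vals = ally_vals[y_valid_mask]  # valid values in row-major order
        return pd.DataFrame({'x':x_vals, 'y':y_vals})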
    

    Sample run -

    In [78]: df #(added more cols for variety)
    Out[78]: 
       x  y1   y2   y5   y7
    0  1   2  3.0  NaN  NaN
    1  4   5  NaN  6.0  7.0
    
    In [79]: gather_columns(df)
    Out[79]: 
       x    y
    0  1  2.0
    1  1  3.0
    2  4  5.0
    3  4  6.0
    4  4  7.0
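
    To see why this works, it can help to look at the intermediate arrays for the sample frame. The sketch below rebuilds that frame from the printed output above and repeats the steps of gather_columns outside the function; it uses df.columns.str.startswith for the column mask, which is my substitution for the list comprehension.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the run above.
df = pd.DataFrame(
    {
        "x": [1, 4],
        "y1": [2, 5],
        "y2": [3.0, np.nan],
        "y5": [np.nan, 6.0],
        "y7": [np.nan, 7.0],
    }
)

col_mask = df.columns.str.startswith("y")   # False for x, True for the y columns
ally_vals = df.loc[:, col_mask].to_numpy()  # 2 x 4 float array of all y values
y_valid_mask = ~np.isnan(ally_vals)         # True where a y value is present

reps = np.count_nonzero(y_valid_mask, axis=1)
x_vals = np.repeat(df["x"].to_numpy(), reps)
y_vals = ally_vals[y_valid_mask]

print(reps)    # [2 3]
print(x_vals)  # [1 1 4 4 4]
print(y_vals)  # [2. 3. 5. 6. 7.]
```

    Boolean indexing on a 2-D array returns the selected values in row-major order, which is what keeps y_vals aligned with the repeated x_vals.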
    

    If the y columns always run from the second column through to the last one, we can simply slice the dataframe by position instead of building a name mask, and get a further performance boost, like so -

    def gather_columns_v2(df):
        ally_vals = df.iloc[:,1:].values
        y_valid_mask = ~np.isnan(ally_vals)
    
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        y_vals = ally_vals[y_valid_mask]
        return pd.DataFrame({'x':x_vals, 'y':y_vals})
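
    If you want to check the two versions against each other on your own data, a rough harness could look like the sketch below. The frame size, the share of missing values and the repeat count are arbitrary choices of mine, and no timings are quoted here because they depend on the machine and the pandas/NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd


def gather_columns(df):
    col_mask = [c.startswith("y") for c in df.columns]
    ally_vals = df.iloc[:, col_mask].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    return pd.DataFrame({"x": np.repeat(df.x.values, reps),
                         "y": ally_vals[y_valid_mask]})


def gather_columns_v2(df):
    ally_vals = df.iloc[:, 1:].values
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    return pd.DataFrame({"x": np.repeat(df.x.values, reps),
                         "y": ally_vals[y_valid_mask]})


# Synthetic test frame (sizes are arbitrary, chosen only for illustration).
rng = np.random.default_rng(0)
n_rows, n_y = 10_000, 5
y = rng.random((n_rows, n_y))
y[rng.random((n_rows, n_y)) < 0.4] = np.nan  # knock out roughly 40% of values
df = pd.DataFrame(y, columns=[f"y{i}" for i in range(1, n_y + 1)])
df.insert(0, "x", np.arange(n_rows))

out1 = gather_columns(df)
out2 = gather_columns_v2(df)
same = out1.equals(out2)  # both versions should agree when x is the first column
print("identical results:", same)

for fn in (gather_columns, gather_columns_v2):
    seconds = timeit.timeit(lambda: fn(df), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```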
    
  • 2021-01-23 05:24

    You can use stack to get this done, i.e. (x comes back as float here because .values upcasts both columns to a common dtype)

    pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values,columns=['x','y'])
    
         x    y
    0  1.0  2.0
    1  1.0  3.0
    2  4.0  5.0
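
    A variant of the same stack idea that keeps x as an integer is sketched below. The question's example is cut off, so the column names y1 and y2 are an assumption on my part. The explicit dropna() is also my addition: how stack treats missing values differs between pandas versions, so it is worth checking against the version you have installed.

```python
import numpy as np
import pandas as pd

# Assumed sample frame: the names y1 and y2 are a guess, since the
# question's example is truncated.
df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=["x", "y1", "y2"])

long_df = (
    df.set_index("x")
      .stack()                 # one entry per (x, column) pair
      .dropna()                # explicit, in case stack() keeps NaN in your version
      .rename("y")             # name the stacked values
      .reset_index(level=0)    # move x back into a column
      .reset_index(drop=True)  # discard the leftover y1/y2 labels
)
print(long_df)
```

    Because the values never pass through a single mixed-dtype array, x keeps its integer dtype while y stays float.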
    