Lengthening a DataFrame based on stacking columns within it in Pandas

走了就别回头了 2021-01-23 04:30

I am looking for a function that achieves the following. It is best shown in an example. Consider:

pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', \


        
3 Answers
  •  广开言路
    2021-01-23 05:20

    Here's one based on NumPy, as you were looking for performance -

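    import numpy as np
    import pandas as pd

    def gather_columns(df):
        # Boolean mask selecting the columns whose names start with 'y'
        col_mask = [i.startswith('y') for i in df.columns]
        ally_vals = df.iloc[:,col_mask].values
        # True wherever a y value is present (not NaN)
        y_valid_mask = ~np.isnan(ally_vals)
    
        # Repeat each x once per valid y value in its row
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        # Boolean indexing flattens row by row, matching the repeated x values
        y_vals = ally_vals[y_valid_mask]
        return pd.DataFrame({'x':x_vals, 'y':y_vals})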
    

    Sample run -

    In [78]: df #(added more cols for variety)
    Out[78]: 
       x  y1   y2   y5   y7
    0  1   2  3.0  NaN  NaN
    1  4   5  NaN  6.0  7.0
    
    In [79]: gather_columns(df)
    Out[79]: 
       x    y
    0  1  2.0
    1  1  3.0
    2  4  5.0
    3  4  6.0
    4  4  7.0
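
    To see why the x and y values line up, here is a minimal sketch of the masking step on its own. The arrays x and y are hypothetical stand-ins for the x column and the y columns of the sample above, not part of the original function -

```python
import numpy as np

# Hypothetical stand-ins for the x column and the y columns of the sample frame.
x = np.array([1, 4])
y = np.array([[2.0, 3.0, np.nan, np.nan],
              [5.0, np.nan, 6.0, 7.0]])

valid = ~np.isnan(y)                     # True where a y value is present
reps = np.count_nonzero(valid, axis=1)   # number of valid y values in each row
x_rep = np.repeat(x, reps)               # each x repeated once per valid y in its row
y_flat = y[valid]                        # valid y values, flattened row by row

print(reps)    # [2 3]
print(x_rep)   # [1 1 4 4 4]
print(y_flat)  # [2. 3. 5. 6. 7.]
```

    Because boolean indexing walks the array row by row, the i-th entry of y_flat belongs to the same row as the i-th entry of x_rep, so no explicit loop or reshaping is needed.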
    

    If the y columns always run from the second column through the last, we can simply slice the dataframe and skip building the column mask, which gives a further performance boost, like so -

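    def gather_columns_v2(df):
        # Assumes column 0 is x and every remaining column is a y column
        ally_vals = df.iloc[:,1:].values
        y_valid_mask = ~np.isnan(ally_vals)
    
        # Repeat each x once per valid y value in its row
        reps = np.count_nonzero(y_valid_mask, axis=1)
        x_vals = np.repeat(df.x.values, reps)
        # Boolean indexing flattens row by row, matching the repeated x values
        y_vals = ally_vals[y_valid_mask]
        return pd.DataFrame({'x':x_vals, 'y':y_vals})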
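
    As a quick sanity check, the sketch below rebuilds the frame from the sample run and confirms that the name-based and slice-based selections give the same result. The helper _gather is a hypothetical name introduced here only to share the common steps; it is not part of the answer's code -

```python
import numpy as np
import pandas as pd

def _gather(x, ally_vals):
    # Hypothetical shared helper: the steps common to both versions.
    y_valid_mask = ~np.isnan(ally_vals)
    reps = np.count_nonzero(y_valid_mask, axis=1)
    return pd.DataFrame({'x': np.repeat(x, reps), 'y': ally_vals[y_valid_mask]})

def gather_columns(df):
    # Select the y columns by name prefix.
    col_mask = [c.startswith('y') for c in df.columns]
    return _gather(df['x'].to_numpy(), df.iloc[:, col_mask].to_numpy())

def gather_columns_v2(df):
    # Select the y columns by position: everything after the first column.
    return _gather(df['x'].to_numpy(), df.iloc[:, 1:].to_numpy())

# The frame shown in the sample run.
df = pd.DataFrame([[1, 2, 3, np.nan, np.nan],
                   [4, 5, np.nan, 6, 7]],
                  columns=['x', 'y1', 'y2', 'y5', 'y7'])

out1 = gather_columns(df)
out2 = gather_columns_v2(df)
print(out1.equals(out2))  # True
```

    Note that the slice-based version relies purely on column position, so it would silently pick up any non-y column placed after x.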
    
