pyspark's flatMap in pandas

南旧 2021-02-04 12:06

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sort         
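For reference, the semantics being asked about can be sketched in plain Python without Spark: apply a function that returns an iterable to each element, then concatenate the results into one flat sequence. The helper name `flat_map` and the lambda below are illustrative assumptions, not taken from the truncated example above.

```python
from itertools import chain


def flat_map(items, func):
    """Hypothetical helper: apply func to each item and flatten one level."""
    # chain.from_iterable concatenates the iterables returned by func.
    return list(chain.from_iterable(func(item) for item in items))


# Illustrative function (an assumption): each x expands to 1 .. x-1.
result = flat_map([2, 3, 4], lambda x: range(1, x))
print(result)
```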


        
3 Answers
  •  醉梦人生
    2021-02-04 13:03

    There's a hack. I often do something like this:

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    
    In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
    Out[3]:
    0     1
    1     3
    2     2
    3     4
    4   NaN
    5     5
    dtype: float64
    

    The NaN appears because the intermediate DataFrame is rectangular, so the shorter list is padded with NaN (which also forces the float64 dtype). For a lot of things you can just drop it:

    In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
    Out[4]:
    0    1
    1    3
    2    2
    3    4
    5    5
    dtype: float64
    

    This trick uses only pandas operations, so I would expect it to be reasonably efficient, though it may cope poorly with lists of very different lengths, since every row is padded to the length of the longest list.
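A different technique from the hack above, shown here only as a minimal sketch: pandas 0.25 and later provide `Series.explode`, which emits one row per list element. Unlike the `unstack` output above, which comes out column by column (1, 3, 2, 4, ...), it keeps the elements in row order and needs no NaN padding. The helper name `flat_map_series` is a hypothetical wrapper, not a pandas API.

```python
import pandas as pd


def flat_map_series(series, func):
    """Hypothetical helper: map func over a Series, then flatten one level."""
    # func must return a list-like for each element; explode() then
    # produces one row per item, and reset_index discards the repeated
    # original index labels.
    return series.map(func).explode().reset_index(drop=True)


df = pd.DataFrame({"x": [[1, 2], [3, 4, 5]]})

# Flatten the existing lists directly.
flat = df["x"].explode().reset_index(drop=True)
print(flat.tolist())

# flatMap-style: expand each value into a list, then flatten.
expanded = flat_map_series(pd.Series([1, 2, 3]), lambda v: [v, v * 10])
print(expanded.tolist())
```

Note that `explode` returns an object-dtype Series, so an `astype` call may be needed afterwards, and an empty list becomes a single NaN row rather than disappearing as it would with flatMap.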
