pyspark's flatMap in pandas

南旧 2021-02-04 12:06

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

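In other words, flatMap maps each element to an iterable and concatenates the results into one flat sequence. A rough plain-Python sketch of the same idea (the names here are illustrative, not from the question):

from itertools import chain

data = [2, 3, 4]
# map each element x to range(1, x), then flatten the mapped iterables
flat = sorted(chain.from_iterable(range(1, x) for x in data))
# flat == [1, 1, 1, 2, 2, 3]
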
3 Answers
  • 2021-02-04 12:44

    There are three steps to solve this:

    import pandas as pd
    df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    # expand lists into columns, stack into one column, drop the NaN padding
    df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
    # level_1 is the original row index; column 0 holds the flattened values
    df_new[['level_1', 0]]
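
    On the example frame this pairs each flattened value with its original row label in level_1, though the rows come out ordered by list position rather than grouped by row (output reconstructed for illustration):

       level_1    0
    0        0  1.0
    1        1  3.0
    2        0  2.0
    3        1  4.0
    5        1  5.0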
    

  • 2021-02-04 12:59

    I suspect that the answer is "no, not efficiently."

    Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    
    In [3]: df
    Out[3]: 
               x
    0     [1, 2]
    1  [3, 4, 5]
    

    And that you want something like the following:

        x
    0   1
    0   2
    1   3
    1   4
    1   5
    

    It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas supported this directly, it would probably only be able to operate at slow Python speeds rather than fast C speeds.

    Generally one does a bit of data munging before moving on to tabular computation.
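
    A minimal sketch of that normalize-first approach, assuming the same nested input as above (records and row are illustrative names):

    import pandas as pd

    nested = [[1, 2], [3, 4, 5]]
    # flatten in plain Python, remembering which row each value came from
    records = [(i, v) for i, row in enumerate(nested) for v in row]
    df = pd.DataFrame(records, columns=['row', 'x']).set_index('row')

    which produces exactly the repeated-index shape shown above.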

  • 2021-02-04 13:03

    There's a hack. I often do something like

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    
    In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
    Out[3]:
    0     1
    1     3
    2     2
    3     4
    4   NaN
    5     5
    dtype: float64
    

    The NaN appears because apply(pd.Series) pads the shorter list out to the width of the longest one, but for a lot of things you can just drop it:

    In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
    Out[4]:
    0    1
    1    3
    2    2
    3    4
    5    5
    dtype: float64
    

    This trick uses all-pandas code, so I would expect it to be reasonably efficient, though it may not handle lists of very different sizes well.
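
    Assuming pandas 0.25 or later is available, Series.explode does this flattening in a single call and keeps the original row index, with no NaN to drop:

    import pandas as pd

    df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
    # explode repeats each row's index once per list element
    df['x'].explode()
    # 0    1
    # 0    2
    # 1    3
    # 1    4
    # 1    5
    # Name: x, dtype: object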
