Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
There are three steps to solve this: expand each list into its own columns, unstack those columns into a single column, and drop the NaN padding.
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
# Expand lists to columns, stack them into one column, drop the NaN padding
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
df_new[['level_1', 0]]
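For reference, the last line pairs each flattened value with its original row label (level_1), so the result should look something like:
   level_1    0
0        0  1.0
1        1  3.0
2        0  2.0
3        1  4.0
5        1  5.0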
I suspect that the answer is "no, not efficiently."
Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df
Out[3]:
           x
0     [1, 2]
1  [3, 4, 5]
And that you want something like the following:
   x
0  1
0  2
1  3
1  4
1  5
It is far more typical to normalize your data in plain Python before you send it to Pandas. If Pandas did support this directly, it would probably only be able to operate at slow Python speeds rather than at fast C speeds.
Generally one does a bit of munging of data before one uses tabular computation.
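As a minimal sketch of that pre-normalization (the raw dict and column names here are just for illustration), you can flatten with a plain list comprehension and hand pandas already-tidy records:

import pandas as pd

raw = {'x': [[1, 2], [3, 4, 5]]}
# Flatten in plain Python first, keeping each value's originating row,
# then hand the already-tidy records to pandas
records = [(i, v) for i, sublist in enumerate(raw['x']) for v in sublist]
df = pd.DataFrame(records, columns=['row', 'x'])

This yields one (row, value) pair per element, the same shape as the flatMap output above, at the cost of a Python-level loop.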
There's a hack. I often do something like the following:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0    1.0
1    3.0
2    2.0
3    4.0
4    NaN
5    5.0
dtype: float64
The NaN is introduced because the intermediate DataFrame pads the shorter lists with NaN, and that padding survives the unstack. But for a lot of things you can just drop it:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1.0
1    3.0
2    2.0
3    4.0
5    5.0
dtype: float64
This trick stays entirely inside pandas, so I would expect it to be reasonably efficient, though it may not handle very differently sized lists well.
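If the lists hold plain numbers, another option (a sketch, not part of the original answer) is to flatten with NumPy and repeat the index to match, which keeps the heavy lifting at C speed and avoids the NaN padding entirely:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

lens = df['x'].map(len)                       # length of each list
flat = pd.Series(
    np.concatenate(df['x'].tolist()),         # all values, in row order
    index=np.repeat(df.index.values, lens),   # original row label per value
)

This reproduces the index/value layout shown above without building the intermediate wide DataFrame.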