How to group dataframe rows into list in pandas groupby

前端未结

关注

 12  2167

日久生厌 2020-11-21 04:56

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second col

12条回答

囚心锁ツ (楼主)

2020-11-21 05:52

Answer based on @EdChum's comment on his answer. Comment is this -

groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think

Let's first create a dataframe with 500k categories in first column and total df shape 20 million as mentioned in question.

df = pd.DataFrame(columns=['a', 'b']) df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str) df['b'] = list(range(20000000)) print(df.shape) df.head()

# Sort data by first column df.sort_values(by=['a'], ascending=True, inplace=True) df.reset_index(drop=True, inplace=True) # Create a temp column df['temp_idx'] = list(range(df.shape[0])) # Take all values of b in a separate list all_values_b = list(df.b.values) print(len(all_values_b))

# For each category in column a, find min and max indexes gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]}) gp_df.reset_index(inplace=True) gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max'] # Now create final list_b column, using min and max indexes for each category of a and filtering list of b. gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1) print(gp_df.shape) gp_df.head()

This above code takes 2 minutes for 20 million rows and 500k categories in first column.

0 讨论(0)

查看其它12个回答

发布评论:

提交评论

加载中...

验证码

看不清?

提交回复