Pandas, for each unique value in one column, get unique values in another column


I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following


3 Answers

春和景丽

2020-12-25 08:46

Using sacul's sample data:

df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]: 
          0    1
author          
a       sr1  sr2
b       sr2  NaN


心在旅途

2020-12-25 09:00
Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):
```
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...
```
Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:
```
group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
```
Result:
```
>>> df2
author
a    [sr1, sr2]
b         [sr2]
```
The author is the index, and the single column is the list of all subreddits they are active in (this is how I interpreted how you wanted your output, according to your description).
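An equivalent form selects the column before aggregating, so the lambda is not needed. This is a minimal sketch that assumes the same sample data as above; the name subs_by_author is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each author group.
subs_by_author = df.groupby('author')['subreddit'].unique()

# Each entry is an array of that author's distinct subreddits,
# in order of first appearance.
print(subs_by_author)
```

The result is still indexed by author, so it can be used wherever df2 is used in this answer.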

If you want each subreddit in a separate column, which might be more usable depending on what you plan to do with it, you can apply this afterwards:
```
df2 = df2.apply(pd.Series)
```
Result:
```
>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN
```
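If apply(pd.Series) turns out to be slow on a large frame, one alternative worth trying is to build the wide frame directly from the per-author lists. This is only a sketch on the same sample data; the names per_author and wide are illustrative, and the missing slot is filled with a null value rather than necessarily NaN:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author, as in Solution 1.
per_author = df.groupby('author')['subreddit'].unique()

# Build the wide frame from plain lists; shorter rows are padded with nulls.
wide = pd.DataFrame([list(v) for v in per_author], index=per_author.index)

print(wide)
```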
Solution 2: Iterate through dataframe

You can make a new dataframe with all unique authors:
```
df2 = pd.DataFrame({'author':df.author.unique()})
```
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
```
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]
```
This gives you the following (the order inside each list can vary, because a Python set is unordered):
```
>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
```
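If a stable order matters, the same table can be built without iterrows or set. The sketch below is one possible variant rather than part of the original answer: it drops duplicate author/subreddit pairs first, so each list keeps the order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

# Remove repeated (author, subreddit) pairs, then collect what is left per author.
df2 = (
    df.drop_duplicates(['author', 'subreddit'])
      .groupby('author')['subreddit']
      .apply(list)
      .reset_index(name='subreddits')
)

print(df2)
```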
一个人的身影

2020-12-25 09:02
Using the groupby agg() ("aggregate") function:

DataFrameGroupBy.agg(arg, *args, **kwargs): aggregate using one or more operations over the specified axis. arg is the function to use for aggregating the data. If a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.
```
import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3, 6, 9], 'colors': ['red', 'white', 'blue', 'red', 'white']}, columns=['numbers', 'colors'])
```
```
# Named aggregation (the nested-dict renaming form was removed in pandas 1.0)
df.groupby('colors')['numbers'].agg(unique=lambda x: set(x),
                                    nunique=lambda x: len(set(x)))
```
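Applied to the data from the question, the same idea might look like the sketch below. It assumes the author/subreddit sample frame used in the other answers and the keyword form of agg available in recent pandas; the name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'], 'subreddit': ['sr1', 'sr2', 'sr2']})

# One row per author: the set of distinct subreddits and how many there are.
summary = df.groupby('author')['subreddit'].agg(
    unique=lambda s: set(s),
    nunique='nunique',
)

print(summary)
```

Each keyword becomes a column name in the result, so no renaming step is needed afterwards.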