Pandas, for each unique value in one column, get unique values in another column

情话喂你 2020-12-25 08:01

I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following

3 answers
  • 2020-12-25 08:46

    Using sacul's sample data:

    df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
    Out[370]: 
              0    1
    author          
    a       sr1  sr2
    b       sr2  NaN
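
For readers unpacking the one-liner above, here is a sketch of the same steps spelled out, assuming the three-row sample frame from sacul's answer. The last step builds the wide table with the DataFrame constructor rather than apply(pd.Series), which can be slow on large frames. The names per_author and wide are illustrative only.

```python
import pandas as pd

# Sample data assumed from sacul's answer.
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Step 1: one array of distinct subreddits per author, in order of appearance.
per_author = df.groupby('author')['subreddit'].unique()

# Step 2: spread each array across columns; shorter rows are padded
# with missing values.
wide = pd.DataFrame([list(v) for v in per_author], index=per_author.index)
print(wide)
```

The result is still indexed by author, so wide.loc['a'] returns that author's subreddits.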
    
  • 2020-12-25 09:00

    Here are two strategies to do it. No doubt, there are other ways.

    Assuming your dataframe looks something like this (obviously with more columns):

    df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
    
    >>> df
      author subreddit
    0      a       sr1
    1      a       sr2
    2      b       sr2
    ...
    

    SOLUTION 1: groupby

    More straightforward than solution 2, and similar to your first attempt:

    group = df.groupby('author')
    
    df2 = group.apply(lambda x: x['subreddit'].unique())
    
    # Alternatively, same thing as a one liner:
    # df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
    

    Result:

    >>> df2
    author
    a    [sr1, sr2]
    b         [sr2]
    

    The author is the index, and each value is the array of all subreddits that author is active in (this is how I understood the output you described).

    If you want each subreddit in its own column, which may be more usable depending on what you do next, apply this afterwards:

    df2 = df2.apply(pd.Series)
    

    Result:

    >>> df2
              0    1
    author          
    a       sr1  sr2
    b       sr2  NaN
    

    SOLUTION 2: iterate through the dataframe

    You can make a new dataframe with all unique authors:

    df2 = pd.DataFrame({'author':df.author.unique()})
    

    And then just get the list of all unique subreddits they are active in, assigning it to a new column:

    df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
        for _, x in df2.iterrows()]
    

    This gives you this:

    >>> df2
      author  subreddits
    0      a  [sr2, sr1]
    1      b       [sr2]
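
Solution 2 goes through a set, so the order of subreddits inside each list is not guaranteed (note [sr2, sr1] above), and iterrows gets slow on large frames. A possible loop-free sketch that keeps first-appearance order, assuming the same column names; the extra duplicate row in the sample is made up for illustration:

```python
import pandas as pd

# Hypothetical sample: same columns as above, plus one repeated
# (author, subreddit) pair.
df = pd.DataFrame({'author': ['a', 'a', 'b', 'a'],
                   'subreddit': ['sr1', 'sr2', 'sr2', 'sr1']})

df2 = (
    df.drop_duplicates(['author', 'subreddit'])  # keep first row of each pair
      .groupby('author', sort=False)['subreddit']
      .agg(list)                                 # collect subreddits per author
      .reset_index(name='subreddits')
)
print(df2)
```

Dropping duplicates first means each list holds distinct values without needing a set.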
    
  • 2020-12-25 09:02

    Using the groupby.agg() "aggregate" function. From the documentation:

    DataFrameGroupBy.agg(arg, *args, **kwargs): aggregate using one or more operations over the specified axis. arg is the function to use for aggregating the data; if a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.

    import pandas as pd

    df = pd.DataFrame({'numbers': [1, 2, 3, 6, 9], 'colors': ['red', 'white', 'blue', 'red', 'white']}, columns=['numbers', 'colors'])

    # Nested-dict renaming inside agg() was removed in pandas 1.0;
    # named aggregation (pandas >= 0.25) produces the same two columns.
    df.groupby('colors')['numbers'].agg(unique=set, nunique='nunique')
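
Applied to the question's data, the same agg idea might look like the sketch below. It assumes the author/subreddit sample frame from the earlier answer and uses keyword-style named aggregation (available since pandas 0.25). The output column names subreddits and n_subreddits are made up.

```python
import pandas as pd

# Sample data assumed from the earlier answer.
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

summary = df.groupby('author')['subreddit'].agg(
    subreddits=lambda s: sorted(set(s)),  # distinct subreddits, sorted
    n_subreddits='nunique',               # count of distinct subreddits
)
print(summary)
```

This returns one row per author with both the list and its length, which avoids a second pass to count.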
    
