Pandas - Conditional Probability of a given specific b

后端未结


I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?

df.groupby('a').groupby(


                      
5 answers

        
                         				            
            
           
            
                              
                
              
              
                
                  广开言路        
                
              
                            
                2021-01-03 03:00
              
            
            
                                                                       
Answer:
This can be done with the pandas crosstab function. Assuming the DataFrame is called 'df' and has columns 'a' and 'b',
pd.crosstab(df.a, df.b, normalize='columns')
returns a DataFrame representing P(a | b).
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Explanation:
Consider the DataFrame:
import pandas as pd
df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b':['1', '2', '3', '4','5', '1', '2', '3']})

Looking at columns a and b
df[["a", "b"]]
We have
    a   b
0   x   1
1   x   2
2   x   3
3   y   4
4   y   5
5   y   1
6   y   2
7   z   3

Then
pd.crosstab(df.a, df.b)
returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b
b   1   2   3   4   5
a                   
x   1   1   1   0   0
y   1   1   0   1   1
z   0   0   1   0   0

We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)
pd.crosstab(df.a, df.b, normalize='columns')
This divides each column by its total, so each column of the returned DataFrame holds the conditional probabilities P(a | b=B) for one specific value B
b    1    2    3    4    5
a                         
x  0.5  0.5  0.5  0.0  0.0
y  0.5  0.5  0.0  1.0  1.0
z  0.0  0.0  0.5  0.0  0.0

Notice, the columns sum to 1.
If we would instead prefer to get P(b | a), we can normalize over the rows. The keyword value for this is 'index' ('rows' is not an accepted value)
pd.crosstab(df.a, df.b, normalize='index')
To get
b         1         2         3     4     5
a                                          
x  0.333333  0.333333  0.333333  0.00  0.00
y  0.250000  0.250000  0.000000  0.25  0.25
z  0.000000  0.000000  1.000000  0.00  0.00

Here each row holds the conditional probabilities P(b | a=A) for one specific value A. Notice that the rows sum to 1.
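To read off a single value such as P(a = 'x' | b = '1'), the normalized table can be indexed by label. This is a minimal sketch using the example frame from this answer; the names cond and p_x_given_1 are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Each column of cond holds P(a | b = that column's label).
cond = pd.crosstab(df['a'], df['b'], normalize='columns')

# The row label is the value of a, the column label is the value of b.
p_x_given_1 = cond.loc['x', '1']
print(p_x_given_1)  # 0.5

# Every column is a probability distribution over a, so each sums to 1.
col_sums = cond.sum(axis=0)
print(col_sums)
```

Note that the values of b in this frame are strings, so the column label is '1' rather than the integer 1.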
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暗喜        
                
              
                            
                2021-01-03 03:03
              
            
            
                                                                       
You could try this function:

import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Cross-tabulate the two Series; transpose=1 (the default) swaps rows and columns.
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Divide each column by its total to get P(row | column).
    out = table / table.sum()
    # Append the marginal probability of each row value as column 'p'.
    out['p'] = table.sum(axis=1) / table.sum().sum()
    return out


This returns the conditional probability P(row | column), plus an extra column 'p' holding the marginal probability of each row value.
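A possible way to call the function, sketched on the small 'a'/'b' frame used elsewhere in this thread. The function body below is a condensed restatement of the one in this answer, and the variable name result is only illustrative. With the default transpose=1, the second argument ends up on the rows, so passing (b, a) gives P(a | b).

```python
import pandas as pd

def conprob(pd1, pd2, transpose=1):
    # Condensed version of the answer's function: P(row | column) plus marginal 'p'.
    table = pd.crosstab(pd1, pd2) if transpose == 0 else pd.crosstab(pd2, pd1)
    out = table / table.sum()                         # normalize each column
    out['p'] = table.sum(axis=1) / table.sum().sum()  # marginal P(row)
    return out

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Rows are values of a, columns are values of b, so entries are P(a | b).
result = conprob(df['b'], df['a'])
print(result.loc['x', '1'])  # P(a='x' | b='1') = 0.5
print(result.loc['y', 'p'])  # marginal P(a='y') = 0.5
```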
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  刺人心        
                
              
                            
                2021-01-03 03:11
              
            
            
                                                                       
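You can pass a list to groupby to get the joint count of each (a, b) pair. Use size() rather than count() here: when the DataFrame has only the columns 'a' and 'b', count() has no remaining columns to count.

df.groupby(['a','b']).size()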
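Joint counts are not yet conditional probabilities; they still have to be divided by the total for each value of b. One way this might be done is sketched below on the 'a'/'b' example frame from this thread. The names joint and cond are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Joint counts for every (a, b) pair that occurs in the data.
joint = df.groupby(['a', 'b']).size()

# Divide each count by the total for its value of b to get P(a | b).
cond = joint / joint.groupby(level='b').transform('sum')

print(cond.loc[('x', '1')])  # 0.5
print(cond.loc[('y', '4')])  # 1.0
```

Pairs that never occur together are simply absent from the result rather than listed with probability 0, which is one difference from the crosstab approach.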

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2021-01-03 03:16
              
            
            
                                                                       
Consider the DataFrame that Maxymoo suggested:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C':np.random.randn(8), 'D':np.random.randn(8)})

df
     A      B         C         D
0  foo    one  0.229206 -1.899999
1  bar    one  0.174972  0.328746
2  foo    two -1.384699 -1.691151
3  bar  three -1.008328 -0.915467
4  foo    two -0.065298 -0.107240
5  bar    two  1.871916  0.798135
6  foo    one  1.589609 -1.682237
7  foo  three  2.292783  0.639595


Let's say we want the probability that y = foo given x = one, where y is column A and x is column B: P(y=foo|x=one) = ?

Approach 1: 

df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B         
one    foo    0.666667
       bar    0.333333
three  foo    0.500000
       bar    0.500000
two    foo    0.666667
       bar    0.333333
dtype: float64


So the answer is: 0.6667
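If only this one number is needed, it can also be computed without building the whole table, by filtering on the condition first. A minimal sketch, assuming the same A and B columns as above (the random C and D columns are left out because they play no role); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Keep only the rows where the condition B == 'one' holds.
given_one = df[df['B'] == 'one']

# The mean of a boolean Series is the fraction of True values,
# which here is P(A == 'foo' | B == 'one').
p_foo_given_one = (given_one['A'] == 'foo').mean()
print(round(p_foo_given_one, 4))  # 0.6667
```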

Approach 2:

Probability of x = one: 0.375

df['B'].value_counts()/df['B'].count()
one      0.375
two      0.375
three    0.250
dtype: float64


Probability of y = foo: 0.625

df['A'].value_counts()/df['A'].count()
foo    0.625
bar    0.375
dtype: float64


Probability of (x=one|y=foo): 0.4

df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A         
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64


Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
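The Bayes' rule arithmetic above can be checked in code. This is a sketch under the same assumptions as Approach 2 (y is column A, x is column B); the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

p_one = (df['B'] == 'one').mean()   # P(x=one) = 3/8
p_foo = (df['A'] == 'foo').mean()   # P(y=foo) = 5/8

# P(x=one | y=foo): among the rows with A == 'foo', the fraction with B == 'one'.
p_one_given_foo = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()  # 2/5

# Bayes' rule: P(y=foo | x=one) = P(x=one | y=foo) * P(y=foo) / P(x=one)
p_foo_given_one = p_one_given_foo * p_foo / p_one
print(round(p_foo_given_one, 4))  # 0.6667
```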
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  夕颜        
                
              
                            
                2021-01-03 03:17
              
            
            
                                                                       
To count how often each value of b occurs for each value of a, you would do

df.groupby('a').b.value_counts()


For example, create a DataFrame as below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B':['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C':np.random.randn(8), 'D':np.random.randn(8)})

     A      B         C         D
0  foo    one -1.565185 -0.465763
1  bar    one  2.499516 -0.941229
2  foo    two -0.091160  0.689009
3  bar  three  1.358780 -0.062026
4  foo    two -0.800881 -0.341930
5  bar    two -0.236498  0.198686
6  foo    one -0.590498  0.281307
7  foo  three -1.423079  0.424715


Then:

df.groupby('A')['B'].value_counts()

A
bar  one      1
     two      1
     three    1
foo  one      2
     two      2
     three    1


To convert this to a conditional probability, you need to divide by the total size of each group. 

You can either do it with another groupby:

df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()

A
bar  one      0.333333
     two      0.333333
     three    0.333333
foo  one      0.400000
     two      0.400000
     three    0.200000
dtype: float64


Or you can apply a lambda function to each group (shown here with the lowercase column names from the question):

df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))
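A shorter variant that should give the same numbers is to let value_counts do the division itself through its normalize parameter. This is a sketch on the A/B example frame from this answer (without the random C and D columns); the name cond is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# normalize=True divides each count by the size of its group,
# so the result is P(B | A) indexed by (A, B).
cond = df.groupby('A')['B'].value_counts(normalize=True)

print(cond.loc[('foo', 'one')])  # 0.4
print(cond.loc['bar'])           # each of one, two, three at 1/3
```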

                                                                        
                                                        
            
            
              
                