I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"? I tried:
df.groupby('a').groupby(
This is possible to do using the pandas crosstab function. Given the description of the problem, where the DataFrame is called df and has columns a and b:
pd.crosstab(df.a, df.b, normalize='columns')
will return a DataFrame representing P(a | b).
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Consider the DataFrame:
df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})
Looking at columns a and b
df[["a", "b"]]
We have
a b
0 x 1
1 x 2
2 x 3
3 y 4
4 y 5
5 y 1
6 y 2
7 z 3
Then
pd.crosstab(df.a, df.b)
returns the frequency table of df.a and df.b with the rows being values of df.a and the columns being values of df.b
b 1 2 3 4 5
a
x 1 1 1 0 0
y 1 1 0 1 1
z 0 0 1 0 0
We can instead use the normalize keyword to get the table of conditional probabilities P(a | b)
pd.crosstab(df.a, df.b, normalize='columns')
which normalizes over the columns, so each column of the returned DataFrame holds the conditional probabilities P(a | b=B) for a specific value of B:
b 1 2 3 4 5
a
x 0.5 0.5 0.5 0.0 0.0
y 0.5 0.5 0.0 1.0 1.0
z 0.0 0.0 0.5 0.0 0.0
Notice that the columns sum to 1.
If we instead prefer to get P(b | a), we can normalize over the rows (note that the keyword for row-wise normalization is 'index', not 'rows'):
pd.crosstab(df.a, df.b, normalize='index')
To get
b 1 2 3 4 5
a
x 0.333333 0.333333 0.333333 0.00 0.00
y 0.250000 0.250000 0.000000 0.25 0.25
z 0.000000 0.000000 1.000000 0.00 0.00
where the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice that the rows sum to 1.
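As a quick sanity check (a sketch using the same df), each conditioning slice should sum to 1:
pd.crosstab(df.a, df.b, normalize='columns').sum(axis=0)  # all 1.0: each column is a distribution over a
pd.crosstab(df.a, df.b, normalize='index').sum(axis=1)    # all 1.0: each row is a distribution over b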
You could try this function,
def conprob(pd1, pd2, transpose=1):
    # Build the contingency table; by default the pd1 values become the columns.
    if transpose == 0:
        table = pd.crosstab(pd1, pd2)
    else:
        table = pd.crosstab(pd2, pd1)
    # Normalize each column so it sums to 1: out.loc[r, c] = P(row=r | column=c).
    out = table / table.sum()
    # Append the marginal probability of each row value as an extra column 'p'.
    out['p'] = table.sum(axis=1) / table.values.sum()
    return out
This returns the conditional probability P(row | column), plus a column p holding the marginal probability of each row value.
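As a usage sketch, calling it with transpose=0 on the df from the crosstab answer above reproduces that column-normalized table:
probs = conprob(df.a, df.b, transpose=0)   # rows = a, columns = b, i.e. P(a | b)
# The numeric columns match pd.crosstab(df.a, df.b, normalize='columns');
# the extra column p holds P(a): x = 0.375, y = 0.5, z = 0.125.
print(probs)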
You can pass in a list to groupby:
df.groupby(['a','b']).count()
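That gives joint counts; to turn them into conditional probabilities P(b | a), divide each group's counts by its group total. A sketch with the lowercase a/b columns from the question (size() is used because count() tallies the remaining columns rather than group sizes):
counts = df.groupby(['a', 'b']).size()               # joint counts of (a, b) pairs
counts / counts.groupby(level='a').transform('sum')  # P(b | a)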
Consider the DataFrame that Maxymoo suggested:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
A B C D
0 foo one 0.229206 -1.899999
1 bar one 0.174972 0.328746
2 foo two -1.384699 -1.691151
3 bar three -1.008328 -0.915467
4 foo two -0.065298 -0.107240
5 bar two 1.871916 0.798135
6 foo one 1.589609 -1.682237
7 foo three 2.292783 0.639595
Let's assume we want to calculate the probability of y = foo given x = one, where y refers to column A and x to column B: P(y=foo | x=one) = ?
Approach 1:
df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
B
one foo 0.666667
bar 0.333333
three foo 0.500000
bar 0.500000
two foo 0.666667
bar 0.333333
dtype: float64
So the answer is: 0.6667
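To pull that number out programmatically rather than reading it off the table (a sketch; the (B, A) level order follows the output above):
probs = df.groupby('B')['A'].value_counts() / df.groupby('B')['A'].count()
probs.loc[('one', 'foo')]   # 0.666667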
Approach 2:
Probability of x = one: 0.375
df['B'].value_counts()/df['B'].count()
one 0.375
two 0.375
three 0.250
dtype: float64
Probability of y = foo: 0.625
df['A'].value_counts()/df['A'].count()
foo 0.625
bar 0.375
dtype: float64
Probability of x = one given y = foo: 0.4
df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
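The same arithmetic can be checked in code (a sketch recomputing each term from the df above):
p_x = (df['B'] == 'one').mean()                                # P(x=one) = 0.375
p_y = (df['A'] == 'foo').mean()                                # P(y=foo) = 0.625
p_x_given_y = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()  # P(x=one | y=foo) = 0.4
print(p_x_given_y * p_y / p_x)                                 # 0.6667 = P(y=foo | x=one)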
To find the number of occurrences of each value of b for each value of a, you would do
df.groupby('a').b.value_counts()
For example, create a DataFrame as below:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
A B C D
0 foo one -1.565185 -0.465763
1 bar one 2.499516 -0.941229
2 foo two -0.091160 0.689009
3 bar three 1.358780 -0.062026
4 foo two -0.800881 -0.341930
5 bar two -0.236498 0.198686
6 foo one -0.590498 0.281307
7 foo three -1.423079 0.424715
Then:
df.groupby('A')['B'].value_counts()
A
bar one 1
two 1
three 1
foo one 2
two 2
three 1
To convert this to a conditional probability, you need to divide by the total size of each group.
You can either do it with another groupby:
df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
A
bar one 0.333333
two 0.333333
three 0.333333
foo one 0.400000
two 0.400000
three 0.200000
dtype: float64
Or you can apply a lambda
function onto the groups:
df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))
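Depending on your pandas version, value_counts can do the normalization itself, avoiding the second groupby (a sketch; check that your version's grouped value_counts accepts normalize=):
df.groupby('A')['B'].value_counts(normalize=True)
# .unstack() reshapes the result into a table of P(b | a) with one row per value of A
df.groupby('A')['B'].value_counts(normalize=True).unstack()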