I want to calculate the conditional probabilities of the ratings ('A', 'B', 'C') in the rating column.
company model rating type
0 ford mustang
You can use .groupby() and the built-in .div():
rating_probs = df.groupby('rating').size().div(len(df))
rating
A 0.333333
B 0.500000
C 0.166667
And the conditional probabilities of type given rating:
df.groupby(['type', 'rating']).size().div(len(df)).div(rating_probs, axis=0, level='rating')
coupe A 0.500000
B 0.333333
sedan A 0.500000
B 0.666667
C 1.000000
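As a cross-check on the table above, the same conditional distribution can be sketched with pd.crosstab. This is a different technique from the groupby/div approach in this answer. The DataFrame is assumed to match the data in the question, and the name cpt is only illustrative. With normalize='index', each row (one per rating) is divided by its row total, so pairs that never occur show up as 0.

```python
import pandas as pd

# Assumed to match the data in the question.
df = pd.DataFrame({
    "company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
    "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
    "rating": ["A", "B", "C", "A", "B", "B"],
    "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"],
})

# Rows are the conditioning variable (rating), columns are the outcome (type).
# normalize="index" divides each row by its total, giving P(type | rating).
cpt = pd.crosstab(df["rating"], df["type"], normalize="index")
print(cpt)
```

Each row of cpt should sum to 1, which is a quick way to confirm that the conditioning is on rating rather than on type.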
You need to add reindex to get 0 values for the missing pairs:
mux = pd.MultiIndex.from_product([df['rating'].unique(), df['type'].unique()])
s = (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
s = s.reindex(mux, fill_value=0)
print(s)
A coupe 0.500000
sedan 0.500000
B coupe 0.333333
sedan 0.666667
C coupe 0.000000
sedan 1.000000
Name: model, dtype: float64
Another solution, thanks to Zero:
s.unstack(fill_value=0).stack()
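To check that the two ways of filling in missing pairs agree, here is a minimal sketch. It assumes the question's data and builds s with size() and div(..., level='rating') rather than the count() division used above. The names via_reindex, via_unstack, same and sums are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed to match the data in the question.
df = pd.DataFrame({
    "company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
    "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
    "rating": ["A", "B", "C", "A", "B", "B"],
    "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"],
})

# P(type | rating): joint counts divided by the per-rating counts.
s = (df.groupby(["rating", "type"]).size()
       .div(df.groupby("rating").size(), level="rating"))

# Variant 1: reindex against every (rating, type) combination.
mux = pd.MultiIndex.from_product(
    [df["rating"].unique(), df["type"].unique()], names=["rating", "type"])
via_reindex = s.reindex(mux, fill_value=0).sort_index()

# Variant 2: pivot to wide form with gaps filled by 0, then back to long form.
via_unstack = s.unstack(fill_value=0).stack().sort_index()

# Both variants should agree, and each rating's probabilities should sum to 1.
same = np.allclose(via_reindex.to_numpy(), via_unstack.to_numpy())
sums = via_reindex.groupby(level="rating").sum()
print(same)
print(sums)
```

The unstack/stack variant needs no explicit MultiIndex, but it only fills pairs whose rating and type each appear somewhere in s. The reindex variant lets you supply categories that are absent from the data entirely.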
You can use groupby:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'company': ['ford', 'chevy', 'ford', 'ford', 'ford', 'toyota'],
'model': ['mustang', 'camaro', 'fiesta', 'focus', 'taurus', 'camry'],
'rating': ['A', 'B', 'C', 'A', 'B', 'B'],
'type': ['coupe', 'coupe', 'sedan', 'sedan', 'sedan', 'sedan']})
In [3]: df.groupby('rating').count()['model'] / len(df)
Out[3]:
rating
A 0.333333
B 0.500000
C 0.166667
Name: model, dtype: float64
In [4]: (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
Out[4]:
rating type
A coupe 0.500000
sedan 0.500000
B coupe 0.333333
sedan 0.666667
C sedan 1.000000
Name: model, dtype: float64
First, convert the data into a pandas DataFrame so that you can take advantage of pandas' groupby methods.
import pandas as pd

collection = {"company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
"model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
"rating": ["A", "B", "C", "A", "B", "B"],
"type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"]}
df = pd.DataFrame(collection)
Then, group by the conditioning event (i.e. rating) and divide the per-type counts by the group size.
df_s = df.groupby('rating')['type'].value_counts() / df.groupby('rating')['type'].count()
df_f = df_s.reset_index(name='cpt')
df_f.head() # your conditional probability table
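A possibly shorter variant of the same idea, offered as a sketch rather than a replacement: value_counts accepts normalize=True, which performs the division within each group. The data is assumed to be the same as in the question, and the names cpt and p_sedan_given_b are illustrative.

```python
import pandas as pd

# Assumed to match the data in the question.
df = pd.DataFrame({
    "company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
    "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
    "rating": ["A", "B", "C", "A", "B", "B"],
    "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"],
})

# normalize=True turns the within-group counts into proportions,
# i.e. P(type | rating), without a separate division step.
cpt = df.groupby("rating")["type"].value_counts(normalize=True)

# Look up a single entry, e.g. P(type = sedan | rating = B).
p_sedan_given_b = cpt.loc[("B", "sedan")]
print(cpt)
print(p_sedan_given_b)
```

Like the version above, this only lists pairs that actually occur, so an unobserved pair such as (C, coupe) is absent rather than 0.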