I am implementing the Gaussian Naive Bayes algorithm:
# importing modules
import pandas as pd
import numpy as np
# create an empty dataframe
data = pd.DataFrame()
# create our target variable
data["gender"] = ["male","male","male","male",
"female","female","female","female"]
# create our feature variables
data["height"] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data["weight"] = [180,190,170,165,100,150,130,150]
data["foot_size"] = [12,11,12,10,6,8,7,9]
# view the data
print(data)
# create an empty dataframe
person = pd.DataFrame()
# create some feature values for this single row
person["height"] = [6]
person["weight"] = [130]
person["foot_size"] = [8]
# view the data
print(person)
# Priors can be specified either as constants or as probability distributions.
# In our example, this is simply the probability of being a gender.
# calculating prior now
# number of males
n_male = data["gender"][data["gender"] == "male"].count()
# number of females
n_female = data["gender"][data["gender"] == "female"].count()
# total people
total_ppl = data["gender"].count()
print ("Male count =",n_male,"and Female count =",n_female)
print ("Total number of persons =",total_ppl)
# number of males divided by the total rows
p_male = n_male / total_ppl
# number of females divided by the total rows
p_female = n_female / total_ppl
print ("Probability of MALE =",p_male,"and FEMALE =",p_female)
# group the data by gender and calculate the means of each feature
data_means = data.groupby("gender").mean()
# view the values
data_means
# group the data by gender and calculate the variance of each feature
data_variance = data.groupby("gender").var()
# view the values
data_variance
# look up a single value: the variance of foot_size for males
data_variance["foot_size"][data_variance.index == "male"].values[0]
# means for male
male_height_mean=data_means["height"][data_means.index=="male"].values[0]
male_weight_mean=data_means["weight"][data_means.index=="male"].values[0]
male_footsize_mean=data_means["foot_size"][data_means.index=="male"].values[0]
print (male_height_mean,male_weight_mean,male_footsize_mean)
# means for female
female_height_mean=data_means["height"][data_means.index=="female"].values[0]
female_weight_mean=data_means["weight"][data_means.index=="female"].values[0]
female_footsize_mean=data_means["foot_size"][data_means.index=="female"].values[0]
print (female_height_mean,female_weight_mean,female_footsize_mean)
# variance for male
male_height_var=data_variance["height"][data_variance.index=="male"].values[0]
male_weight_var=data_variance["weight"][data_variance.index=="male"].values[0]
male_footsize_var=data_variance["foot_size"][data_variance.index=="male"].values[0]
print (male_height_var,male_weight_var,male_footsize_var)
# variance for female
female_height_var=data_variance["height"][data_variance.index=="female"].values[0]
female_weight_var=data_variance["weight"][data_variance.index=="female"].values[0]
female_footsize_var=data_variance["foot_size"][data_variance.index=="female"].values[0]
print (female_height_var,female_weight_var,female_footsize_var)
# create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):
    # input the arguments into a probability density function
    p = 1 / (np.sqrt(2 * np.pi * variance_y)) * \
        np.exp((-(x - mean_y) ** 2) / (2 * variance_y))
    # return p
    return p
# numerator of the posterior if the unclassified observation is a male
posterior_numerator_male = p_male * \
p_x_given_y(person["height"][0],male_height_mean,male_height_var) * \
p_x_given_y(person["weight"][0],male_weight_mean,male_weight_var) * \
p_x_given_y(person["foot_size"][0],male_footsize_mean,male_footsize_var)
# numerator of the posterior if the unclassified observation is a female
posterior_numerator_female = p_female * \
p_x_given_y(person["height"][0],female_height_mean,female_height_var) * \
p_x_given_y(person["weight"][0],female_weight_mean,female_weight_var) * \
p_x_given_y(person["foot_size"][0],female_footsize_mean,female_footsize_var)
print ("Numerator of Posterior MALE =",posterior_numerator_male)
print ("Numerator of Posterior FEMALE =",posterior_numerator_female)
if posterior_numerator_male >= posterior_numerator_female:
    print ("Predicted gender is MALE")
else:
    print ("Predicted gender is FEMALE")
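For cross-checking the numbers, the same computation can be sketched with only the standard library. This is a minimal sketch, not a replacement for the pandas version: the names `training`, `person_features`, `gaussian_density`, and `numerators` are made up for illustration, and it assumes the sample variance (dividing by n - 1), which is what `statistics.variance` computes and what pandas' `var()` uses by default.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)

# Same training data as above, stored per class as (height, weight, foot_size) rows.
training = {
    "male": [(6, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5, 100, 6), (5.5, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}
person_features = (6, 130, 8)  # height, weight, foot_size


def gaussian_density(x, mu, var):
    """Gaussian probability density evaluated at x (a density, not a probability)."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


total_rows = sum(len(rows) for rows in training.values())
numerators = {}
for label, rows in training.items():
    prior = len(rows) / total_rows
    likelihood = 1.0
    # zip(*rows) turns the list of rows into one column per feature
    for x, column in zip(person_features, zip(*rows)):
        likelihood *= gaussian_density(x, mean(column), variance(column))
    numerators[label] = prior * likelihood

# Pick the class with the largest posterior numerator
predicted = max(numerators, key=numerators.get)
print(numerators, "->", predicted)
```

Both versions should agree on the posterior numerators and on the predicted class, since they use the same priors, means, and variances.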
When calculating the probability, we use the Gaussian PDF:
$$ P(x) = \frac{1}{\sqrt {2 \pi {\sigma}^2}} e^{\frac{-(x- \mu)^2}{2 {\sigma}^2}} $$
My question: the equation above is a PDF, so it gives a density rather than a probability. To obtain a probability, we have to integrate it over an interval:
$$ \int_{x_0}^{x_1} P(x)\,dx $$
But the program above just plugs in the value of x and treats the result as the probability. Is that correct, and why? Most of the articles I have seen calculate the probability in the same manner.
If this is the wrong way to calculate the probability in the Naive Bayes classifier, then what is the correct method?
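To make the question concrete, here is a minimal sketch that compares the two quantities numerically. It integrates the Gaussian exactly over a narrow interval of width `width` centred on x (using the error function for the CDF) and compares the result with density times width. The parameter values are hypothetical, chosen to be roughly the height statistics from the data above, and all function names are illustrative. For a narrow interval the integral is approximately density times width, so when two classes are compared over the same interval the width cancels and only the densities matter.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of N(mean, variance) evaluated at x (not a probability)."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def gaussian_cdf(x, mean, variance):
    """P(X <= x) for X ~ N(mean, variance), via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / math.sqrt(2 * variance)))


def interval_probability(x, width, mean, variance):
    """Exact probability that X falls in [x - width/2, x + width/2]."""
    upper = gaussian_cdf(x + width / 2, mean, variance)
    lower = gaussian_cdf(x - width / 2, mean, variance)
    return upper - lower


# Hypothetical (mean, variance) pairs, roughly the height statistics above.
x = 6.0
male = (5.855, 0.035)
female = (5.4175, 0.0972)

# Ratio of the two densities at x: what the program above effectively compares
density_ratio = gaussian_pdf(x, *male) / gaussian_pdf(x, *female)

for width in (0.1, 0.01, 0.001):
    p_male_interval = interval_probability(x, width, *male)
    p_female_interval = interval_probability(x, width, *female)
    probability_ratio = p_male_interval / p_female_interval
    print(width, p_male_interval, width * gaussian_pdf(x, *male), probability_ratio)

print("density ratio:", density_ratio)
```

As the interval shrinks, the ratio of the two interval probabilities approaches the ratio of the two densities, which is why comparing density values can still rank the classes even though each value is not itself a probability.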
Source: https://stackoverflow.com/questions/57388105/how-to-calculate-probability-from-probability-density-function-in-the-naive-baye