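1. Background: we want to find out whether the ratings in the MovieLens 1M dataset follow a known distribution, and make the following hypotheses.
2. Theoretical derivation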
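A sketch of the estimators, inferred from the code in section 3 rather than from a written derivation. It assumes hypothesis 1 is a log-normal density and hypothesis 2 a normal density, both with parameters $\theta$ and $\sigma$, and that the $n$ ratings $x_1, \dots, x_n$ are treated as independent draws:

```latex
% Hypothesis 1 (assumed): log-normal density
f_1(x;\theta,\sigma) = \frac{1}{\sigma x \sqrt{2\pi}}
    \exp\!\left(-\frac{(\ln x - \theta)^2}{2\sigma^2}\right), \quad x > 0

% Log-likelihood of the sample under hypothesis 1
\ell_1(\theta,\sigma) = -\sum_{i=1}^{n} \ln x_i - n \ln \sigma
    - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (\ln x_i - \theta)^2

% Setting the partial derivatives to zero gives
\hat\theta = \frac{1}{n}\sum_{i=1}^{n} \ln x_i, \qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (\ln x_i - \hat\theta)^2

% Hypothesis 2 (assumed): normal density
f_2(x;\theta,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}
    \exp\!\left(-\frac{(x - \theta)^2}{2\sigma^2}\right)

% The same steps, with x_i in place of ln x_i, give
\hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\theta)^2
```

Both variance estimates divide by $n$ rather than $n-1$, which matches the `sum/(len(data))` in the code.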
3. Implementation
3.1 Data preparation
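# Import the required libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt

# Prepare the data: keep only the rating column (third field of each line).
# Note: the stock MovieLens 1M ratings.dat separates fields with "::";
# use split("::") unless the file has been converted to comma-separated form.
with open("ratings.dat") as file:
    data = []
    for line in file:
        if line.strip():  # skip blank lines (a raw line always contains "\n")
            data.append(int(line.split(",")[2]))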
Raw rating data vs. the extracted single-column rating feature
![Raw data](https://img-blog.csdnimg.cn/20191229143208394.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3preXhnczUxOA==,size_16,color_FFFFFF,t_70)
![Feature data after processing](https://img-blog.csdnimg.cn/20191229143303907.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3preXhnczUxOA==,size_16,color_FFFFFF,t_70)
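The loading step can also be wrapped in a small helper with a configurable delimiter. This is only a sketch: `read_ratings` and its `sep` parameter are hypothetical names, and the default `"::"` assumes the stock MovieLens 1M layout `UserID::MovieID::Rating::Timestamp` (pass `sep=","` for a comma-separated copy).

```python
import io


def read_ratings(lines, sep="::"):
    """Collect the third field of every non-blank line as an int."""
    ratings = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            ratings.append(int(line.split(sep)[2]))
    return ratings


# Illustrative rows in the assumed layout, with a blank line in between
sample = io.StringIO("1::1193::5::978300760\n\n1::661::3::978302109\n")
print(read_ratings(sample))  # prints [5, 3]
```

With the real file this would be called as `read_ratings(open("ratings.dat"))`, ideally inside a `with` block.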
# Compute theta: the mean of the log-ratings
def calTheta(data):
    total = 0
    for num in data:
        total += math.log(num)
    return total / len(data)

# Compute sigma: the standard deviation of the log-ratings (divides by n)
def calSigma(data, theta):
    total = 0
    for num in data:
        mid = math.log(num) - theta
        total += mid * mid
    return math.sqrt(total / len(data))

# Plot the fitted log-normal density over the rating range [1, 5]
def drawP1(sigma, theta):
    x = np.linspace(1, 5, 5000)
    y = np.array([
        1 / (sigma * i * math.sqrt(2 * math.pi))
        * math.exp(-math.pow(math.log(i) - theta, 2) / (2 * sigma * sigma))
        for i in x
    ])
    maxNum = y.max() * 1.2
    minNum = 0
    plt.plot(x, y)
    plt.ylim(minNum, maxNum)
    plt.show()
    print("sigma is {} theta is {}".format(sigma, theta))

# Main entry point
if __name__ == '__main__':
    npdata = np.array(data)
    theta = calTheta(data)
    sigma = calSigma(data, theta)
    drawP1(sigma, theta)
Output:
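As a cross-check, the two estimator functions above compute the mean and the population standard deviation of the log-ratings, so the standard library should reproduce them. The sketch below is an assumption-laden illustration: `lognormal_mle` is a hypothetical helper name and `toy` is made-up data, not the MovieLens ratings.

```python
import math
import statistics


def lognormal_mle(ratings):
    """Return (theta, sigma) estimated from the logs of the ratings."""
    logs = [math.log(r) for r in ratings]
    theta = statistics.fmean(logs)             # mean of ln(x)
    sigma = statistics.pstdev(logs, mu=theta)  # population std (divides by n)
    return theta, sigma


toy = [1, 2, 3, 4, 4, 5, 5, 3]  # made-up ratings for illustration
theta, sigma = lognormal_mle(toy)
print("sigma is {} theta is {}".format(sigma, theta))
```

`statistics.pstdev` is used rather than `statistics.stdev` because the latter divides by n - 1 and would not match the estimator in the code above.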
Verification of hypothesis 2:
# Compute theta: the mean of the ratings
def calTheta(data):
    total = 0
    for num in data:
        total += num
    return total / len(data)

# Compute sigma: the standard deviation of the ratings (divides by n)
def calSigma(data, theta):
    total = 0
    for num in data:
        mid = num - theta
        total += mid * mid
    return math.sqrt(total / len(data))

# Plot the fitted normal density over [-10, 10]
def drawP1(sigma, theta):
    x = np.linspace(-10, 10, 5000)
    y = np.array([
        1 / (sigma * math.sqrt(2 * math.pi))
        * math.exp(-math.pow(i - theta, 2) / (2 * sigma * sigma))
        for i in x
    ])
    maxNum = y.max() * 1.2
    minNum = 0
    plt.plot(x, y)
    plt.ylim(minNum, maxNum)
    plt.show()

# Main entry point
if __name__ == '__main__':
    # data = readFile(path)
    npdata = np.array(data)
    theta = calTheta(data)
    sigma = calSigma(data, theta)
    drawP1(sigma, theta)
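Output:
Analysis of the results
Because the two hypotheses assume different probability density functions, fitting them to the same data produces completely different results. In other words, the experiment shows that the accuracy of parameter estimation depends heavily on whether the assumed probability density is correct.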
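One way to put a number on this comparison is to evaluate the log-likelihood of the data under each fitted density; the larger value indicates the better fit among the two candidates. The sketch below is not part of the original experiment: the function names are hypothetical, `toy` is made-up data, and the integer ratings are treated as draws from a continuous density, as in the code above.

```python
import math


def fit(values):
    """Mean and population standard deviation (divides by n)."""
    n = len(values)
    theta = sum(values) / n
    sigma = math.sqrt(sum((v - theta) ** 2 for v in values) / n)
    return theta, sigma


def normal_loglik(values, theta, sigma):
    """Log-likelihood of values under a normal density."""
    n = len(values)
    squares = sum((v - theta) ** 2 for v in values)
    return (-n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi)
            - squares / (2 * sigma ** 2))


def lognormal_loglik(values, theta, sigma):
    """Log-likelihood under a log-normal density: the normal log-likelihood
    of ln(x) minus sum(ln(x)), which comes from the 1/x factor."""
    logs = [math.log(v) for v in values]
    return normal_loglik(logs, theta, sigma) - sum(logs)


toy = [1, 2, 3, 3, 4, 4, 4, 5, 5]  # made-up ratings for illustration
theta_n, sigma_n = fit(toy)
theta_l, sigma_l = fit([math.log(v) for v in toy])
ll_normal = normal_loglik(toy, theta_n, sigma_n)
ll_lognormal = lognormal_loglik(toy, theta_l, sigma_l)
print("normal: {:.3f}  log-normal: {:.3f}".format(ll_normal, ll_lognormal))
```

On the real data, `toy` would be replaced by the `data` list built in section 3.1.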
Reference:
Source: CSDN
Author: 自由的行走
Link: https://blog.csdn.net/zkyxgs518/article/details/103753460