Data Processing: Handling Missing Values

Submitted by 我是研究僧i on 2020-03-09 10:03:34

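Missing data mainly takes the form of missing records and missing field values. Either can significantly affect an analysis and make its results more uncertain.

Ways to handle missing values: delete the records / impute the data / leave them as they are

1. Deleting records

Checking for missing values: isnull, notnull

isnull: True for missing values, False otherwise

notnull: False for missing values, True otherwise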

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Jupyter/IPython magic; remove the next line when running as a plain script
%matplotlib inline

# Create the data
s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

print(s.isnull())              # on a Series: returns a boolean Series
print(df.notnull())            # on a DataFrame: returns a boolean DataFrame
print(df['value1'].notnull())  # on a single column selected by label
print('------')

# Keep only the non-missing values
s2 = s[s.notnull()]
# Keeps whole rows; df[df['value2'].notnull()]['value1'] would return
# only the value1 column of those rows
df2 = df[df['value2'].notnull()]
print(s2)
print(df2)
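Before deciding how to treat missing values, it usually helps to know how many there are and where they sit. The following is a minimal sketch of that step, reusing the same sample DataFrame; the variable names `na_count`, `na_share` and `rows_with_na` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

# isnull() gives a boolean frame; summing counts the True values per column
na_count = df.isnull().sum()
# The mean of a boolean column is the fraction of missing values
na_share = df.isnull().mean()
# Number of rows that contain at least one missing value
rows_with_na = int(df.isnull().any(axis=1).sum())

print(na_count)
print(na_share.round(2))
print('rows with at least one missing value:', rows_with_na)
```

Summing a boolean mask works because True counts as 1 and False as 0.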

Dropping missing values: dropna

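# Create the data
s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

s.dropna(inplace=True)        # modifies s itself
df2 = df['value1'].dropna()   # returns a new Series; df is unchanged
print(s)
print(df2)

# dropna works directly on both Series and DataFrame
# Note the inplace parameter: it defaults to False, which returns a new object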
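The example above only drops values from a Series. When `dropna` is called on a whole DataFrame, the `how` and `subset` parameters control which rows are removed. A minimal sketch on the same sample data follows; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

# Default (how='any'): drop every row that has at least one missing value
any_dropped = df.dropna()
# how='all': drop only rows in which every column is missing
all_dropped = df.dropna(how='all')
# subset: look at value1 only when deciding which rows to drop
subset_dropped = df.dropna(subset=['value1'])

print(len(df), len(any_dropped), len(all_dropped), len(subset_dropped))
```

Each call returns a new DataFrame, so `df` itself still contains all 11 rows afterwards.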

Filling/replacing missing data: fillna, replace

# Create the data
s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

s.fillna(0, inplace=True)
print(s)
print('------')
# s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# value: the fill value
# Note the inplace parameter

# Assign the result back: an inplace fill on a selected column may not update df.
# Older pandas spelled this fillna(method='pad'); recent releases deprecate `method`.
df['value1'] = df['value1'].ffill()
print(df)
print('------')
# Fill direction:
# pad / ffill → fill with the previous value
# backfill / bfill → fill with the next value

s = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 4, 5, np.nan, np.nan, 66, 54, np.nan, 99])
s.replace(np.nan, 'missing data', inplace=True)
print(s)
print('------')
# df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
# to_replace → the value(s) to be replaced
# value → the replacement value

s.replace([1, 2, 3], np.nan, inplace=True)
print(s)
# Several values can be replaced with np.nan at once
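Two variations on the filling step may be useful. `fillna` also accepts a dict that maps column names to fill values, and `ffill()` / `bfill()` are the method forms of the pad and backfill directions. The sketch below uses the same sample data; the fill value `'unknown'` and the result names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['a', 'b', 'c', 'd', 'e', np.nan, np.nan, 'f', 'g', np.nan, 'g']})

# A different fill value per column, passed as a dict
filled = df.fillna({'value1': 0, 'value2': 'unknown'})

# Forward fill: each gap takes the last valid value before it
s_ffill = s.ffill()
# Backward fill: each gap takes the next valid value after it
s_bfill = s.bfill()

print(filled)
print(s_ffill.tolist())
print(s_bfill.tolist())
```

All three calls return new objects and leave `s` and `df` untouched.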

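2. Imputing missing values

Several approaches: mean/median/mode imputation, neighboring-value imputation, interpolation

(1) Mean/median/mode imputation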

# Create the data
s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])
#print(s)
print('------')

# Compute the mean, median and mode
u = s.mean()      # mean
me = s.median()   # median
mod = s.mode()    # mode (a Series, because several values can tie)
print('Mean: %.2f, median: %.2f' % (u, me))
print('Mode:', mod.tolist())
print('------')

# Fill the missing values with the mean
s.fillna(u, inplace=True)
print(s)
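The same pattern works for the median and the mode. The sketch below, with illustrative variable names, fills the same Series three ways. It also shows one side effect worth keeping in mind: filling with the mean leaves the mean unchanged but shrinks the spread of the data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])

filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())
# mode() returns a Series because several values can tie; take the first one
mode_value = s.mode().iloc[0]
filled_mode = s.fillna(mode_value)

print('median: %.2f, mode: %.2f' % (s.median(), mode_value))
print('mean before / after mean fill: %.4f / %.4f' % (s.mean(), filled_mean.mean()))
print('std before / after mean fill:  %.4f / %.4f' % (s.std(), filled_mean.std()))
```

Which statistic to use depends on the data: the median is less sensitive to outliers such as the 12 in this sample, and the mode is the usual choice for categorical columns.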

(2) Neighboring-value imputation

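# Create the data
s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])
#print(s)
print('------')

# Fill each missing value with the value before it
# (equivalent to the older spelling fillna(method='ffill'))
s.ffill(inplace=True)
print(s)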
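With runs of consecutive missing values, a plain forward fill copies the same neighbor into the whole run. The `limit` parameter caps how many consecutive gaps are filled from one direction. This is a sketch under the assumption that filling at most one step from each side is acceptable; the names `limited` and `both` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])

# Forward fill, but at most one consecutive missing value per gap
limited = s.ffill(limit=1)
# Fill whatever is still missing from the following value
both = limited.bfill(limit=1)

print('still missing after ffill(limit=1):', int(limited.isnull().sum()))
print('still missing after the extra bfill:', int(both.isnull().sum()))
print(both.tolist())
```

In the two-value gap at positions 10 and 11, the first value is taken from the left neighbor and the second from the right neighbor.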

(3) Interpolation: Lagrange interpolation

from scipy.interpolate import lagrange

x = [3, 6, 9]
y = [10, 8, 4]
print(lagrange(x, y))
print(type(lagrange(x, y)))
# lagrange returns a polynomial (numpy.poly1d) with n coefficients
# Here there are 3 of them: a0, a1, a2
# y = a0 * x**2 + a1 * x + a2 → y = -0.11111111 * x**2 + 0.33333333 * x + 10

print('Interpolated value at x = 10: %.2f' % lagrange(x, y)(10))
print('------')
# -0.11111111*100 + 0.33333333*10 + 10 = -11.11111111 + 3.33333333 + 10 = 2.22222222
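To see where that value comes from, the Lagrange form can be evaluated by hand: the interpolant is the sum of each y value times a basis polynomial that equals 1 at its own x and 0 at every other x. The function below is a plain-Python sketch of that formula, written for illustration; `lagrange_eval` is a hypothetical helper, not a SciPy function.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        # Basis polynomial L_j(x): 1 at xj, 0 at every other node
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += yj * basis
    return total


xs = [3, 6, 9]
ys = [10, 8, 4]
value_at_10 = lagrange_eval(xs, ys, 10)
print('Interpolated value at x = 10: %.2f' % value_at_10)
```

The result agrees with the arithmetic in the comments above, and evaluating at any of the nodes 3, 6 or 9 returns the corresponding y value.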

(3) Interpolation: Lagrange interpolation in practice

from scipy.interpolate import lagrange

# Create the data
data = pd.Series(np.random.rand(100) * 100)
data[[3, 6, 33, 56, 45, 66, 67, 80, 90]] = np.nan  # a list of labels, not a tuple
print(data.head())
print('Total number of values: %i' % len(data))
print('------')

# Count the missing values
data_na = data[data.isnull()]
print('Number of missing values: %i' % len(data_na))
print('Share of missing values: %.2f%%' % (len(data_na) / len(data) * 100))

# Density plots to compare the distributions
data_c = data.fillna(data.median())  # fill missing values with the median
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
data.plot.box(ax=axes[0], grid=True, title='Data distribution')
data.plot(kind='kde', style='--r', ax=axes[1], grid=True, title='Missing values dropped', xlim=[-50, 150])
data_c.plot(kind='kde', style='--b', ax=axes[2], grid=True, title='Missing values filled with median', xlim=[-50, 150])

# Interpolation helper. Because of the data size, it uses the 5 values
# before and the 5 values after each missing value (10 in total).
def na_c(s, n, k=5):
    y = s.reindex(range(n - k, n + k + 1))  # window; labels outside the index become NaN
    y = y[y.notnull()]                      # drop the missing values
    return lagrange(list(y.index), list(y))(n)

# Interpolate the missing values one by one
na_re = []
for i in data[data.isnull()].index:
    data[i] = na_c(data, i)
    print(data[i])
    na_re.append(data[i])
data.dropna(inplace=True)  # drop any values that are still missing after interpolation
data.plot(kind='kde', style='--k', ax=axes[3], grid=True, title='After Lagrange interpolation', xlim=[-50, 150])
print('finished!')
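For comparison, pandas has a built-in `interpolate` method. The sketch below uses linear interpolation, which is a different technique from the Lagrange approach above: each gap is filled along the straight line between its nearest valid neighbors. It is shown here because a high-degree polynomial through 10 points can oscillate strongly, so a simpler baseline is a useful sanity check. The small sample Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a single gap and a two-value gap
sample = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation between the nearest valid neighbors
linear = sample.interpolate(method='linear')

print(linear.tolist())
```

The call returns a new Series and leaves `sample` unchanged.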
