目录
1.axis与skipna参数的使用
注意:np.nan
表示空值
# 建立数据集 import numpy as np import pandas as pd df = pd.DataFrame({'key1':[4,5,6,np.nan,2], # np.nan表示空值 'key2':[1,2,np.nan,15,10], 'key3':[1,2,3,'m','n'], 'key4':['a1','a2','a3','a4','a5']},index = ['a','b','c','d','e'] ) print(df) print('-----------------分割线-------------------') # 打印每列数据类型 print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype,df['key4'].dtype) # 求整体平均值(只会对数字列进行统计) print('----------------整体平均------------------') print(df.mean()) # 根据索引求平均值 print('key2列的平均值:',df['key2'].mean()) print('----------------axis参数------------------') # 单独统计一列的平均值,axis=0表示以列计算 axis=1表示以行计算,默认axis=0 m1 = df.mean(axis=1) print(m1) print('----------------skipna参数------------------') # skipna参数:是否忽略NaN,默认True,如False,有NaN的列统计结果仍未NaN m2 = df.mean(skipna=False) print(m2)
输出结果:
2.常用统计方法
其他随机方法列表如下:
- count:统计非Na值的数量
- min:统计最小值
- max:统计最大值
- quantile:统计分位数,参数q确定位置,例如:quantile(q=0.75)
- sum:求和
- median:求中位数
- std:标准差
- var:方差
- skew:偏度
- kurt:峰度
实战演练:
import numpy as np import pandas as pd df = pd.DataFrame({'key1':[1,2,3,4,25,6,7,8,9,10], 'key2':[10,20,30,300,50,60,70,80,90,100]}) print(df) print('-' * 20) # 求最大值 print("计数count:") print(df.count()) print('-' * 20) print("最小值min:") print(df.min()) print('-' * 20) print("最大值min:") print(df.max()) print('-' * 20) print("第三分位数quantile:") print(df.quantile(0.75)) print('-' * 20) print("求和sum:") print(df.sum()) print('-' * 20) print("求中位数median:") print(df.median()) print('-' * 20) print("求标准差std:") print(df.std()) print('-' * 20) print("求方差var:") print(df.var()) print('-' * 20) print("求偏度skew:") print(df.skew()) print('-' * 20) print("求峰度kurt:") print(df.kurt())
输出结果:
3.求累计值:cumsum(累计和),累计积
import numpy as np import pandas as pd df = pd.DataFrame({'key1':[1,2,3,4,25,6,7,8,9,10], 'key2':[10,20,30,300,50,60,70,80,90,100]}) print(df) print('-------------累计和&累计积-----------------') df['key1累计和'] = df['key1'].cumsum() df['key2累计和'] = df['key2'].cumsum() df['key1累计积'] = df['key1'].cumsum() df['key2累计积'] = df['key1'].cumsum() print(df)
输出结果:
4.唯一值:unique
import pandas as pd s = pd.Series(list('abdsssdsd')) print(s) # 求唯一值 sq = s.unique() print('-' * 50) print(sq,type(sq))
输出结果:
5.值计数:value_counts()
值计数主要针对新的Series,计算出不同值出现的频率
import pandas as pd s = pd.Series(list('abdsssdsd')) print(s) print('-' * 50) print(s.value_counts(sort = False)) # sort参数:排序,默认为True
输出结果:
6.成员资格:isin()
判断某个数据集的元素是否在另外一个数据集中,返回结果为布尔值。
s = pd.Series(np.arange(10,13)) df = pd.DataFrame({'key1':list('asdc'), 'key2':np.arange(4,8)}) print(s) print('-' * 50) print(df) print('-' * 50) print(s.isin([12,15])) print('-' * 50) print(df.isin(['a','bc','10',8]))
输出结果:
来源:https://www.cnblogs.com/OliverQin/p/12324252.html