pandas is a powerful Python data-analysis toolkit, built on top of NumPy.
Main features:
- Data structures with built-in data alignment: Series and DataFrame
- Integrated time-series functionality
- A rich set of mathematical operations
- Flexible handling of missing data
Install: pip install pandas
Import: import pandas as pd
Series: a one-dimensional data object
A Series is a one-dimensional, array-like object made up of a set of values and an associated set of data labels (its index).
Creation
In [206]: import pandas as pd
In [207]: pd.Series([4,7,-5,3])
Out[207]:
0 4
1 7
2 -5
3 3
dtype: int64
In [208]: pd.Series([4,7,-5,3], index=['a','b','c','d'])
Out[208]:
a 4
b 7
c -5
d 3
dtype: int64
In [209]: pd.Series({'a':1,'b':2})
Out[209]:
a 1
b 2
dtype: int64
In [210]: pd.Series(0, index=['a','b','c','d'])
Out[210]:
a 0
b 0
c 0
d 0
dtype: int64
Getting the value array and the index: the values and index attributes
In [211]: a = pd.Series([4,7,-5,3], index=['a','b','c','d'])
In [212]: a.values
Out[212]: array([ 4, 7, -5, 3], dtype=int64)
In [214]: a.index
Out[214]: Index(['a', 'b', 'c', 'd'], dtype='object')
A Series behaves like a cross between a list (array) and a dict.
Series: usage
Series supports array-like behavior:
- Arithmetic with a scalar: sr * 2
In [217]: sr
Out[217]:
a 4
b 7
c -5
d 3
dtype: int64
In [218]: sr * 2
Out[218]:
a 8
b 14
c -10
d 6
dtype: int64
- Arithmetic with another Series: sr1 + sr2. Values are added only where the labels match; labels present in just one operand are still added to the result.
In [221]: sr2 = pd.Series([1,2,3,4],index=['a','b','c','d'])
In [222]: sr + sr2
Out[222]:
a 5
b 9
c -2
d 7
dtype: int64
- Indexing: sr[0], sr[[1,2,4]]
In [224]: sr
Out[224]:
a 4
b 7
c -5
d 3
dtype: int64
In [225]: sr[0]
Out[225]: 4
In [226]: sr[[0,2,3]]
Out[226]:
a 4
c -5
d 3
dtype: int64
- Slicing: sr[:2]
In [227]: sr[:2]
Out[227]:
a 4
b 7
dtype: int64
- Universal functions: np.abs(sr)
In [228]: sr
Out[228]:
a 4
b 7
c -5
d 3
dtype: int64
In [229]: np.abs(sr)
Out[229]:
a 4
b 7
c 5
d 3
dtype: int64
- Boolean filtering
In [230]: sr
Out[230]:
a 4
b 7
c -5
d 3
dtype: int64
In [231]: sr[sr>0]
Out[231]:
a 4
b 7
d 3
dtype: int64
Series supports dict-like behavior (labels):
- Create a Series from a dict: pd.Series(dic)
In [232]: pd.Series({'a':1,'b':5})
Out[232]:
a 1
b 5
dtype: int64
- Label membership test: 'a' in sr. When iterating over a Series, the loop goes over the values by default.
In [233]: sr
Out[233]:
a 4
b 7
c -5
d 3
dtype: int64
In [234]: 'a' in sr
Out[234]: True
In [236]: for i in sr:
...: print(i)
4
7
-5
3
- Key indexing: sr['a'], sr[['a','b','d']]
In [237]: sr
Out[237]:
a 4
b 7
c -5
d 3
dtype: int64
In [238]: sr['a']
Out[238]: 4
In [239]: sr[['a','c']]
Out[239]:
a 4
c -5
dtype: int64
In [240]: sr['a':'c'] # with label slices, both the start and the end label are included
Out[240]:
a 4
b 7
c -5
dtype: int64
Series: integer indexes
If a Series has integer labels, label-based and position-based access are easy to confuse; plain indexing such as sr2[5] is interpreted as label-based by default.
In [40]: sr = pd.Series(np.arange(10))
In [41]: sr
Out[41]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
In [44]: sr2 = sr[5:].copy()
In [45]: sr2
Out[45]:
5 5
6 6
7 7
8 8
9 9
dtype: int32
In [46]: sr2[5] # looks up the label 5; if this were positional access, it would raise an error
Out[46]: 5
In [47]: sr2[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-47-3882bebf0859> in <module>()
----> 1 sr2[-1]
C:\python server\anaconda\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
621 key = com._apply_if_callable(key, self)
622 try:
--> 623 result = self.index.get_value(self, key)
624
625 if not is_scalar(result):
C:\python server\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
2558 try:
2559 return self._engine.get_value(s, k,
-> 2560 tz=getattr(series.dtype, 'tz', None))
2561 except KeyError as e1:
2562 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: -1
Solution: say explicitly which kind of access you want: loc is label-based access, iloc is position-based access.
In [48]: sr2.loc[5] # loc: label (key) based access
Out[48]: 5
In [50]: sr2.iloc[4] # iloc: position (integer offset) based access
Out[50]: 9
In [51]: sr2.iloc[-1]
Out[51]: 9
Series: data alignment
When pandas operates on two Series objects, it aligns them by index before computing.
If the two Series do not have exactly the same index, the index of the result is the union of the two operands' indexes.
If only one operand has a value at a given label, the result holds NaN (a missing value) at that label.
In [6]: sr1 = pd.Series([4,9,100],index=['a','b','c'])
In [7]: sr1
Out[7]:
a 4
b 9
c 100
dtype: int64
In [8]: sr2 = pd.Series([4,5,6],index=['b','c','d'])
In [9]: sr2
Out[9]:
b 4
c 5
d 6
dtype: int64
In [10]: sr1 + sr2
Out[10]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
What if we want to handle the missing values, for example keep 'a' as 4 and treat the side that has no value as 0?
Use the flexible arithmetic methods: add, sub, div, mul.
sr1 + sr2 is equivalent to sr1.add(sr2); the fill_value argument of these methods fills in the missing values before the operation.
In [10]: sr1 + sr2
Out[10]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
In [11]: sr1.add(sr2)
Out[11]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
In [12]: sr1.add(sr2,fill_value=0)
Out[12]:
a 4.0
b 13.0
c 105.0
d 6.0
dtype: float64
Series: missing data
Missing data is represented by NaN (Not a Number), whose value is np.nan; Python's built-in None is also treated as NaN.
Several methods help us handle missing values:
- dropna() drops entries whose value is NaN
- fillna() fills in missing data
- isnull() returns a boolean array, True where values are missing
- notnull() returns a boolean array, False where values are missing
Option one: drop the missing values
In [61]: sr = sr1 + sr2
In [62]: sr
Out[62]:
a 33.0
b NaN
c 32.0
d 45.0
dtype: float64
In [63]: sr.isnull() # is each value NaN?
Out[63]:
a False
b True
c False
d False
dtype: bool
In [64]: sr.notnull() # True where the value is not NaN; combined with boolean filtering this lets us drop the missing values
Out[64]:
a True
b False
c True
d True
dtype: bool
In [65]: sr[sr.notnull()] # filter out the missing values
Out[65]:
a 33.0
c 32.0
d 45.0
dtype: float64
In [66]: sr.dropna() # Series also provides a method that drops NaN values directly
Out[66]:
a 33.0
c 32.0
d 45.0
dtype: float64
Option two: fill in the missing values
In [68]: sr.fillna(0) # fill with 0; you could also fill with the mean, sr.mean(), since the mean skips NaN values
Out[68]:
a 33.0
b 0.0
c 32.0
d 45.0
dtype: float64
In [69]: sr = sr.fillna(0) # fillna does not modify the existing object, so reassign the result
DataFrame: a two-dimensional data object
A DataFrame is a tabular data structure with an ordered collection of columns. It can be thought of as a dict of Series that share a common index.
Creation
In [3]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]})
Out[3]:
one two
0 1 4
1 2 3
2 3 2
3 4 1
In [4]: pd.DataFrame({'one': pd.Series([1,2,3], index=['a','b','c']), 'two': pd.Series([1,2,3,4],index=['b','a','c','d'])})
Out[4]:
one two
a 1.0 2
b 2.0 1
c 3.0 3
d NaN 4 # missing entries come back as NaN
In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
DataFrame: common attributes
- index returns the index (row labels)
- values returns the value array
- columns returns the column index (column names)
In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [6]: df = _5
In [7]: df.index
Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [8]: df.values
Out[8]:
array([[1, 4],
[2, 3],
[3, 2],
[4, 1]], dtype=int64)
In [9]: df.columns
Out[9]: Index(['one', 'two'], dtype='object')
- T transposes the DataFrame, swapping rows and columns
In [10]: df
Out[10]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [11]: df.T
Out[11]:
a b c d
one 1 2 3 4
two 4 3 2 1
- describe() gives quick summary statistics for each column: count, mean, standard deviation, min, quartiles, median, max
In [13]: df.describe() # column-wise statistics
Out[13]:
one two
count 4.000000 4.000000 #count (NaN excluded)
mean 2.500000 2.500000 #mean
std 1.290994 1.290994 #standard deviation
min 1.000000 1.000000 #minimum
25% 1.750000 1.750000
50% 2.500000 2.500000 #median
75% 3.250000 3.250000
max 4.000000 4.000000 #maximum
DataFrame: indexing and slicing
A DataFrame is a two-dimensional data type, so it has both a row index and a column index.
Like Series, a DataFrame can be indexed and sliced either by label or by position.
Bracket indexing
With plain brackets you select the column first, then the row; you can select a column by itself, but not a row by itself.
In [18]: df
Out[18]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [19]: df['one']['a'] # bracket indexing: column first, then row
Out[19]: 1
In [20]: df['one'] # select a single column
Out[20]:
a 1
b 2
c 3
d 4
Name: one, dtype: int64
In [21]: df['a'] # raises KeyError: bracket indexing cannot select just a row, because the columns are the Series objects, not the rows
Row/column indexing with loc and iloc
- the loc attribute: label-based access (by row and column names)
- the iloc attribute: position-based access
Usage: separate the two selectors with a comma, the row selector first and the column selector second.
Each selector can be a plain index, a slice, a boolean index, or a fancy (list) index, in any combination; a boolean-indexing sketch follows the examples below.
In [22]: df
Out[22]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [23]: df.loc['a',] # loc does support selecting just a row
Out[23]:
one 1
two 4
Name: a, dtype: int64
In [24]: df.loc[['a','c'],] # fancy indexing is supported
Out[24]:
one two
a 1 4
c 3 2
In [25]: df.loc['a':'c','one']
Out[25]:
a 1
b 2
c 3
Name: one, dtype: int64
In [26]: df.iloc[0] # iloc also supports selecting just a row
Out[26]:
one 1
two 4
Name: a, dtype: int64
In [27]: df.iloc[0][1]
Out[27]: 4
In [28]: df.iloc[0,1]
Out[28]: 4
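A minimal boolean-indexing sketch with loc, reusing the same df as the examples above:
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]},
                  index=['a', 'b', 'c', 'd'])
# boolean row selector combined with a column label: rows where 'two' > 2, column 'one'
df.loc[df['two'] > 2, 'one']   # a -> 1, b -> 2
# the same boolean array also works as a plain filter on the whole frame
df[df['two'] > 2]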
DataFrame: data alignment and missing data
When DataFrame objects are combined in arithmetic, the data is aligned as well: the row indexes and the column indexes are each aligned separately.
In [29]: pd.DataFrame({'one':[1,2,3,4],'two':[4,5,6,7]}, index=['a','b','c','d'])
Out[29]:
one two
a 1 4
b 2 5
c 3 6
d 4 7
In [30]: df = _29
In [31]: df2 = pd.DataFrame({'two':[7,8,7,8],'one':[8,9,8,8]}, index=['a','c','d','b'])
In [32]: df2
Out[32]:
two one
a 7 8
c 8 9
d 7 8
b 8 8
In [33]: df + df2
Out[33]:
one two
a 9 11
b 10 13
c 12 14
d 12 14
Handling missing values, option one: fill them in with fillna()
In [35]: df.loc['e', 'one'] = np.nan
In [36]: df.loc['e', 'two'] = 10
In [37]: df.loc['f', 'one'] = np.nan
In [38]: df.loc['f', 'two'] = np.nan
In [39]: df
Out[39]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e NaN 10.0
f NaN NaN
In [40]: df.fillna(0)
Out[40]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e 0.0 10.0
f 0.0 0.0
Handling missing values, option two: drop them
- dropna(): axis chooses whether rows or columns are dropped (0, the default, drops rows; 1 drops columns); how chooses when to drop: 'any' drops a row/column containing any NaN, while 'all' drops it only when every value in the row/column is NaN
In [39]: df2.dropna() # the default is how='any'
Out[39]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
In [40]: df2.dropna(how='all') # drop only the rows where every column is NaN
Out[40]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
e NaN 10.0
In [41]: df2.dropna(how='any') # drop the rows containing any NaN
Out[41]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
In [42]: df.loc['a','one'] = np.nan
In [43]: df
Out[43]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [44]: df.dropna(axis=1) # drop the columns containing NaN
Out[44]:
two
a 4
b 5
c 6
d 7
- isnull() returns a boolean DataFrame, True where values are missing
- notnull() returns a boolean DataFrame, False where values are missing (see the sketch below)
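A minimal isnull/notnull sketch on a DataFrame, assuming a small frame with one missing entry (illustration only):
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [np.nan, 2.0, 3.0], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])
df.isnull()              # True at ('a', 'one'), False everywhere else
df.notnull()             # the element-wise negation of isnull()
df[df['one'].notnull()]  # keep only the rows where 'one' is present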
pandas: common methods
- mean(axis=0, skipna=True) computes the mean of each column (axis=0, the default) or of each row (axis=1); NaN values are skipped by default
- sum(axis=1) sums each row (axis=0, the default, sums each column)
In [45]: df
Out[45]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [46]: df.mean() # column means by default
Out[46]:
one 3.0
two 5.5
dtype: float64
In [47]: df.mean(axis=1) # row means
Out[47]:
a 4.0
b 3.5
c 4.5
d 5.5
dtype: float64
In [48]: df.sum() # column sums
Out[48]:
one 9.0
two 22.0
dtype: float64
- sort_index(axis=0, ascending=True) sorts by the row (or column) index; ascending=True sorts in ascending order, False in descending order
- sort_values(by, axis=0, ascending=True) sorts by the values of a column (or row); by names the column or row to sort by
In [49]: df.sort_values(by='two') # sort by a column, ascending
Out[49]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [50]: df.sort_values(by='two',ascending=False) # sort by a column, descending
Out[50]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [52]: df.sort_values(by='a',ascending=False,axis=1) # sort by a row, descending
Out[52]:
two one
a 4 NaN
b 5 2.0
c 6 3.0
d 7 4.0
In [53]: df.sort_values(by='one') # NaN values do not take part in sorting; they are placed last
Out[53]:
one two
b 2.0 5
c 3.0 6
d 4.0 7
a NaN 4
In [54]: df.sort_values(by='one',ascending=False)
Out[54]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [55]: df.sort_index() # sort by the row index, ascending
Out[55]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [56]: df.sort_index(ascending=False) # sort by the row index, descending
Out[56]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [57]: df.sort_index(axis=1) # sort by the column index
Out[57]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [58]: df.sort_index(ascending=False,axis=1)
Out[58]:
two one
a 4 NaN
b 5 2.0
c 6 3.0
d 7 4.0
Others (a short sketch follows this list)
- apply(func, axis=0) applies a custom function to each column (or each row); func may return a scalar or a Series
- applymap(func) applies a function to every element of a DataFrame
- map(func) applies a function to every element of a Series
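A minimal sketch of all three, assuming a small numeric frame (illustration only):
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])
df.apply(lambda col: col.max() - col.min())   # per column: one -> 2, two -> 2
df.apply(lambda row: row.sum(), axis=1)       # per row: a -> 5, b -> 7, c -> 9
df.applymap(lambda x: x * 10)                 # every element of the DataFrame
df['one'].map(lambda x: x + 100)              # every element of a Series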
pandas: working with time objects
Generating an array of time objects: date_range
- start: start time
- end: end time
- periods: number of periods
- freq: frequency; the default is 'D' (calendar day). Common codes include H (hour), W (week), B (business day), SM (semi-month), M (month end), T/min (minute), S (second), A (year end).
In [71]: pd.date_range('2018-01-01',periods=10)
Out[71]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10'],
dtype='datetime64[ns]', freq='D')
In [72]: pd.date_range('2018-01-01','2030-01-01',freq='A')
Out[72]:
DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31',
'2022-12-31', '2023-12-31', '2024-12-31', '2025-12-31',
'2026-12-31', '2027-12-31', '2028-12-31', '2029-12-31'],
dtype='datetime64[ns]', freq='A-DEC')
A time series is simply a Series or DataFrame indexed by time objects.
When datetime objects are used as an index, they are stored in a DatetimeIndex.
In [73]: sr = pd.Series(np.arange(20), index=pd.date_range('2018-01-01', periods=20))
In [74]: sr
Out[74]:
2018-01-01 0
2018-01-02 1
2018-01-03 2
2018-01-04 3
2018-01-05 4
2018-01-06 5
2018-01-07 6
2018-01-08 7
2018-01-09 8
2018-01-10 9
2018-01-11 10
2018-01-12 11
2018-01-13 12
2018-01-14 13
2018-01-15 14
2018-01-16 15
2018-01-17 16
2018-01-18 17
2018-01-19 18
2018-01-20 19
Freq: D, dtype: int32
In [75]: sr.index
Out[75]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
'2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
'2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20'],
dtype='datetime64[ns]', freq='D')
Special time-series features:
- passing a year ('YYYY') or year-month ('YYYY-MM') string as a slice
In [32]: sr = pd.Series(np.arange(1000),index=pd.date_range('2018-01-01',periods=1000))
In [33]: sr['2018-03'] # select one month of a given year
Out[33]:
2018-03-01 59
2018-03-02 60
2018-03-03 61
2018-03-04 62
2018-03-05 63
2018-03-06 64
2018-03-07 65
2018-03-08 66
2018-03-09 67
2018-03-10 68
2018-03-11 69
2018-03-12 70
2018-03-13 71
2018-03-14 72
2018-03-15 73
2018-03-16 74
2018-03-17 75
2018-03-18 76
2018-03-19 77
2018-03-20 78
2018-03-21 79
2018-03-22 80
2018-03-23 81
2018-03-24 82
2018-03-25 83
2018-03-26 84
2018-03-27 85
2018-03-28 86
2018-03-29 87
2018-03-30 88
2018-03-31 89
Freq: D, dtype: int32
In [35]: sr['2019'] # select a whole year
Out[35]:
2019-01-01 365
2019-01-02 366
2019-01-03 367
2019-01-04 368
2019-01-05 369
2019-01-06 370
2019-01-07 371
2019-01-08 372
2019-01-09 373
2019-01-10 374
2019-01-11 375
2019-01-12 376
2019-01-13 377
2019-01-14 378
2019-01-15 379
2019-01-16 380
2019-01-17 381
2019-01-18 382
2019-01-19 383
2019-01-20 384
2019-01-21 385
2019-01-22 386
2019-01-23 387
2019-01-24 388
2019-01-25 389
2019-01-26 390
2019-01-27 391
2019-01-28 392
2019-01-29 393
2019-01-30 394
...
2019-12-02 700
2019-12-03 701
2019-12-04 702
2019-12-05 703
2019-12-06 704
2019-12-07 705
2019-12-08 706
2019-12-09 707
2019-12-10 708
2019-12-11 709
2019-12-12 710
2019-12-13 711
2019-12-14 712
2019-12-15 713
2019-12-16 714
2019-12-17 715
2019-12-18 716
2019-12-19 717
2019-12-20 718
2019-12-21 719
2019-12-22 720
2019-12-23 721
2019-12-24 722
2019-12-25 723
2019-12-26 724
2019-12-27 725
2019-12-28 726
2019-12-29 727
2019-12-30 728
2019-12-31 729
Freq: D, Length: 365, dtype: int32
- passing a date range as a slice
In [36]: sr['2018-11':'2019-01'] # slice by year-month
Out[36]:
2018-11-01 304
2018-11-02 305
2018-11-03 306
2018-11-04 307
2018-11-05 308
2018-11-06 309
2018-11-07 310
2018-11-08 311
2018-11-09 312
2018-11-10 313
2018-11-11 314
2018-11-12 315
2018-11-13 316
2018-11-14 317
2018-11-15 318
2018-11-16 319
2018-11-17 320
2018-11-18 321
2018-11-19 322
2018-11-20 323
2018-11-21 324
2018-11-22 325
2018-11-23 326
2018-11-24 327
2018-11-25 328
2018-11-26 329
2018-11-27 330
2018-11-28 331
2018-11-29 332
2018-11-30 333
...
2019-01-02 366
2019-01-03 367
2019-01-04 368
2019-01-05 369
2019-01-06 370
2019-01-07 371
2019-01-08 372
2019-01-09 373
2019-01-10 374
2019-01-11 375
2019-01-12 376
2019-01-13 377
2019-01-14 378
2019-01-15 379
2019-01-16 380
2019-01-17 381
2019-01-18 382
2019-01-19 383
2019-01-20 384
2019-01-21 385
2019-01-22 386
2019-01-23 387
2019-01-24 388
2019-01-25 389
2019-01-26 390
2019-01-27 391
2019-01-28 392
2019-01-29 393
2019-01-30 394
2019-01-31 395
Freq: D, Length: 92, dtype: int32
In [37]: sr['2018-12-03':'2019-01-01'] # slice by date
Out[37]:
2018-12-03 336
2018-12-04 337
2018-12-05 338
2018-12-06 339
2018-12-07 340
2018-12-08 341
2018-12-09 342
2018-12-10 343
2018-12-11 344
2018-12-12 345
2018-12-13 346
2018-12-14 347
2018-12-15 348
2018-12-16 349
2018-12-17 350
2018-12-18 351
2018-12-19 352
2018-12-20 353
2018-12-21 354
2018-12-22 355
2018-12-23 356
2018-12-24 357
2018-12-25 358
2018-12-26 359
2018-12-27 360
2018-12-28 361
2018-12-29 362
2018-12-30 363
2018-12-31 364
2019-01-01 365
Freq: D, dtype: int32
- a rich set of supporting functions, e.g. resample() and strftime() (a strftime sketch follows the resample and truncate examples below)
In [38]: sr.resample('W').sum() # weekly sums
Out[38]:
2018-01-07 21
2018-01-14 70
2018-01-21 119
2018-01-28 168
2018-02-04 217
2018-02-11 266
2018-02-18 315
2018-02-25 364
2018-03-04 413
2018-03-11 462
2018-03-18 511
2018-03-25 560
2018-04-01 609
2018-04-08 658
2018-04-15 707
2018-04-22 756
2018-04-29 805
2018-05-06 854
2018-05-13 903
2018-05-20 952
2018-05-27 1001
2018-06-03 1050
2018-06-10 1099
2018-06-17 1148
2018-06-24 1197
2018-07-01 1246
2018-07-08 1295
2018-07-15 1344
2018-07-22 1393
2018-07-29 1442
...
2020-03-08 5558
2020-03-15 5607
2020-03-22 5656
2020-03-29 5705
2020-04-05 5754
2020-04-12 5803
2020-04-19 5852
2020-04-26 5901
2020-05-03 5950
2020-05-10 5999
2020-05-17 6048
2020-05-24 6097
2020-05-31 6146
2020-06-07 6195
2020-06-14 6244
2020-06-21 6293
2020-06-28 6342
2020-07-05 6391
2020-07-12 6440
2020-07-19 6489
2020-07-26 6538
2020-08-02 6587
2020-08-09 6636
2020-08-16 6685
2020-08-23 6734
2020-08-30 6783
2020-09-06 6832
2020-09-13 6881
2020-09-20 6930
2020-09-27 5979
Freq: W-SUN, Length: 143, dtype: int32
In [39]: sr.resample('A').sum() # yearly sums
Out[39]:
2018-12-31 66430
2019-12-31 199655
2020-12-31 233415
Freq: A-DEC, dtype: int32
In [40]: sr.resample('M').mean() # monthly means
Out[40]:
2018-01-31 15.0
2018-02-28 44.5
2018-03-31 74.0
2018-04-30 104.5
2018-05-31 135.0
2018-06-30 165.5
2018-07-31 196.0
2018-08-31 227.0
2018-09-30 257.5
2018-10-31 288.0
2018-11-30 318.5
2018-12-31 349.0
2019-01-31 380.0
2019-02-28 409.5
2019-03-31 439.0
2019-04-30 469.5
2019-05-31 500.0
2019-06-30 530.5
2019-07-31 561.0
2019-08-31 592.0
2019-09-30 622.5
2019-10-31 653.0
2019-11-30 683.5
2019-12-31 714.0
2020-01-31 745.0
2020-02-29 775.0
2020-03-31 805.0
2020-04-30 835.5
2020-05-31 866.0
2020-06-30 896.5
2020-07-31 927.0
2020-08-31 958.0
2020-09-30 986.5
Freq: M, dtype: float64
In [41]: sr.truncate(before='2019-11-12') # drop everything before the given date; slicing is already so flexible that this is rarely needed
Out[41]:
2019-11-12 680
2019-11-13 681
2019-11-14 682
2019-11-15 683
2019-11-16 684
2019-11-17 685
2019-11-18 686
2019-11-19 687
2019-11-20 688
2019-11-21 689
2019-11-22 690
2019-11-23 691
2019-11-24 692
2019-11-25 693
2019-11-26 694
2019-11-27 695
2019-11-28 696
2019-11-29 697
2019-11-30 698
2019-12-01 699
2019-12-02 700
2019-12-03 701
2019-12-04 702
2019-12-05 703
2019-12-06 704
2019-12-07 705
2019-12-08 706
2019-12-09 707
2019-12-10 708
2019-12-11 709
...
2020-08-28 970
2020-08-29 971
2020-08-30 972
2020-08-31 973
2020-09-01 974
2020-09-02 975
2020-09-03 976
2020-09-04 977
2020-09-05 978
2020-09-06 979
2020-09-07 980
2020-09-08 981
2020-09-09 982
2020-09-10 983
2020-09-11 984
2020-09-12 985
2020-09-13 986
2020-09-14 987
2020-09-15 988
2020-09-16 989
2020-09-17 990
2020-09-18 991
2020-09-19 992
2020-09-20 993
2020-09-21 994
2020-09-22 995
2020-09-23 996
2020-09-24 997
2020-09-25 998
2020-09-26 999
Freq: D, Length: 320, dtype: int32
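And a minimal strftime sketch for formatting a DatetimeIndex as strings (the format string is just an illustration):
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(3), index=pd.date_range('2018-01-01', periods=3))
sr.index.strftime('%Y/%m/%d')   # '2018/01/01', '2018/01/02', '2018/01/03'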
pandas: file handling
Reading: read_csv
Common data file format: CSV (values separated by a delimiter)
pandas can load data from a file name, a URL, or a file object
- read_csv: the default delimiter is the comma
- read_table: the default delimiter is the tab character
Parameters
- sep specifies the delimiter; regular expressions such as '\s+' are allowed
- index_col specifies a column to use as the index
In [87]: df.to_csv('test.csv',header=True,index=True,na_rep='null',encoding='gbk',columns=['one','two']) # use the DataFrame's to_csv method to build a sample file
In [88]: pd.read_csv('test.csv')
Out[88]:
Unnamed: 0 one two
0 a 1.0 4.0
1 b 2.0 5.0
2 c 3.0 6.0
3 d 4.0 7.0
4 e NaN 10.0
5 f NaN NaN
In [89]: pd.read_csv('test.csv',index_col=0) # the row labels can be taken from a column given by position
Out[89]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e NaN 10.0
f NaN NaN
In [90]: pd.read_csv('test.csv',index_col='one') # or from a column given by name
Out[90]:
Unnamed: 0 two
one
1.0 a 4.0
2.0 b 5.0
3.0 c 6.0
4.0 d 7.0
NaN e 10.0
NaN f NaN
Reading a time column this way still has a problem: even though it becomes the index, it is read in as plain strings rather than time objects.
How do we convert it into time objects?
- parse_dates specifies which columns should be parsed as dates; it takes a boolean or a list (a round-trip sketch follows the two lines below)
pd.read_csv('test.csv',index_col='date',parse_dates=True) # parse every column in the table that can be interpreted as a date
pd.read_csv('test.csv',index_col='date',parse_dates=['date']) # parse only this column as dates
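A minimal round-trip sketch, assuming a hypothetical frame with a 'date' column written to date_test.csv (the file name is just for illustration):
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2018-01-01', periods=3), 'value': [1, 2, 3]})
df.to_csv('date_test.csv', index=False)

df2 = pd.read_csv('date_test.csv', index_col='date', parse_dates=['date'])
df2.index   # DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], ...)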
- header=None tells the parser the file has no header row
- names specifies the column names; pass a list
If the file has no header row, the first row of data is otherwise used as the column names; to control this, do the following:
In [106]: pd.read_csv('test.csv')
Out[106]:
1.0 4.0
0 2.0 5.0
1 3.0 6.0
2 4.0 7.0
3 NaN 10.0
4 NaN NaN
In [107]: pd.read_csv('test.csv',header=None) # tell the parser the data has no header row, so the first data row is not turned into column names; the columns are then numbered from 0
Out[107]:
0 1
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN 10.0
5 NaN NaN
In [108]: pd.read_csv('test.csv',header=None,names=list('gh')) # pass a list to name the columns 'g' and 'h'
Out[108]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN 10.0
5 NaN NaN
- na_values specifies a value (or string) to be treated as missing (NaN)
- skiprows specifies rows to skip
In [110]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10')
Out[110]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN NaN
5 NaN NaN
In [111]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10',skiprows=[4,5])
Out[111]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
Writing: the to_csv method
- sep specifies the delimiter
- na_rep specifies the string used for missing values; the default is the empty string
- header=False suppresses the header row
- index=False suppresses the index column
- columns specifies which columns to write; pass a list
In [59]: df.to_csv('test3.csv',header=False,index=False,na_rep='null',encoding='gbk',columns=['年份','
...: 股票代码','股票价格'])
In [60]: pd.read_csv('test3.csv',encoding='gbk')
Other file types pandas supports: JSON, XML, HTML, databases, pickle, Excel, ...
In [68]: df.to_html('test.html',header=False,index=False,na_rep='null',columns=['年份','股票代码','
...: 股票价格'])
In [5]: pd.read_html('test.html',encoding='gbk') # reading these formats requires installing additional modules
Source: oschina
Link: https://my.oschina.net/u/4281713/blog/3786409