pandas is a powerful Python data-analysis toolkit, built on top of NumPy.
Main features:
- Data structures with built-in data alignment: Series and DataFrame
- Integrated time-series functionality
- A rich set of mathematical operations
- Flexible handling of missing data
Install: pip install pandas
Import: import pandas as pd
Series: a one-dimensional data object
A Series is a one-dimensional, array-like object made up of a set of values and an associated set of data labels (its index).
Creation
In [206]: import pandas as pd
In [207]: pd.Series([4,7,-5,3])
Out[207]:
0 4
1 7
2 -5
3 3
dtype: int64
In [208]: pd.Series([4,7,-5,3], index=['a','b','c','d'])
Out[208]:
a 4
b 7
c -5
d 3
dtype: int64
In [209]: pd.Series({'a':1,'b':2})
Out[209]:
a 1
b 2
dtype: int64
In [210]: pd.Series(0, index=['a','b','c','d'])
Out[210]:
a 0
b 0
c 0
d 0
dtype: int64
Getting the value array and the index: the values and index attributes
In [211]: a = pd.Series([4,7,-5,3], index=['a','b','c','d'])
In [212]: a.values
Out[212]: array([ 4, 7, -5, 3], dtype=int64)
In [214]: a.index
Out[214]: Index(['a', 'b', 'c', 'd'], dtype='object')
A Series behaves like a cross between a list (array) and a dict.
Series: usage
Series supports array-like behavior:
- Arithmetic with a scalar: sr * 2
In [217]: sr
Out[217]:
a 4
b 7
c -5
d 3
dtype: int64
In [218]: sr * 2
Out[218]:
a 8
b 14
c -10
d 6
dtype: int64
- Arithmetic with another Series: sr1 + sr2. Values are added only where the labels match; labels present in just one operand are still added to the result.
In [221]: sr2 = pd.Series([1,2,3,4],index=['a','b','c','d'])
In [222]: sr + sr2
Out[222]:
a 5
b 9
c -2
d 7
dtype: int64
- Indexing: sr[0], sr[[1,2,4]]
In [224]: sr
Out[224]:
a 4
b 7
c -5
d 3
dtype: int64
In [225]: sr[0]
Out[225]: 4
In [226]: sr[[0,2,3]]
Out[226]:
a 4
c -5
d 3
dtype: int64
- Slicing: sr[:2]
In [227]: sr[:2]
Out[227]:
a 4
b 7
dtype: int64
- Universal functions: np.abs(sr)
In [228]: sr
Out[228]:
a 4
b 7
c -5
d 3
dtype: int64
In [229]: np.abs(sr)
Out[229]:
a 4
b 7
c 5
d 3
dtype: int64
- Boolean filtering
In [230]: sr
Out[230]:
a 4
b 7
c -5
d 3
dtype: int64
In [231]: sr[sr>0]
Out[231]:
a 4
b 7
d 3
dtype: int64
Series supports dict-like behavior (labels):
- Create a Series from a dict: pd.Series(dic)
In [232]: pd.Series({'a':1,'b':5})
Out[232]:
a 1
b 5
dtype: int64
- Label membership test: 'a' in sr. When iterating over a Series, the loop goes over the values by default.
In [233]: sr
Out[233]:
a 4
b 7
c -5
d 3
dtype: int64
In [234]: 'a' in sr
Out[234]: True
In [236]: for i in sr:
...: print(i)
4
7
-5
3
- Key indexing: sr['a'], sr[['a','b','d']]
In [237]: sr
Out[237]:
a 4
b 7
c -5
d 3
dtype: int64
In [238]: sr['a']
Out[238]: 4
In [239]: sr[['a','c']]
Out[239]:
a 4
c -5
dtype: int64
In [240]: sr['a':'c'] # with label slices, both the start and the end label are included
Out[240]:
a 4
b 7
c -5
dtype: int64
Series: integer indexes
If a Series has integer labels, label-based and position-based access are easy to confuse; plain indexing such as sr2[5] is interpreted as label-based by default.
In [40]: sr = pd.Series(np.arange(10))
In [41]: sr
Out[41]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
In [44]: sr2 = sr[5:].copy()
In [45]: sr2
Out[45]:
5 5
6 6
7 7
8 8
9 9
dtype: int32
In [46]: sr2[5] # looks up the label 5; if this were positional access, it would raise an error
Out[46]: 5
In [47]: sr2[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-47-3882bebf0859> in <module>()
----> 1 sr2[-1]
C:\python server\anaconda\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
621 key = com._apply_if_callable(key, self)
622 try:
--> 623 result = self.index.get_value(self, key)
624
625 if not is_scalar(result):
C:\python server\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
2558 try:
2559 return self._engine.get_value(s, k,
-> 2560 tz=getattr(series.dtype, 'tz', None))
2561 except KeyError as e1:
2562 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: -1
Solution: say explicitly which kind of access you want: loc is label-based access, iloc is position-based access.
In [48]: sr2.loc[5] # loc: label (key) based access
Out[48]: 5
In [50]: sr2.iloc[4] # iloc: position (integer offset) based access
Out[50]: 9
In [51]: sr2.iloc[-1]
Out[51]: 9
Series: data alignment
When pandas operates on two Series objects, it aligns them by index before computing.
If the two Series do not have exactly the same index, the index of the result is the union of the two operands' indexes.
If only one operand has a value at a given label, the result holds NaN (a missing value) at that label.
In [6]: sr1 = pd.Series([4,9,100],index=['a','b','c'])
In [7]: sr1
Out[7]:
a 4
b 9
c 100
dtype: int64
In [8]: sr2 = pd.Series([4,5,6],index=['b','c','d'])
In [9]: sr2
Out[9]:
b 4
c 5
d 6
dtype: int64
In [10]: sr1 + sr2
Out[10]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
What if we want to handle the missing values, for example keep 'a' as 4 and treat the side that has no value as 0?
Use the flexible arithmetic methods: add, sub, div, mul.
sr1 + sr2 is equivalent to sr1.add(sr2); the fill_value argument of these methods fills in the missing values before the operation.
In [10]: sr1 + sr2
Out[10]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
In [11]: sr1.add(sr2)
Out[11]:
a NaN
b 13.0
c 105.0
d NaN
dtype: float64
In [12]: sr1.add(sr2,fill_value=0)
Out[12]:
a 4.0
b 13.0
c 105.0
d 6.0
dtype: float64
Series: missing data
Missing data is represented by NaN (Not a Number), whose value is np.nan; Python's built-in None is also treated as NaN.
Several methods help us handle missing values:
- dropna() drops entries whose value is NaN
- fillna() fills in missing data
- isnull() returns a boolean array, True where values are missing
- notnull() returns a boolean array, False where values are missing
Option one: drop the missing values
In [61]: sr = sr1 + sr2
In [62]: sr
Out[62]:
a 33.0
b NaN
c 32.0
d 45.0
dtype: float64
In [63]: sr.isnull() # is each value NaN?
Out[63]:
a False
b True
c False
d False
dtype: bool
In [64]: sr.notnull() # True where the value is not NaN; combined with boolean filtering this lets us drop the missing values
Out[64]:
a True
b False
c True
d True
dtype: bool
In [65]: sr[sr.notnull()] # filter out the missing values
Out[65]:
a 33.0
c 32.0
d 45.0
dtype: float64
In [66]: sr.dropna() # Series also provides a method that drops NaN values directly
Out[66]:
a 33.0
c 32.0
d 45.0
dtype: float64
Option two: fill in the missing values
In [68]: sr.fillna(0) # fill with 0; you could also fill with the mean, sr.mean(), since the mean skips NaN values
Out[68]:
a 33.0
b 0.0
c 32.0
d 45.0
dtype: float64
In [69]: sr = sr.fillna(0) # fillna does not modify the existing object, so reassign the result
DataFrame: a two-dimensional data object
A DataFrame is a tabular data structure with an ordered collection of columns. It can be thought of as a dict of Series that share a common index.
Creation
In [3]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]})
Out[3]:
one two
0 1 4
1 2 3
2 3 2
3 4 1
In [4]: pd.DataFrame({'one': pd.Series([1,2,3], index=['a','b','c']), 'two': pd.Series([1,2,3,4],index=['b','a','c','d'])})
Out[4]:
one two
a 1.0 2
b 2.0 1
c 3.0 3
d NaN 4 # missing entries come back as NaN
In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
DataFrame: common attributes
- index returns the index (row labels)
- values returns the value array
- columns returns the column index (column names)
In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [6]: df = _5
In [7]: df.index
Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [8]: df.values
Out[8]:
array([[1, 4],
[2, 3],
[3, 2],
[4, 1]], dtype=int64)
In [9]: df.columns
Out[9]: Index(['one', 'two'], dtype='object')
- T transposes the DataFrame, swapping rows and columns
In [10]: df
Out[10]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [11]: df.T
Out[11]:
a b c d
one 1 2 3 4
two 4 3 2 1
- describe() gives quick summary statistics for each column: count, mean, standard deviation, min, quartiles, median, max
In [13]: df.describe() # column-wise statistics
Out[13]:
one two
count 4.000000 4.000000 #count (NaN excluded)
mean 2.500000 2.500000 #mean
std 1.290994 1.290994 #standard deviation
min 1.000000 1.000000 #minimum
25% 1.750000 1.750000
50% 2.500000 2.500000 #median
75% 3.250000 3.250000
max 4.000000 4.000000 #maximum
DataFrame: indexing and slicing
A DataFrame is a two-dimensional data type, so it has both a row index and a column index.
Like Series, a DataFrame can be indexed and sliced either by label or by position.
Bracket indexing
With plain brackets you select the column first, then the row; you can select a column by itself, but not a row by itself.
In [18]: df
Out[18]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [19]: df['one']['a'] # bracket indexing: column first, then row
Out[19]: 1
In [20]: df['one'] # select a single column
Out[20]:
a 1
b 2
c 3
d 4
Name: one, dtype: int64
In [21]: df['a'] # raises KeyError: bracket indexing cannot select just a row, because the columns are the Series objects, not the rows
Row/column indexing with loc and iloc
- the loc attribute: label-based access (by row and column names)
- the iloc attribute: position-based access
Usage: separate the two selectors with a comma, the row selector first and the column selector second.
Each selector can be a plain index, a slice, a boolean index, or a fancy (list) index, in any combination; a boolean-indexing sketch follows the examples below.
In [22]: df
Out[22]:
one two
a 1 4
b 2 3
c 3 2
d 4 1
In [23]: df.loc['a',] # loc does support selecting just a row
Out[23]:
one 1
two 4
Name: a, dtype: int64
In [24]: df.loc[['a','c'],] # fancy indexing is supported
Out[24]:
one two
a 1 4
c 3 2
In [25]: df.loc['a':'c','one']
Out[25]:
a 1
b 2
c 3
Name: one, dtype: int64
In [26]: df.iloc[0] # iloc also supports selecting just a row
Out[26]:
one 1
two 4
Name: a, dtype: int64
In [27]: df.iloc[0][1]
Out[27]: 4
In [28]: df.iloc[0,1]
Out[28]: 4
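A minimal boolean-indexing sketch with loc, reusing the same df as the examples above:
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]},
                  index=['a', 'b', 'c', 'd'])
# boolean row selector combined with a column label: rows where 'two' > 2, column 'one'
df.loc[df['two'] > 2, 'one']   # a -> 1, b -> 2
# the same boolean array also works as a plain filter on the whole frame
df[df['two'] > 2]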
DataFrame: data alignment and missing data
When DataFrame objects are combined in arithmetic, the data is aligned as well: the row indexes and the column indexes are each aligned separately.
In [29]: pd.DataFrame({'one':[1,2,3,4],'two':[4,5,6,7]}, index=['a','b','c','d'])
Out[29]:
one two
a 1 4
b 2 5
c 3 6
d 4 7
In [30]: df = _29
In [31]: df2 = pd.DataFrame({'two':[7,8,7,8],'one':[8,9,8,8]}, index=['a','c','d','b'])
In [32]: df2
Out[32]:
two one
a 7 8
c 8 9
d 7 8
b 8 8
In [33]: df + df2
Out[33]:
one two
a 9 11
b 10 13
c 12 14
d 12 14
Handling missing values, option one: fill them in with fillna()
In [35]: df.loc['e', 'one'] = np.nan
In [36]: df.loc['e', 'two'] = 10
In [37]: df.loc['f', 'one'] = np.nan
In [38]: df.loc['f', 'two'] = np.nan
In [39]: df
Out[39]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e NaN 10.0
f NaN NaN
In [40]: df.fillna(0)
Out[40]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e 0.0 10.0
f 0.0 0.0
Handling missing values, option two: drop them
- dropna(): axis chooses whether rows or columns are dropped (0, the default, drops rows; 1 drops columns); how chooses when to drop: 'any' drops a row/column containing any NaN, while 'all' drops it only when every value in the row/column is NaN
In [39]: df2.dropna() # the default is how='any'
Out[39]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
In [40]: df2.dropna(how='all') # drop only the rows where every column is NaN
Out[40]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
e NaN 10.0
In [41]: df2.dropna(how='any') # drop the rows containing any NaN
Out[41]:
one two
a 8.0 7.0
c 9.0 8.0
d 8.0 7.0
b 8.0 8.0
In [42]: df.loc['a','one'] = np.nan
In [43]: df
Out[43]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [44]: df.dropna(axis=1) # drop the columns containing NaN
Out[44]:
two
a 4
b 5
c 6
d 7
- isnull() returns a boolean DataFrame, True where values are missing
- notnull() returns a boolean DataFrame, False where values are missing (see the sketch below)
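A minimal isnull/notnull sketch on a DataFrame, assuming a small frame with one missing entry (illustration only):
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [np.nan, 2.0, 3.0], 'two': [4, 5, 6]},
                  index=['a', 'b', 'c'])
df.isnull()              # True at ('a', 'one'), False everywhere else
df.notnull()             # the element-wise negation of isnull()
df[df['one'].notnull()]  # keep only the rows where 'one' is present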
pandas: common methods
- mean(axis=0, skipna=True) computes the mean of each column (axis=0, the default) or of each row (axis=1); NaN values are skipped by default
- sum(axis=1) sums each row (axis=0, the default, sums each column)
In [45]: df
Out[45]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [46]: df.mean() # column means by default
Out[46]:
one 3.0
two 5.5
dtype: float64
In [47]: df.mean(axis=1) # row means
Out[47]:
a 4.0
b 3.5
c 4.5
d 5.5
dtype: float64
In [48]: df.sum() # column sums
Out[48]:
one 9.0
two 22.0
dtype: float64
- sort_index(axis=0, ascending=True) sorts by the row (or column) index; ascending=True sorts in ascending order, False in descending order
- sort_values(by, axis=0, ascending=True) sorts by the values of a column (or row); by names the column or row to sort by
In [49]: df.sort_values(by='two') # sort by a column, ascending
Out[49]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [50]: df.sort_values(by='two',ascending=False) # sort by a column, descending
Out[50]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [52]: df.sort_values(by='a',ascending=False,axis=1) # sort by a row, descending
Out[52]:
two one
a 4 NaN
b 5 2.0
c 6 3.0
d 7 4.0
In [53]: df.sort_values(by='one') # NaN values do not take part in sorting; they are placed last
Out[53]:
one two
b 2.0 5
c 3.0 6
d 4.0 7
a NaN 4
In [54]: df.sort_values(by='one',ascending=False)
Out[54]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [55]: df.sort_index() # sort by the row index, ascending
Out[55]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [56]: df.sort_index(ascending=False) # sort by the row index, descending
Out[56]:
one two
d 4.0 7
c 3.0 6
b 2.0 5
a NaN 4
In [57]: df.sort_index(axis=1) # sort by the column index
Out[57]:
one two
a NaN 4
b 2.0 5
c 3.0 6
d 4.0 7
In [58]: df.sort_index(ascending=False,axis=1)
Out[58]:
two one
a 4 NaN
b 5 2.0
c 6 3.0
d 7 4.0
Others (a short sketch follows this list)
- apply(func, axis=0) applies a custom function to each column (or each row); func may return a scalar or a Series
- applymap(func) applies a function to every element of a DataFrame
- map(func) applies a function to every element of a Series
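A minimal sketch of all three, assuming a small numeric frame (illustration only):
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])
df.apply(lambda col: col.max() - col.min())   # per column: one -> 2, two -> 2
df.apply(lambda row: row.sum(), axis=1)       # per row: a -> 5, b -> 7, c -> 9
df.applymap(lambda x: x * 10)                 # every element of the DataFrame
df['one'].map(lambda x: x + 100)              # every element of a Series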
pandas: working with time objects
Generating an array of time objects: date_range
- start: start time
- end: end time
- periods: number of periods
- freq: frequency; the default is 'D' (calendar day). Common codes include H (hour), W (week), B (business day), SM (semi-month), M (month end), T/min (minute), S (second), A (year end).
In [71]: pd.date_range('2018-01-01',periods=10)
Out[71]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10'],
dtype='datetime64[ns]', freq='D')
In [72]: pd.date_range('2018-01-01','2030-01-01',freq='A')
Out[72]:
DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31',
'2022-12-31', '2023-12-31', '2024-12-31', '2025-12-31',
'2026-12-31', '2027-12-31', '2028-12-31', '2029-12-31'],
dtype='datetime64[ns]', freq='A-DEC')
A time series is simply a Series or DataFrame indexed by time objects.
When datetime objects are used as an index, they are stored in a DatetimeIndex.
In [73]: sr = pd.Series(np.arange(20), index=pd.date_range('2018-01-01', periods=20))
In [74]: sr
Out[74]:
2018-01-01 0
2018-01-02 1
2018-01-03 2
2018-01-04 3
2018-01-05 4
2018-01-06 5
2018-01-07 6
2018-01-08 7
2018-01-09 8
2018-01-10 9
2018-01-11 10
2018-01-12 11
2018-01-13 12
2018-01-14 13
2018-01-15 14
2018-01-16 15
2018-01-17 16
2018-01-18 17
2018-01-19 18
2018-01-20 19
Freq: D, dtype: int32
In [75]: sr.index
Out[75]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
'2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
'2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20'],
dtype='datetime64[ns]', freq='D')
Special time-series features:
- passing a year ('YYYY') or year-month ('YYYY-MM') string as a slice
In [32]: sr = pd.Series(np.arange(1000),index=pd.date_range('2018-01-01',periods=1000))
In [33]: sr['2018-03'] # select one month of a given year
Out[33]:
2018-03-01 59
2018-03-02 60
2018-03-03 61
2018-03-04 62
2018-03-05 63
2018-03-06 64
2018-03-07 65
2018-03-08 66
2018-03-09 67
2018-03-10 68
2018-03-11 69
2018-03-12 70
2018-03-13 71
2018-03-14 72
2018-03-15 73
2018-03-16 74
2018-03-17 75
2018-03-18 76
2018-03-19 77
2018-03-20 78
2018-03-21 79
2018-03-22 80
2018-03-23 81
2018-03-24 82
2018-03-25 83
2018-03-26 84
2018-03-27 85
2018-03-28 86
2018-03-29 87
2018-03-30 88
2018-03-31 89
Freq: D, dtype: int32
In [35]: sr['2019'] # select a whole year
Out[35]:
2019-01-01 365
2019-01-02 366
2019-01-03 367
2019-01-04 368
2019-01-05 369
2019-01-06 370
2019-01-07 371
2019-01-08 372
2019-01-09 373
2019-01-10 374
2019-01-11 375
2019-01-12 376
2019-01-13 377
2019-01-14 378
2019-01-15 379
2019-01-16 380
2019-01-17 381
2019-01-18 382
2019-01-19 383
2019-01-20 384
2019-01-21 385
2019-01-22 386
2019-01-23 387
2019-01-24 388
2019-01-25 389
2019-01-26 390
2019-01-27 391
2019-01-28 392
2019-01-29 393
2019-01-30 394
...
2019-12-02 700
2019-12-03 701
2019-12-04 702
2019-12-05 703
2019-12-06 704
2019-12-07 705
2019-12-08 706
2019-12-09 707
2019-12-10 708
2019-12-11 709
2019-12-12 710
2019-12-13 711
2019-12-14 712
2019-12-15 713
2019-12-16 714
2019-12-17 715
2019-12-18 716
2019-12-19 717
2019-12-20 718
2019-12-21 719
2019-12-22 720
2019-12-23 721
2019-12-24 722
2019-12-25 723
2019-12-26 724
2019-12-27 725
2019-12-28 726
2019-12-29 727
2019-12-30 728
2019-12-31 729
Freq: D, Length: 365, dtype: int32
- passing a date range as a slice
In [36]: sr['2018-11':'2019-01'] # slice by year-month
Out[36]:
2018-11-01 304
2018-11-02 305
2018-11-03 306
2018-11-04 307
2018-11-05 308
2018-11-06 309
2018-11-07 310
2018-11-08 311
2018-11-09 312
2018-11-10 313
2018-11-11 314
2018-11-12 315
2018-11-13 316
2018-11-14 317
2018-11-15 318
2018-11-16 319
2018-11-17 320
2018-11-18 321
2018-11-19 322
2018-11-20 323
2018-11-21 324
2018-11-22 325
2018-11-23 326
2018-11-24 327
2018-11-25 328
2018-11-26 329
2018-11-27 330
2018-11-28 331
2018-11-29 332
2018-11-30 333
...
2019-01-02 366
2019-01-03 367
2019-01-04 368
2019-01-05 369
2019-01-06 370
2019-01-07 371
2019-01-08 372
2019-01-09 373
2019-01-10 374
2019-01-11 375
2019-01-12 376
2019-01-13 377
2019-01-14 378
2019-01-15 379
2019-01-16 380
2019-01-17 381
2019-01-18 382
2019-01-19 383
2019-01-20 384
2019-01-21 385
2019-01-22 386
2019-01-23 387
2019-01-24 388
2019-01-25 389
2019-01-26 390
2019-01-27 391
2019-01-28 392
2019-01-29 393
2019-01-30 394
2019-01-31 395
Freq: D, Length: 92, dtype: int32
In [37]: sr['2018-12-03':'2019-01-01'] # slice by date
Out[37]:
2018-12-03 336
2018-12-04 337
2018-12-05 338
2018-12-06 339
2018-12-07 340
2018-12-08 341
2018-12-09 342
2018-12-10 343
2018-12-11 344
2018-12-12 345
2018-12-13 346
2018-12-14 347
2018-12-15 348
2018-12-16 349
2018-12-17 350
2018-12-18 351
2018-12-19 352
2018-12-20 353
2018-12-21 354
2018-12-22 355
2018-12-23 356
2018-12-24 357
2018-12-25 358
2018-12-26 359
2018-12-27 360
2018-12-28 361
2018-12-29 362
2018-12-30 363
2018-12-31 364
2019-01-01 365
Freq: D, dtype: int32
- a rich set of supporting functions, e.g. resample() and strftime() (a strftime sketch follows the resample and truncate examples below)
In [38]: sr.resample('W').sum() # weekly sums
Out[38]:
2018-01-07 21
2018-01-14 70
2018-01-21 119
2018-01-28 168
2018-02-04 217
2018-02-11 266
2018-02-18 315
2018-02-25 364
2018-03-04 413
2018-03-11 462
2018-03-18 511
2018-03-25 560
2018-04-01 609
2018-04-08 658
2018-04-15 707
2018-04-22 756
2018-04-29 805
2018-05-06 854
2018-05-13 903
2018-05-20 952
2018-05-27 1001
2018-06-03 1050
2018-06-10 1099
2018-06-17 1148
2018-06-24 1197
2018-07-01 1246
2018-07-08 1295
2018-07-15 1344
2018-07-22 1393
2018-07-29 1442
...
2020-03-08 5558
2020-03-15 5607
2020-03-22 5656
2020-03-29 5705
2020-04-05 5754
2020-04-12 5803
2020-04-19 5852
2020-04-26 5901
2020-05-03 5950
2020-05-10 5999
2020-05-17 6048
2020-05-24 6097
2020-05-31 6146
2020-06-07 6195
2020-06-14 6244
2020-06-21 6293
2020-06-28 6342
2020-07-05 6391
2020-07-12 6440
2020-07-19 6489
2020-07-26 6538
2020-08-02 6587
2020-08-09 6636
2020-08-16 6685
2020-08-23 6734
2020-08-30 6783
2020-09-06 6832
2020-09-13 6881
2020-09-20 6930
2020-09-27 5979
Freq: W-SUN, Length: 143, dtype: int32
In [39]: sr.resample('A').sum() # yearly sums
Out[39]:
2018-12-31 66430
2019-12-31 199655
2020-12-31 233415
Freq: A-DEC, dtype: int32
In [40]: sr.resample('M').mean() # monthly means
Out[40]:
2018-01-31 15.0
2018-02-28 44.5
2018-03-31 74.0
2018-04-30 104.5
2018-05-31 135.0
2018-06-30 165.5
2018-07-31 196.0
2018-08-31 227.0
2018-09-30 257.5
2018-10-31 288.0
2018-11-30 318.5
2018-12-31 349.0
2019-01-31 380.0
2019-02-28 409.5
2019-03-31 439.0
2019-04-30 469.5
2019-05-31 500.0
2019-06-30 530.5
2019-07-31 561.0
2019-08-31 592.0
2019-09-30 622.5
2019-10-31 653.0
2019-11-30 683.5
2019-12-31 714.0
2020-01-31 745.0
2020-02-29 775.0
2020-03-31 805.0
2020-04-30 835.5
2020-05-31 866.0
2020-06-30 896.5
2020-07-31 927.0
2020-08-31 958.0
2020-09-30 986.5
Freq: M, dtype: float64
In [41]: sr.truncate(before='2019-11-12') # drop everything before the given date; slicing is already so flexible that this is rarely needed
Out[41]:
2019-11-12 680
2019-11-13 681
2019-11-14 682
2019-11-15 683
2019-11-16 684
2019-11-17 685
2019-11-18 686
2019-11-19 687
2019-11-20 688
2019-11-21 689
2019-11-22 690
2019-11-23 691
2019-11-24 692
2019-11-25 693
2019-11-26 694
2019-11-27 695
2019-11-28 696
2019-11-29 697
2019-11-30 698
2019-12-01 699
2019-12-02 700
2019-12-03 701
2019-12-04 702
2019-12-05 703
2019-12-06 704
2019-12-07 705
2019-12-08 706
2019-12-09 707
2019-12-10 708
2019-12-11 709
...
2020-08-28 970
2020-08-29 971
2020-08-30 972
2020-08-31 973
2020-09-01 974
2020-09-02 975
2020-09-03 976
2020-09-04 977
2020-09-05 978
2020-09-06 979
2020-09-07 980
2020-09-08 981
2020-09-09 982
2020-09-10 983
2020-09-11 984
2020-09-12 985
2020-09-13 986
2020-09-14 987
2020-09-15 988
2020-09-16 989
2020-09-17 990
2020-09-18 991
2020-09-19 992
2020-09-20 993
2020-09-21 994
2020-09-22 995
2020-09-23 996
2020-09-24 997
2020-09-25 998
2020-09-26 999
Freq: D, Length: 320, dtype: int32
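And a minimal strftime sketch for formatting a DatetimeIndex as strings (the format string is just an illustration):
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(3), index=pd.date_range('2018-01-01', periods=3))
sr.index.strftime('%Y/%m/%d')   # '2018/01/01', '2018/01/02', '2018/01/03'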
pandas: file handling
Reading: read_csv
Common data file format: CSV (values separated by a delimiter)
pandas can load data from a file name, a URL, or a file object
- read_csv: the default delimiter is the comma
- read_table: the default delimiter is the tab character
Parameters
- sep specifies the delimiter; regular expressions such as '\s+' are allowed
- index_col specifies a column to use as the index
In [87]: df.to_csv('test.csv',header=True,index=True,na_rep='null',encoding='gbk',columns=['one','two']) # use the DataFrame's to_csv method to build a sample file
In [88]: pd.read_csv('test.csv')
Out[88]:
Unnamed: 0 one two
0 a 1.0 4.0
1 b 2.0 5.0
2 c 3.0 6.0
3 d 4.0 7.0
4 e NaN 10.0
5 f NaN NaN
In [89]: pd.read_csv('test.csv',index_col=0) # the row labels can be taken from a column given by position
Out[89]:
one two
a 1.0 4.0
b 2.0 5.0
c 3.0 6.0
d 4.0 7.0
e NaN 10.0
f NaN NaN
In [90]: pd.read_csv('test.csv',index_col='one') # or from a column given by name
Out[90]:
Unnamed: 0 two
one
1.0 a 4.0
2.0 b 5.0
3.0 c 6.0
4.0 d 7.0
NaN e 10.0
NaN f NaN
Reading a time column this way still has a problem: even though it becomes the index, it is read in as plain strings rather than time objects.
How do we convert it into time objects?
- parse_dates specifies which columns should be parsed as dates; it takes a boolean or a list (a round-trip sketch follows the two lines below)
pd.read_csv('test.csv',index_col='date',parse_dates=True) # parse every column in the table that can be interpreted as a date
pd.read_csv('test.csv',index_col='date',parse_dates=['date']) # parse only this column as dates
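A minimal round-trip sketch, assuming a hypothetical frame with a 'date' column written to date_test.csv (the file name is just for illustration):
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2018-01-01', periods=3), 'value': [1, 2, 3]})
df.to_csv('date_test.csv', index=False)

df2 = pd.read_csv('date_test.csv', index_col='date', parse_dates=['date'])
df2.index   # DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], ...)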
- header=None tells the parser the file has no header row
- names specifies the column names; pass a list
If the file has no header row, the first row of data is otherwise used as the column names; to control this, do the following:
In [106]: pd.read_csv('test.csv')
Out[106]:
1.0 4.0
0 2.0 5.0
1 3.0 6.0
2 4.0 7.0
3 NaN 10.0
4 NaN NaN
In [107]: pd.read_csv('test.csv',header=None) # tell the parser the data has no header row, so the first data row is not turned into column names; the columns are then numbered from 0
Out[107]:
0 1
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN 10.0
5 NaN NaN
In [108]: pd.read_csv('test.csv',header=None,names=list('gh')) # pass a list to name the columns 'g' and 'h'
Out[108]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN 10.0
5 NaN NaN
- na_values specifies a value (or string) to be treated as missing (NaN)
- skiprows specifies rows to skip
In [110]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10')
Out[110]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
4 NaN NaN
5 NaN NaN
In [111]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10',skiprows=[4,5])
Out[111]:
g h
0 1.0 4.0
1 2.0 5.0
2 3.0 6.0
3 4.0 7.0
Writing: the to_csv method
- sep specifies the delimiter
- na_rep specifies the string used for missing values; the default is the empty string
- header=False suppresses the header row
- index=False suppresses the index column
- columns specifies which columns to write; pass a list
In [59]: df.to_csv('test3.csv',header=False,index=False,na_rep='null',encoding='gbk',columns=['年份','
...: 股票代码','股票价格'])
In [60]: pd.read_csv('test3.csv',encoding='gbk')
Other file types pandas supports: JSON, XML, HTML, databases, pickle, Excel, ...
In [68]: df.to_html('test.html',header=False,index=False,na_rep='null',columns=['年份','股票代码','
...: 股票价格'])
In [5]: pd.read_html('test.html',encoding='gbk') # reading these formats requires installing additional modules
Source: oschina
Link: https://my.oschina.net/u/4281713/blog/3786409