What this covers
Basic pandas operations
Key points
1. Change a DataFrame's index and assign values under the new labels
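import numpy as np
import pandas as pd
# Six consecutive days starting 2020-01-24, used as the row index
dates = pd.date_range("20200124", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates,
                  columns=['A', 'B', 'C', 'D'])
# Keep only the first four dates and add a new, empty column "E"
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
print(df)
print(df1)
# Assign 2 to column "E" for the second and third dates
df1.loc[dates[1:3], "E"] = 2
print(df1)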
The output is as follows
             A   B   C   D
2020-01-24   0   1   2   3
2020-01-25   4   5   6   7
2020-01-26   8   9  10  11
2020-01-27  12  13  14  15
2020-01-28  16  17  18  19
2020-01-29  20  21  22  23
               A     B     C     D    E
2020-01-24   0.0   1.0   2.0   3.0  NaN
2020-01-25   4.0   5.0   6.0   7.0  NaN
2020-01-26   8.0   9.0  10.0  11.0  NaN
2020-01-27  12.0  13.0  14.0  15.0  NaN
               A     B     C     D    E
2020-01-24   0.0   1.0   2.0   3.0  NaN
2020-01-25   4.0   5.0   6.0   7.0  2.0
2020-01-26   8.0   9.0  10.0  11.0  2.0
2020-01-27  12.0  13.0  14.0  15.0  NaN
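If the NaN placeholders in the new column are not wanted, one option is to give reindex a fill_value. The following is only a sketch under that assumption: the name filled and the default of 0 are illustrative choices, not part of the original notes.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200124", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Keep the first four dates and add a column "E" that does not exist yet.
# fill_value is written into every newly created cell instead of NaN.
filled = df.reindex(index=dates[:4], columns=list(df.columns) + ["E"], fill_value=0)

# Label-based slicing with .loc includes both end points.
filled.loc[dates[1]:dates[2], "E"] = 2

print(filled)
```

Because no NaN is introduced, nothing has to be cleaned up afterwards; whether 0 is a sensible default depends on the data.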
2. Check a DataFrame for missing values, then drop them or fill in a default
print(pd.isnull(df1).any())  # per column: does it contain any NaN?
A    False
B    False
C    False
D    False
E     True
dtype: bool
print(pd.isnull(df1))  # per cell: is it NaN?
                A      B      C      D      E
2020-01-24  False  False  False  False   True
2020-01-25  False  False  False  False  False
2020-01-26  False  False  False  False  False
2020-01-27  False  False  False  False   True
df2 = df1.dropna()  # drop every row that contains a NaN
print(df2)
              A    B     C     D    E
2020-01-25  4.0  5.0   6.0   7.0  2.0
2020-01-26  8.0  9.0  10.0  11.0  2.0
df3 = df1.fillna(value=5.0)  # replace every NaN with 5.0
print(df3)
               A     B     C     D    E
2020-01-24   0.0   1.0   2.0   3.0  5.0
2020-01-25   4.0   5.0   6.0   7.0  2.0
2020-01-26   8.0   9.0  10.0  11.0  2.0
2020-01-27  12.0  13.0  14.0  15.0  5.0
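A few variations on the same calls can be handy: counting the missing cells, dropping columns instead of rows, and filling each column with its own default. This is a minimal sketch on a made-up two-column frame; the names frame, missing_per_column, without_e and defaults are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"A": [0.0, 4.0, 8.0, 12.0], "E": [np.nan, 2.0, 2.0, np.nan]},
    index=pd.date_range("20200124", periods=4),
)

# Count missing cells per column instead of only testing for their presence.
missing_per_column = frame.isna().sum()

# axis=1 drops the columns that contain NaN rather than the rows.
without_e = frame.dropna(axis=1)

# A dict passed to fillna gives each listed column its own default,
# here the mean of the non-missing values in "E".
defaults = frame.fillna({"E": frame["E"].mean()})

print(missing_per_column)
print(without_e)
print(defaults)
```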
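Note: NaN (missing) values do not take part in arithmetic; any result that involves a NaN is itself NaN
k = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)  # create a Series and shift it down two rows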
print(df.sub(k, axis=0))
               A     B     C     D
2020-01-24   NaN   NaN   NaN   NaN
2020-01-25   NaN   NaN   NaN   NaN
2020-01-26   7.0   8.0   9.0  10.0
2020-01-27   9.0  10.0  11.0  12.0
2020-01-28  11.0  12.0  13.0  14.0
2020-01-29   NaN   NaN   NaN   NaN
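The NaN rows above come from two places: shift(2) leaves the first two slots empty, and the original NaN moves to the last date. If those rows should keep their original values instead, one possible workaround is to replace the NaN in k with 0 before subtracting. The sketch below assumes that treating a missing value as 0 is acceptable; the names plain and kept are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200124", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# shift(2) moves every value down two rows; the first two slots become NaN.
k = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Default behaviour: a row where k is NaN is NaN in every column.
plain = df.sub(k, axis=0)

# Workaround (an assumption, not the original author's approach):
# treat missing values in k as 0 so those rows are left unchanged.
kept = df.sub(k.fillna(0), axis=0)

print(plain)
print(kept)
```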
3. Using apply
We can operate on a DataFrame by passing a function of our own to apply
print(df.apply(lambda x: x.max() - x.min(), axis=1))  # axis=1: max minus min of each row
2020-01-24 3
2020-01-25 3
2020-01-26 3
2020-01-27 3
2020-01-28 3
2020-01-29 3
print(df.apply(lambda x: x.max() - x.min(), axis=0))  # axis=0: max minus min of each column
A 20
B 20
C 20
D 20
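The lambda can also be written as a named function, which is easier to reuse and document. The sketch below does that and, for comparison, computes the column result with plain reductions; value_range is a hypothetical helper name, and a default integer index is used to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("ABCD"))


def value_range(values):
    """Return the spread (max minus min) of one row or one column."""
    return values.max() - values.min()


per_row = df.apply(value_range, axis=1)     # one result per row
per_column = df.apply(value_range, axis=0)  # one result per column

# The same column result without apply, using vectorised reductions.
per_column_fast = df.max() - df.min()

print(per_row)
print(per_column)
```

For simple reductions like this one the vectorised form is usually the faster choice; apply is more useful when the per-row or per-column logic cannot be expressed with built-in methods.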
4. Concatenating DataFrames
k1 = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
print(k1)  # create the DataFrame
          A         B         C         D
0  0.992862 -1.055113 -0.974825 -0.277429
1 -2.192020  0.010336 -0.359606 -0.396194
2  0.127347  1.635541  0.229802  0.451141
3 -0.027055 -0.416263 -0.076526  0.130186
4  1.430352  0.115233 -0.906570 -0.137904
5  1.248494 -0.443819  2.787698  0.276130
6  0.055125  0.546643  0.741180  0.915405
7 -0.778035  0.413182  1.730721  1.425731
8  0.750959 -1.330331  0.137999 -1.399205
9 -0.946425 -0.964356 -0.821974 -0.261646
k2 = pd.concat([k1.iloc[:3], k1.iloc[3:7], k1.iloc[7:]])
print(k2)  # concatenate the three slices
          A         B         C         D
0  0.992862 -1.055113 -0.974825 -0.277429
1 -2.192020  0.010336 -0.359606 -0.396194
2  0.127347  1.635541  0.229802  0.451141
3 -0.027055 -0.416263 -0.076526  0.130186
4  1.430352  0.115233 -0.906570 -0.137904
5  1.248494 -0.443819  2.787698  0.276130
6  0.055125  0.546643  0.741180  0.915405
7 -0.778035  0.413182  1.730721  1.425731
8  0.750959 -1.330331  0.137999 -1.399205
9 -0.946425 -0.964356 -0.821974 -0.261646
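Comparing the two outputs shows that concatenating the slices gives back the DataFrame we created, with its contents unchanged!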
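Rather than comparing the two printouts by eye, the check can be done in code. The following sketch uses a fixed random seed (an arbitrary choice) so that it is repeatable, and also shows the keys argument of pd.concat as one way to remember which slice each row came from; the labels "head", "middle" and "tail" are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
k1 = pd.DataFrame(rng.standard_normal((10, 4)), columns=list("ABCD"))

pieces = [k1.iloc[:3], k1.iloc[3:7], k1.iloc[7:]]
k2 = pd.concat(pieces)

# equals() compares shape, labels and values in a single call.
print(k2.equals(k1))

# keys= labels each piece, which gives the result a two-level index.
labelled = pd.concat(pieces, keys=["head", "middle", "tail"])
print(labelled.loc["middle"])
```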
5. Merging DataFrames
b1 = pd.DataFrame({'key': ["foo", "foo"], "level": [1, 2]})
b2 = pd.DataFrame({'key': ["foo", "foo"], "levels": [3, 2]})
print(b1)
print(b2)
b3 = pd.merge(b1, b2, on="key")  # join the two frames on the key column
print(b3)
   key  level
0  foo      1
1  foo      2
   key  levels
0  foo       3
1  foo       2
   key  level  levels
0  foo      1       3
1  foo      1       2
2  foo      2       3
3  foo      2       2
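The merged result has four rows because every "foo" row on the left is paired with every "foo" row on the right (2 x 2). The sketch below repeats that merge and contrasts it with a case where the keys are unique; left, right and the key "bar" are made-up examples, and validate is an optional pd.merge argument that raises an error if the keys turn out not to be unique.

```python
import pandas as pd

b1 = pd.DataFrame({"key": ["foo", "foo"], "level": [1, 2]})
b2 = pd.DataFrame({"key": ["foo", "foo"], "levels": [3, 2]})

# Two "foo" rows on each side give 2 x 2 = 4 rows in the result.
b3 = pd.merge(b1, b2, on="key")

# With unique keys the merge is one-to-one and no rows are multiplied.
left = pd.DataFrame({"key": ["foo", "bar"], "level": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "levels": [3, 2]})
one_to_one = pd.merge(left, right, on="key", validate="one_to_one")

print(len(b3), len(one_to_one))
print(one_to_one)
```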
7. Inserting data
dates = pd.date_range("20200124", periods=6)
k1 = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates,
                  columns=['A', 'B', 'C', 'D'])
# DataFrame.append was removed in pandas 2.0; pd.concat appends the rows the same way
print(pd.concat([df, k1], ignore_index=True))
            A          B          C          D
0    0.000000   1.000000   2.000000   3.000000
1    4.000000   5.000000   6.000000   7.000000
2    8.000000   9.000000  10.000000  11.000000
3   12.000000  13.000000  14.000000  15.000000
4   16.000000  17.000000  18.000000  19.000000
5   20.000000  21.000000  22.000000  23.000000
6    0.513074  -0.019724   0.266624  -0.660174
7   -1.095335  -1.778028   1.012710  -2.805666
8    0.068861  -0.530661   0.377946  -0.380027
9   -1.195802  -0.502530  -0.270067  -0.329765
10  -0.746135   0.053221   0.813126  -0.003984
11   0.101050   1.130641  -0.540327  -1.511843
12  -0.543655  -1.849691   0.970787  -0.710726
13   0.657702   1.031124  -0.391400   1.630099
14   0.917138   2.269298  -1.821373  -0.996168
15   0.069104   0.228138  -0.272084   0.776543
print(pd.concat([df, k1]))  # without ignore_index=True, each DataFrame keeps its own index
                             A          B          C          D
2020-01-24 00:00:00   0.000000   1.000000   2.000000   3.000000
2020-01-25 00:00:00   4.000000   5.000000   6.000000   7.000000
2020-01-26 00:00:00   8.000000   9.000000  10.000000  11.000000
2020-01-27 00:00:00  12.000000  13.000000  14.000000  15.000000
2020-01-28 00:00:00  16.000000  17.000000  18.000000  19.000000
2020-01-29 00:00:00  20.000000  21.000000  22.000000  23.000000
0                     1.169509  -0.639238   0.802631   1.222766
1                     0.256757   0.059154   1.374081  -1.899945
2                    -1.640777  -0.563769   0.889861   1.162652
3                    -1.501253  -0.111475  -0.816034   0.550832
4                     0.671567   0.769691  -1.315246   0.464230
5                     1.468028   0.996928  -0.340389  -0.340204
6                     0.995253   2.082980  -2.283605  -0.264507
7                    -1.037400  -0.376944   0.458426  -0.650974
8                    -2.138595   2.230048   1.843457  -0.175943
9                     1.313964  -0.154494  -0.960270  -0.792406
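The effect of ignore_index is easier to check on a small, deterministic example than on random numbers. This sketch uses pd.concat, which works on current pandas releases (DataFrame.append no longer exists in pandas 2.0 and later); the frame extra and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200124", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Hypothetical rows to add: two rows of ones with a default 0, 1 index.
extra = pd.DataFrame(np.ones((2, 4)), columns=list("ABCD"))

# ignore_index=True discards both indexes and renumbers the rows 0..7.
renumbered = pd.concat([df, extra], ignore_index=True)

# Without it, the result mixes the dates from df with 0 and 1 from extra.
kept = pd.concat([df, extra])

print(renumbered.index.tolist())
print(kept.index.tolist())
```

A mixed index like the one in kept is legal but awkward to slice, so renumbering is often the more practical choice when the original labels carry no meaning.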
8. Grouping by a condition
print(df.groupby("A").sum())  # group by A, then sum each group
     B   C   D
A
0    1   2   3
4    5   6   7
8    9  10  11
12  13  14  15
16  17  18  19
20  21  22  23
print(df.groupby(["A", "B"]).sum())  # group by A and B; unlike a single key, several keys must be passed as a list, hence the square brackets
        C   D
A  B
0  1    2   3
4  5    6   7
8  9   10  11
12 13  14  15
16 17  18  19
20 21  22  23
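In the outputs above every value of A is unique, so each group holds exactly one row and the sums simply repeat the original data. Grouping becomes visible once keys repeat. The sketch below uses a made-up frame named sales with repeated keys to show rows being combined.

```python
import pandas as pd

# Hypothetical data in which the grouping keys repeat.
sales = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y", "y"],
        "B": ["p", "q", "p", "p", "q"],
        "C": [1, 2, 3, 4, 5],
        "D": [10, 20, 30, 40, 50],
    }
)

# One key: select the numeric columns so the text column B is not summed.
by_a = sales.groupby("A")[["C", "D"]].sum()

# Two keys: passed as a list, giving a two-level index of (A, B) pairs.
by_a_b = sales.groupby(["A", "B"]).sum()

print(by_a)
print(by_a_b)
```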
Source: CSDN
Author: ☺
Link: https://blog.csdn.net/soulproficiency/article/details/104080880