This post uses pandas to work with data the way you would in SQL.
Loading the test data
import pandas as pd
import numpy as np

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)  # load the data
tips.head()
tips.head() displays the first 5 rows of the test data.
SELECT (select statement)
SQL statement:
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
Python statement:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
UPDATE (update statement)
SQL statement:
UPDATE tips SET tip = tip*2 WHERE tip < 2;
Python statement:
tips.loc[tips['tip'] < 2, 'tip'] *= 2
DELETE (delete statement)
SQL statement:
DELETE FROM tips WHERE tip > 9;
Python statement:
tips = tips.loc[tips['tip'] <= 9]
WHERE (conditions)
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
Python statement:
tips[tips['time'] == 'Dinner'].head(5)
AND & OR
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
Python statement:
# in pandas, "&" means AND and "|" means OR; wrap each condition in parentheses
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
SQL statement:
SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;
Python statement:
# select rows where size is at least 5 or total_bill is greater than 45
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
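GROUP BY (grouping and aggregation)
In pandas, SQL's GROUP BY operation is performed with the similarly named groupby() method. groupby() typically refers to a process in which we split a dataset into groups, apply some function (usually an aggregation) to each group, and then combine the results.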
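The split, apply, combine steps can be spelled out by hand to see what groupby() is doing. This is only a rough sketch on a small made-up DataFrame (not the tips data); the names df, manual_result and builtin_result are illustrative.

```python
import pandas as pd

# Hypothetical toy data, not the tips dataset
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Split: iterating over a groupby yields (group key, sub-DataFrame) pairs
# Apply: reduce each sub-DataFrame to a single number (here, the mean tip)
manual = {key: group["tip"].mean() for key, group in df.groupby("sex")}

# Combine: assemble the per-group results into one Series
manual_result = pd.Series(manual)

# groupby() followed by an aggregation performs all three steps in one call
builtin_result = df.groupby("sex")["tip"].mean()

print(manual_result)
print(builtin_result)
```

Both results hold one mean tip per group; in practice the one-line groupby() form is what you would normally write.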
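A common SQL operation is getting the number of records in each group across the whole dataset. For example, this query gives us the number of tips left by each sex:
SQL statement:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/
Python statement:
# SQL's COUNT(*) and pandas' count() behave differently: count() returns the
# number of non-null values in each column, so size() is what we want here
tips.groupby('sex').size()
Python statement: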
tips.groupby('sex').count()
Python statement:
# apply count() to a single column
tips.groupby('sex')['total_bill'].count()
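The difference between size() and count() only becomes visible when a column contains missing values. The sketch below uses a small made-up DataFrame (not the tips data) with two NaN entries to illustrate it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing total_bill values
df = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male"],
    "total_bill": [10.0, np.nan, 20.0, 30.0, np.nan],
})

# size() counts rows per group, whether or not values are missing
sizes = df.groupby("sex").size()

# count() counts only the non-null values of the selected column
counts = df.groupby("sex")["total_bill"].count()

print(sizes)
print(counts)
```

Here size() reports 2 and 3 rows for Female and Male, while count() reports 1 and 2, because each group has one missing total_bill.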
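SQL statement:
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/
Multiple functions can also be applied at once. For example, suppose we want to see how the tip amount differs by day of the week: agg() lets you pass a dictionary to the grouped DataFrame, indicating which functions to apply to which columns.
Python statement:
# function names are passed as strings; recent pandas versions warn when given np.mean
tips.groupby('day').agg({'tip': 'mean', 'day': 'size'})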
Grouping by multiple columns
SQL statement:
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/
Python statement:
tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})
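Checking for missing values with notnull() and isnull()
Create a new test dataset:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'col2': ['A', 'B', np.nan, 'C', 'D'],
                   'col1': ['F', np.nan, 'G', 'H', 'I']})
SQL statement: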
SELECT * FROM df WHERE col2 IS NULL;
Python statement:
# select the rows (observations) where col2 is null
df[df['col2'].isnull()]
SQL statement:
SELECT * FROM df WHERE col1 IS NOT NULL;
Python statement:
# select the rows (observations) where col1 is not null
df[df['col1'].notnull()]
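Just as SQL needs IS NULL rather than = NULL, pandas needs isnull() rather than == np.nan, because NaN never compares equal to anything, including itself. A minimal sketch, using a made-up numeric column rather than the df above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df_num = pd.DataFrame({"col1": [1.0, np.nan, 3.0]})

# Comparing with == never matches, because NaN != NaN
by_equality = df_num[df_num["col1"] == np.nan]

# isnull() is the reliable way to find the missing rows
by_isnull = df_num[df_num["col1"].isnull()]

print(len(by_equality))  # number of rows matched by ==
print(len(by_isnull))    # number of rows matched by isnull()
```

The equality filter returns no rows at all, while isnull() finds the one missing entry.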
JOIN
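A JOIN can be performed with join() or merge(). By default, join() joins the DataFrames on their indices. Each method has parameters that let you specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).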
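Since the examples in this section all use merge(), here is a small sketch of the index-based join() for comparison. The DataFrames left and right and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frames whose index plays the role of the join key
left = pd.DataFrame({"value_left": [1, 2, 3]}, index=["A", "B", "C"])
right = pd.DataFrame({"value_right": [20, 40]}, index=["B", "D"])

# join() matches rows on the index; its default join type is a left join
joined_left = left.join(right)

# the how parameter switches the join type, e.g. to an inner join
joined_inner = left.join(right, how="inner")

print(joined_left)
print(joined_inner)
```

The left join keeps all three rows of left and fills value_right with NaN where there is no match; the inner join keeps only the shared index label B.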
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': np.random.randn(4)})
INNER JOIN
SQL statement:
SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key')
# merge() can also join a column of one DataFrame with the index of another
indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)
LEFT OUTER JOIN
SQL statement:
-- show all records from df1
SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on = 'key', how='left')
RIGHT OUTER JOIN
SQL statement:
-- show all records from df2
SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on = 'key', how='right')
FULL JOIN
SQL statement:
-- show all records from both tables
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2 , on = 'key', how = 'outer')
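When checking the result of an outer join, it can help to see which side each row came from. One option is merge()'s indicator parameter, sketched below with fixed values in place of the random ones used above so that the result is reproducible.

```python
import pandas as pd

# Same keys as df1 and df2 above, but with fixed (made-up) values
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5, 6, 7, 8]})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
full = pd.merge(df1, df2, on="key", how="outer", indicator=True)

print(full)
```

Keys A and C come out as left_only, E as right_only, and B and D as both. D appears twice because it occurs twice in df2.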
UNION
Create new datasets:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})
SQL statement:
SELECT city, rank FROM df1
UNION ALL
SELECT city, rank FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
Python statement:
pd.concat([df1,df2])
SQL's UNION is similar to UNION ALL, but UNION removes duplicate rows.
SELECT city, rank FROM df1
UNION
SELECT city, rank FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/
In pandas, you can use concat() together with drop_duplicates().
pd.concat([df1, df2]).drop_duplicates()
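One thing SQL users may not expect is that concat() keeps each frame's original row labels, so the combined index contains repeats. The sketch below shows one possible way to get a clean 0..n-1 index; whether you need it depends on how the result is used later.

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"],
                    "rank": range(1, 4)})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"],
                    "rank": [1, 4, 5]})

# UNION ALL: the row labels 0, 1, 2 appear twice in the result
union_all = pd.concat([df1, df2])

# UNION with a fresh index: drop duplicate rows, then renumber the rest
union = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

print(union_all.index.tolist())
print(union)
```

Passing ignore_index=True to concat() is an alternative when the renumbering is wanted for the UNION ALL result itself.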