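This is based on a translation by SeanCheney, a well-known author on Jianshu. I adjusted the formatting and reorganized the table of contents so that it suits my own reading and is easier to search when I come back to it later.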
import pandas as pd
import numpy as np
Set the maximum number of displayed columns and rows
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 10)
1 Structure of a DataFrame
movie = pd.read_csv('data/movie.csv')
movie.shape
(4916, 28)
2 Accessing the components of a DataFrame
2.1 Getting the components and their types
columns = movie.columns
type(columns)
pandas.core.indexes.base.Index
columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
index = movie.index
type(index)
pandas.core.indexes.range.RangeIndex
index
RangeIndex(start=0, stop=4916, step=1)
data = movie.values
type(data)
numpy.ndarray
data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000], ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0], ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000], ..., ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16], ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660], ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
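The three components above (column labels, row labels, and values) can be reproduced without the CSV file. The following is a minimal sketch on a hypothetical three-row frame; the sample rows are made up for illustration, and `to_numpy()` is used in place of `.values` because the pandas documentation recommends it for new code.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the movie data (not read from data/movie.csv).
df = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
        "imdb_score": [7.9, 7.1, 6.8],
    }
)

columns = df.columns  # column labels, stored in an Index
index = df.index      # row labels, a RangeIndex by default
data = df.to_numpy()  # the values as a two-dimensional NumPy array

print(type(columns).__name__, type(index).__name__, type(data).__name__)
# Mixing string and float columns forces a common dtype for the array.
print(data.dtype)
```

Because the frame mixes text and numbers, the array falls back to the generic object dtype, which matches the `dtype=object` shown in the output above.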
2.2 Index types
Check whether RangeIndex is a subclass of Index
issubclass(pd.core.indexes.range.RangeIndex,pd.Index)
True
Access the values of the index. index.values is a NumPy array, so it can be indexed or sliced
index.values
array([ 0, 1, 2, ..., 4913, 4914, 4915])
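As a small illustration of indexing and slicing an index, here is a sketch on a hypothetical ten-element `RangeIndex` (the size is arbitrary and chosen only to keep the example short).

```python
import pandas as pd

# Hypothetical index with ten labels, standing in for movie.index.
index = pd.RangeIndex(start=0, stop=10, step=1)

first = index[0]           # a single label
subset = index[2:5]        # a slice returns another index object
picked = index[[1, 3, 5]]  # a list of positions selects several labels
values = index.values      # the labels as a NumPy array

print(first, list(subset), list(picked), values[-1])
```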
3 Understanding data types
movie.dtypes
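color                      object
director_name              object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
...
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes       int64
Length: 28, dtype: object
Show how many columns have each dtype (get_dtype_counts() was removed in pandas 1.0; on newer versions use movie.dtypes.value_counts() instead)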
movie.get_dtype_counts()
float64    13
int64      3
object     12
dtype: int64
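Since `get_dtype_counts()` is not available in pandas 1.0 and later, the same count can be obtained from `dtypes.value_counts()`. This is a minimal sketch on a hypothetical four-column frame; note that `value_counts()` orders the result by count, so the row order may differ from the output above.

```python
import pandas as pd

# Hypothetical frame with one text column, two float columns and one integer column.
df = pd.DataFrame(
    {
        "color": ["Color", "Color"],
        "duration": [178.0, 169.0],
        "imdb_score": [7.9, 7.1],
        "movie_facebook_likes": [33000, 0],
    }
)

# Count how many columns share each dtype.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Plain dict keyed by dtype name, convenient for lookups.
counts_by_name = {str(dtype): int(n) for dtype, n in dtype_counts.items()}
```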
4 Structure of a Series
Select a single column; the result is a Series
movie['director_name']
0       James Cameron
1       Gore Verbinski
2       Sam Mendes
3       Christopher Nolan
4       Doug Walker
...
4911    Scott Smith
4912    NaN
4913    Benjamin Roberds
4914    Daniel Hsia
4915    Jon Gunn
Name: director_name, Length: 4916, dtype: object
A column can also be selected with attribute access
movie.director_name
0       James Cameron
1       Gore Verbinski
2       Sam Mendes
3       Christopher Nolan
4       Doug Walker
...
4911    Scott Smith
4912    NaN
4913    Benjamin Roberds
4914    Daniel Hsia
4915    Jon Gunn
Name: director_name, Length: 4916, dtype: object
type(movie['director_name'])
pandas.core.series.Series
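Bracket and attribute access return the same Series, but attribute access has a limitation worth knowing: if a column name collides with an existing DataFrame method, the attribute resolves to the method. The sketch below uses a hypothetical frame with a column named `count` to show this.

```python
import pandas as pd

# Hypothetical frame; the column name "count" deliberately clashes with DataFrame.count.
df = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski"],
        "count": [723, 302],
    }
)

by_bracket = df["director_name"]
by_attribute = df.director_name
print(by_bracket.equals(by_attribute))

# Attribute access finds the method, bracket access finds the column.
print(callable(df.count))
print(type(df["count"]).__name__)
```

For that reason bracket access is the safer default in scripts, while attribute access remains handy for interactive exploration.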
4.1 Calling Series methods
Count the distinct attributes and methods of Series
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)
464
Count the distinct attributes and methods of DataFrame
df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)
460
Count the attributes and methods that the two sets have in common
len(s_attr_methods & df_attr_methods)
399
4.2 Basic Series methods
Select the director_name and actor_1_facebook_likes columns
director = movie['director_name']
actor_1_fb_likes = movie['actor_1_facebook_likes']
View the first rows of the Series
director.head()
0    James Cameron
1    Gore Verbinski
2    Sam Mendes
3    Christopher Nolan
4    Doug Walker
Name: director_name, dtype: object
Count how many times each value occurs
director.value_counts()
Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Spike Lee           16
..
John Duigan         1
Ray Griggs          1
Lena Dunham         1
Dario Argento       1
Eric Mendelsohn     1
Name: director_name, Length: 2397, dtype: int64
Compute the relative frequency of each value
director.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
Spike Lee           0.003324
...
John Duigan         0.000208
Ray Griggs          0.000208
Lena Dunham         0.000208
Dario Argento       0.000208
Eric Mendelsohn     0.000208
Name: director_name, Length: 2397, dtype: float64
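The frequencies above are computed over non-missing values only (26 / 4814 ≈ 0.005401, not 26 / 4916). The sketch below, on a hypothetical five-element Series, shows that behaviour and the `dropna=False` option that counts missing values as well.

```python
import pandas as pd

# Hypothetical Series with one missing entry.
s = pd.Series(["A", "B", "A", None, "A"])

counts = s.value_counts()                    # missing values are excluded
shares = s.value_counts(normalize=True)      # divides by the non-missing count
with_missing = s.value_counts(dropna=False)  # keeps a row for missing values

print(counts)
print(shares)
print(with_missing)
```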
Length-related attributes
len(director)
4916
director.size
4916
director.shape
(4916,)
Number of non-missing values in director
director.count()
4814
Number of missing values (a more direct way is shown in section 4.4)
director.size - director.count()
102
4.3 Series statistics
Minimum, maximum, mean, median, standard deviation, and sum
actor_1_fb_likes.min(), actor_1_fb_likes.max()
(0.0, 640000.0)
actor_1_fb_likes.mean(), actor_1_fb_likes.median()
(6494.488490527602, 982.0)
actor_1_fb_likes.std(), actor_1_fb_likes.sum()
(15106.986883848309, 31881444.0)
Summary statistics for a numeric Series
actor_1_fb_likes.describe()
count    4909.000000
mean     6494.488491
std      15106.986884
min      0.000000
25%      607.000000
50%      982.000000
75%      11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64
Summary statistics for a string (object) Series
director.describe()
count     4814
unique    2397
top       Steven Spielberg
freq      26
Name: director_name, dtype: object
Arbitrary quantiles
actor_1_fb_likes.quantile(.2)
510.0
actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
0.1    240.0
0.2    510.0
0.3    694.0
0.4    854.0
0.5    982.0
0.6    1000.0
0.7    8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, dtype: float64
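`quantile` returns a scalar for a single value and a Series indexed by the requested quantiles for a list. A minimal sketch on a hypothetical five-element Series, using the default linear interpolation:

```python
import pandas as pd

# Hypothetical evenly spaced values so the quantiles are easy to verify by hand.
s = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0])

median = s.quantile(0.5)             # scalar result
quartiles = s.quantile([0.25, 0.75]) # Series indexed by the quantile levels

print(median)
print(quartiles)
```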
4.4 Handling missing values
Check whether the Series contains any missing values
actor_1_fb_likes.hasnans
True
Number of missing values
actor_1_fb_likes.isnull().sum()
7
Select the missing values
actor_1_fb_likes[actor_1_fb_likes.isnull()]
4401    NaN
4418    NaN
4608    NaN
4721    NaN
4822    NaN
4823    NaN
4864    NaN
Name: actor_1_facebook_likes, dtype: float64
Boolean masks: isnull() marks missing values, notnull() marks non-missing values
actor_1_fb_likes.isnull()
0       False
1       False
2       False
3       False
4       False
...
4911    False
4912    False
4913    False
4914    False
4915    False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
bool_sig = actor_1_fb_likes.notnull()
Check whether every boolean is True (that is, whether no value is missing)
bool_sig.all()
False
Fill missing values
actor_1_fb_likes.count()
4909
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()
4916
Drop missing values
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.size
4909
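The missing-value steps in this section can be tried end to end on a small Series. The sketch below uses hypothetical values and also confirms that `size - count()` from section 4.2 agrees with `isna().sum()`; `isna` and `notna` are aliases of `isnull` and `notnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical likes column with two missing entries.
likes = pd.Series([1000.0, np.nan, 936.0, np.nan, 131.0])

has_missing = likes.hasnans           # True if any value is missing
n_missing = int(likes.isna().sum())   # direct count of missing values
missing_only = likes[likes.isna()]    # boolean mask used as a filter

filled = likes.fillna(0)   # new Series with missing values replaced by 0
dropped = likes.dropna()   # new Series without the missing rows

print(has_missing, n_missing, filled.count(), dropped.size)
```

Both `fillna` and `dropna` return new objects here, so `likes` still contains its missing values afterwards.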
4.5 Using operators on a Series
imdb_score = movie['imdb_score']
Arithmetic operators (addition, subtraction, multiplication, division)
imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
...
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
The equivalent method call
imdb_score.add(1)
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
...
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
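The operator and the method give the same result for a plain addition. One reason to prefer the method form is its `fill_value` parameter, which substitutes a value where only one operand is missing. A minimal sketch with two hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical operands, each with one missing entry in a different position.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

same = (a + 1).equals(a.add(1))        # operator and method agree

plain_sum = a + b                      # NaN wherever either side is missing
filled_sum = a.add(b, fill_value=0)    # a missing operand is treated as 0

print(same)
print(plain_sum.tolist())
print(filled_sum.tolist())
```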
4.6 Type conversion
imdb_score.dtype
dtype('float64')
imdb_score = imdb_score.astype(int)
imdb_score.dtype
dtype('int64')
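Two details of `astype(int)` are easy to miss: it truncates the fractional part rather than rounding, and it fails when the Series contains missing values. The sketch below illustrates both on hypothetical scores and shows the nullable `Int64` dtype as one possible way around the second issue.

```python
import numpy as np
import pandas as pd

# Hypothetical scores without missing values.
scores = pd.Series([7.9, 7.1, 6.8])

truncated = scores.astype(int)         # drops the fractional part
rounded = scores.round().astype(int)   # rounds first, then converts

# Hypothetical scores with a missing value.
with_nan = pd.Series([7.9, np.nan])
try:
    with_nan.astype(int)
    conversion_failed = False
except ValueError:
    # NaN has no representation in a NumPy integer dtype.
    conversion_failed = True

# The nullable integer dtype (capital "I") can hold missing values.
nullable = with_nan.round().astype("Int64")

print(truncated.tolist(), rounded.tolist(), conversion_failed)
print(nullable)
```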
5 Making the DataFrame index meaningful
movie.shape
(4916, 28)
movie.tail()
| | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
4911 | Color | Scott Smith | ... | NaN | 84 |
4912 | Color | NaN | ... | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | ... | NaN | 16 |
4914 | Color | Daniel Hsia | ... | 2.35 | 660 |
4915 | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 28 columns
5.1 Naming the row and column indexes
movie.index.name = 'row_index'
movie.columns.name = 'col_index'
movie.tail()
col_index | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
row_index | |||||
4911 | Color | Scott Smith | ... | NaN | 84 |
4912 | Color | NaN | ... | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | ... | NaN | 16 |
4914 | Color | Daniel Hsia | ... | 2.35 | 660 |
4915 | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 28 columns
5.2 Setting and resetting the index
Use one or more existing columns of the DataFrame as the index
movie2 = movie.set_index('movie_title')
movie2.tail()
col_index | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
movie_title | |||||
Signed Sealed Delivered | Color | Scott Smith | ... | NaN | 84 |
The Following | Color | NaN | ... | 16.00 | 32000 |
A Plague So Pleasant | Color | Benjamin Roberds | ... | NaN | 16 |
Shanghai Calling | Color | Daniel Hsia | ... | 2.35 | 660 |
My Date with Drew | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 27 columns
Another way: choose the index column while reading the file
movie = pd.read_csv('data/movie.csv',index_col = 'movie_title')
Restore the default integer index
movie2.reset_index().tail()
col_index | movie_title | color | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
4911 | Signed Sealed Delivered | Color | ... | NaN | 84 |
4912 | The Following | Color | ... | 16.00 | 32000 |
4913 | A Plague So Pleasant | Color | ... | NaN | 16 |
4914 | Shanghai Calling | Color | ... | 2.35 | 660 |
4915 | My Date with Drew | Color | ... | 1.85 | 456 |
5 rows × 28 columns
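The round trip between `set_index` and `reset_index` can be checked on a small frame. This is a minimal sketch with two hypothetical rows; once the title is the index, rows can be looked up by title with `.loc`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie data.
df = pd.DataFrame(
    {
        "movie_title": ["Avatar", "Spectre"],
        "imdb_score": [7.9, 6.8],
    }
)

by_title = df.set_index("movie_title")  # the column becomes the row index
score = by_title.loc["Avatar", "imdb_score"]

restored = by_title.reset_index()       # the index moves back into a column

print(by_title.shape, score)
print(list(restored.columns))
```

Moving a column into the index reduces the column count by one, which is why the output above reports 27 columns after `set_index` and 28 again after `reset_index`.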
6 Renaming row and column labels
Rename with rename()
idx_rename = {'Avatar':'Ratava', 'Spectre': 'Ertceps'}
col_rename = {'director_name':'Director Name','num_critic_for_reviews': 'Critical Reviews'}
movie.rename(index=idx_rename, columns=col_rename).head()
| | color | Director Name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
movie_title | |||||
Ratava | Color | James Cameron | ... | 1.78 | 33000 |
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
Ertceps | Color | Sam Mendes | ... | 2.35 | 85000 |
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
Another way: convert the labels to lists, modify them, and assign them back
index = movie.index
columns = movie.columns
index_list = index.tolist()
column_list = columns.tolist()
index_list[0] = 'Ratava'
index_list[2] = 'Ertceps'
column_list[1] = 'Director Name'
column_list[2] = 'Critical Reviews'
movie.index = index_list
movie.columns = column_list
movie.head()
| | color | Director Name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
Ratava | Color | James Cameron | ... | 1.78 | 33000 |
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
Ertceps | Color | Sam Mendes | ... | 2.35 | 85000 |
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
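The two approaches differ in one important respect: `rename()` returns a new DataFrame, while assigning to `index` or `columns` changes the DataFrame in place. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame indexed by movie title.
df = pd.DataFrame(
    {"director_name": ["James Cameron", "Sam Mendes"], "imdb_score": [7.9, 6.8]},
    index=["Avatar", "Spectre"],
)

# rename() maps old labels to new ones and leaves df untouched.
renamed = df.rename(
    index={"Avatar": "Ratava"},
    columns={"director_name": "Director Name"},
)
labels_after_rename = list(df.index)

# Assigning a full list of labels modifies df itself.
new_columns = df.columns.tolist()
new_columns[0] = "Director Name"
df.columns = new_columns

print(list(renamed.index), list(renamed.columns))
print(labels_after_rename, list(df.columns))
```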
7 Creating and deleting columns
Add a new column by assigning to [column name]
movie = pd.read_csv('data/movie.csv')
movie['has_seen'] = 0
movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] + movie['actor_2_facebook_likes'])
movie.shape,movie['actor_director_facebook_likes'].shape
((4916, 30), (4916,))
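Column assignment accepts either a scalar, which is broadcast to every row, or a Series of matching length. The sketch below shows both on a hypothetical two-row frame; the column name `actor_total_facebook_likes` is made up for this example.

```python
import pandas as pd

# Hypothetical likes for two movies.
df = pd.DataFrame(
    {
        "actor_1_facebook_likes": [1000.0, 40000.0],
        "actor_2_facebook_likes": [936.0, 5000.0],
    }
)

df["has_seen"] = 0  # scalar: the same value in every row

# Series arithmetic: rows are added element by element.
df["actor_total_facebook_likes"] = (
    df["actor_1_facebook_likes"] + df["actor_2_facebook_likes"]
)

print(df.shape)
print(df["actor_total_facebook_likes"].tolist())
```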
Drop rows or columns (drop() returns a new DataFrame; movie itself is unchanged)
movie.drop(['actor_director_facebook_likes','actor_1_facebook_likes'],axis=1)
| | color | director_name | ... | movie_facebook_likes | has_seen |
---|---|---|---|---|---|
0 | Color | James Cameron | ... | 33000 | 0 |
1 | Color | Gore Verbinski | ... | 0 | 0 |
2 | Color | Sam Mendes | ... | 85000 | 0 |
3 | Color | Christopher Nolan | ... | 164000 | 0 |
4 | NaN | Doug Walker | ... | 0 | 0 |
... | ... | ... | ... | ... | ... |
4911 | Color | Scott Smith | ... | 84 | 0 |
4912 | Color | NaN | ... | 32000 | 0 |
4913 | Color | Benjamin Roberds | ... | 16 | 0 |
4914 | Color | Daniel Hsia | ... | 660 | 0 |
4915 | Color | Jon Gunn | ... | 456 | 0 |
4916 rows × 28 columns
movie.drop([0,2])
| | color | director_name | ... | has_seen | actor_director_facebook_likes |
---|---|---|---|---|---|
1 | Color | Gore Verbinski | ... | 0 | 45000.0 |
3 | Color | Christopher Nolan | ... | 0 | 50000.0 |
4 | NaN | Doug Walker | ... | 0 | 143.0 |
5 | Color | Andrew Stanton | ... | 0 | 1272.0 |
6 | Color | Sam Raimi | ... | 0 | 35000.0 |
... | ... | ... | ... | ... | ... |
4911 | Color | Scott Smith | ... | 0 | 1107.0 |
4912 | Color | NaN | ... | 0 | 1434.0 |
4913 | Color | Benjamin Roberds | ... | 0 | 0.0 |
4914 | Color | Daniel Hsia | ... | 0 | 1665.0 |
4915 | Color | Jon Gunn | ... | 0 | 109.0 |
4914 rows × 30 columns
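The second output above still has 30 columns because the earlier column drop was never assigned back. The following sketch on a hypothetical 3×3 frame makes that explicit and uses the `columns=` and `index=` keywords, which some readers find clearer than `axis=1`.

```python
import pandas as pd

# Hypothetical 3x3 frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_b = df.drop(columns=["b"])     # same as df.drop(["b"], axis=1)
without_rows = df.drop(index=[0, 2])   # same as df.drop([0, 2])

# df is unchanged; keep a result by assigning it to a name.
print(df.shape, without_b.shape, without_rows.shape)
```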
Source: https://www.cnblogs.com/shiyushiyu/p/9712998.html