Titanic Survival Prediction (Fixes for Problems Encountered When Running in Anaconda)

 ̄綄美尐妖づ, submitted 2019-12-04 00:26:49

Machine Learning in Action (5): fixes for the code-running problems in the Titanic survival prediction example

I. Reading the data and examining its distribution

import pandas  # run in an IPython/Jupyter notebook
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.head(5))
#print(titanic.describe())  # summary statistics for each column
#print(titanic.shape)       # (891, 12)
# Output shown in the figure below:

[Figure: output of titanic.head(5), the first five rows of the training data]

1. Observations:

Survived: this column is 1 for survived, 0 for died.
Sex: stored as text, which is awkward to model, so it should be mapped to numeric values.
Age: more than a hundred values are missing; age is logically important, so the gaps need to be filled.
Ticket: the ticket numbers show no obvious pattern.
Fare: the ticket price; it is related to the cabin class (Pclass) and the length of the journey (i.e. the boarding port, Embarked).
Cabin: far too many missing values, and the meaning is unclear; ignore it for now.
Embarked: the boarding port, with three values C/S/Q; it is text, so it should be mapped to numeric values, and it has 2 missing values. (A quick missing-value check is sketched below.)
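Before filling anything in, it helps to count the gaps per column; a minimal check (not part of the original post):

print(titanic.isnull().sum())  # per-column missing counts; Age and Cabin have large gaps, Embarked only 2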

II. Data preprocessing

1. Filling missing values. Candidate strategies: mean / median / mode. For Age, both the mean and the median are worth trying (decide by the actual results); Embarked is missing only two entries and takes just three discrete values, so filling with the mode is the sensible choice.

1. Age

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())#数据填充(用均值)
print (titanic.describe())

2. Embarked

print(titanic['Embarked'].unique())  # possible values: ['S' 'C' 'Q' nan]
print(titanic['Embarked'].mode())    # the mode is 'S', so fill with 'S'
titanic['Embarked'] = titanic['Embarked'].fillna('S')
print(titanic['Embarked'].describe())

'''Output:
['S' 'C' 'Q' nan]
0    S
dtype: object
count     891
unique      3
top         S
freq      646
Name: Embarked, dtype: object'''

2. Mapping text to numeric values

(1) Sex: male → 0, female → 1

print (titanic["Sex"].unique()) #(sex的可能性)
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0  #男标为0  #Replace all the occurences of male with the number 0.
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1#女标为1
#结果:
#['male' 'female']
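As a sketch of an equivalent alternative, pandas' Series.map can do the same replacement in one call. Use it instead of, not after, the two .loc lines above, since once those run the values are already numeric:

# Alternative to the two .loc assignments above (don't run both):
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})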

(2) Port: S → 0, C → 1, Q → 2

print (titanic["Embarked"].unique())#(Embarked的可能性)
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2
print(titanic['Embarked'].describe())

Output: ['S' 'C' 'Q']
count    891.000000
mean       0.361392
std        0.635673
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        2.000000
Name: Embarked, dtype: float64

III. Models

1. Prediction with linear regression

(1) Linear regression: find a line (hyperplane) that fits the data points.
(2) In this example, the fitted model produces a survival score for each passenger; if it is greater than 0.5, the passenger is predicted to survive (Survived = 1).

# Import the linear regression class
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold  # (changed from "from sklearn.cross_validation import KFold"; the cross_validation module was removed)
#The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic dataset. kf.split() returns the row indices for the train and test folds.
# With shuffle=False the splits are deterministic, so random_state is unnecessary (and newer sklearn rejects it).
kf = KFold(n_splits=3, shuffle=False)
# (changed from kf = KFold(titanic.shape[0], n_folds=3, random_state=1))
predictions = []
for train, test in kf.split(titanic):  # (changed from "for train, test in kf:")
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
import numpy as np
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
accuracy = len(predictions[predictions == titanic['Survived']]) / len(predictions)
# (changed from accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions),
# which summed the matched prediction values, so correct 0-predictions contributed nothing, instead of counting the matches)
print(accuracy)  # Output: 0.7833894500561167
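As an aside, sklearn can run the same fold loop in one call with cross_val_predict; a minimal sketch that should reproduce the manual loop above (for a regressor, an integer cv means an unshuffled 3-fold split):

from sklearn.model_selection import cross_val_predict
preds = cross_val_predict(LinearRegression(), titanic[predictors], titanic["Survived"], cv=3)
preds = (preds > .5).astype(int)              # same 0.5 threshold as above
print((preds == titanic["Survived"]).mean())  # fraction of correct predictions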

2. Prediction with logistic regression

from sklearn.model_selection import cross_val_score
# (changed from "from sklearn import cross_validation"; prefer the public cross_val_score over the private sklearn.model_selection._validation module)
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression(random_state=1, solver='liblinear')  # (solver='liblinear' added explicitly; newer sklearn changed the default solver)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())  # Output: 0.7878787878787877
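An optional variation, not in the original: liblinear is sensitive to feature scale, so standardizing inside a Pipeline (which fits the scaler on each training fold only) is worth trying; whether it improves this particular score is not guaranteed:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features, then fit the same logistic regression, all within each CV fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1, solver='liblinear'))
scores = cross_val_score(pipe, titanic[predictors].astype(float), titanic["Survived"], cv=3)
print(scores.mean())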

3. Improving the model with a random forest

titanic_test = pandas.read_csv("test.csv")  # the test set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())  # fill with the training-set median
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2


# Random forest
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, shuffle=False)  # (changed from kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1))
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# (changed from scores = cross_validation.cross_val_score(...); cross_val_score was imported from sklearn.model_selection above)
print(scores.mean())  # Output: 0.7856341189674523

# Tuning the random forest parameters
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())  # Output: 0.8148148148148148
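Instead of hand-picking values, GridSearchCV can search the parameter space systematically; a sketch with an illustrative grid (the candidate values below are assumptions, not from the original post):

from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": [10, 50, 100],   # illustrative candidate values
              "min_samples_split": [2, 4, 8],
              "min_samples_leaf": [1, 2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=kf)
search.fit(titanic[predictors], titanic["Survived"])
print(search.best_params_, search.best_score_)  # best parameter combination and its CV score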

IV. Feature engineering examples

1. Constructing your own features (adding new features for the model to learn from)

# (1) FamilySize = SibSp + Parch
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]  # number of family members aboard
# (2) NameLength = len(Name)
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))  # length of the name

(3) Special titles extracted from the name, e.g. Miss / Dr (the status a title carries).

import re
def get_title(name):
    # Use a regular expression to search for a title. Titles consist of capital and lowercase letters and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())  # (the module-level pandas.value_counts(titles) is deprecated in newer pandas)

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
print(titles.value_counts())
titanic["Title"] = titles

Output:
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Mme           1
Capt          1
Jonkheer      1
Lady          1
Sir           1
Ms            1
Don           1
Countess      1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
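A sketch of an equivalent mapping using Series.map, which also makes unmapped titles easy to detect (they become NaN); this is an alternative to the loop above, not part of the original:

raw_titles = titanic["Name"].apply(get_title)
mapped = raw_titles.map(title_mapping)                             # any title not in the dict becomes NaN
assert not mapped.isnull().any(), "a title is missing from title_mapping"
titanic["Title"] = mapped.astype(int)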

2. Univariate feature selection with SelectKBest

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
#Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
#Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
#Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
#Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
#Output shown in the figure below:

[Figure: bar chart of the -log10(p) scores for each feature; Pclass, Sex, Title, and Fare stand out]
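Note that these scores come from a univariate F-test (f_classif), not from the forest itself. If you want the random forest's own impurity-based importances, a minimal sketch (not in the original post):

all_features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
forest = RandomForestClassifier(random_state=1, n_estimators=100)
forest.fit(titanic[all_features].astype(float), titanic["Survived"])
for name, imp in sorted(zip(all_features, forest.feature_importances_), key=lambda t: -t[1]):
    print(name, round(imp, 3))  # features sorted by importance, largest first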

3. Ensembling GradientBoostingClassifier and LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
#The algorithms we want to ensemble.
#We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1, solver='liblinear'), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]  # (GradientBoostingClassifier and LogisticRegression combined into an ensemble)
# Initialize the cross-validation folds
kf = KFold(n_splits=3, shuffle=False)  # (changed from kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1))
predictions = []
for train, test in kf.split(titanic):  # (changed from "for train, test in kf:")
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)
#Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
#Compute accuracy by comparing to the training data.
accuracy = len(predictions[predictions == titanic["Survived"]]) / len(predictions)
# (changed from accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions), which summed values instead of counting matches)
print(accuracy)  # Output: 0.8215488215488216
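One loose end: the preprocessed titanic_test from section III is never actually used above. A hedged sketch of how one might predict it and write a submission file; FamilySize and Title must be built the same way as for the training set, and the fallback for unseen titles and the output file name are assumptions:

titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
test_titles = titanic_test["Name"].apply(get_title).map(title_mapping)
titanic_test["Title"] = test_titles.fillna(1)  # assumption: treat unseen titles like "Mr"
feats = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]
model = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
model.fit(titanic[feats].astype(float), titanic["Survived"])
titanic_test["Survived"] = model.predict(titanic_test[feats].astype(float)).astype(int)
titanic_test[["PassengerId", "Survived"]].to_csv("submission.csv", index=False)  # file name is illustrative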