Titanic Survival Prediction (Fixes for Problems Encountered When Running in Anaconda)

 ̄綄美尐妖づ, submitted 2019-12-04 00:26:49

Machine Learning in Action (5): fixes for the code-running problems in the Titanic survival prediction example

I. Reading the data and examining its distribution

import pandas  # run in an IPython/Jupyter notebook
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.head(5))
#print(titanic.describe())  # summary statistics for each column
#print(titanic.shape)       # (891, 12)
# Output shown in the figure below:

[Figure: output of titanic.head(5), the first five rows of the training data]

1. Observations:

Survived: this column is 1 for survived, 0 for died.
Sex: stored as text, which is awkward to model, so it should be mapped to numeric values.
Age: more than a hundred values are missing; age is logically important, so the gaps need to be filled.
Ticket: the ticket numbers show no obvious pattern.
Fare: the ticket price; it is related to the cabin class (Pclass) and the length of the journey (i.e. the boarding port, Embarked).
Cabin: far too many missing values, and the meaning is unclear; ignore it for now.
Embarked: the boarding port, with three values C/S/Q; it is text, so it should be mapped to numeric values, and it has 2 missing values. (A quick missing-value check is sketched below.)
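Before filling anything in, it helps to count the gaps per column; a minimal check (not part of the original post):

print(titanic.isnull().sum())  # per-column missing counts; Age and Cabin have large gaps, Embarked only 2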

II. Data preprocessing

1. Filling missing values. Candidate strategies: mean / median / mode. For Age, both the mean and the median are worth trying (decide by the actual results); Embarked is missing only two entries and takes just three discrete values, so filling with the mode is the sensible choice.

1. Age

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())#数据填充(用均值)
print (titanic.describe())

2. Embarked

print(titanic['Embarked'].unique())  # possible values: ['S' 'C' 'Q' nan]
print(titanic['Embarked'].mode())    # the mode is 'S', so fill with 'S'
titanic['Embarked'] = titanic['Embarked'].fillna('S')
print(titanic['Embarked'].describe())

'''Output:
['S' 'C' 'Q' nan]
0    S
dtype: object
count     891
unique      3
top         S
freq      646
Name: Embarked, dtype: object'''

2. Mapping text to numeric values

(1) Sex: male → 0, female → 1

print (titanic["Sex"].unique()) #(sex的可能性)
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0  #男标为0  #Replace all the occurences of male with the number 0.
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1#女标为1
#结果:
#['male' 'female']
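As a sketch of an equivalent alternative, pandas' Series.map can do the same replacement in one call. Use it instead of, not after, the two .loc lines above, since once those run the values are already numeric:

# Alternative to the two .loc assignments above (don't run both):
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})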

(2) Port: S → 0, C → 1, Q → 2

print (titanic["Embarked"].unique())#(Embarked的可能性)
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2
print(titanic['Embarked'].describe())

Output: ['S' 'C' 'Q']
count    891.000000
mean       0.361392
std        0.635673
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        2.000000
Name: Embarked, dtype: float64

III. Models

1. Prediction with linear regression

(1) Linear regression: find a line (hyperplane) that fits the data points.
(2) In this example, the fitted model produces a survival score for each passenger; if it is greater than 0.5, the passenger is predicted to survive (Survived = 1).

# Import the linear regression class
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold  # (changed from "from sklearn.cross_validation import KFold"; the cross_validation module was removed)
#The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic dataset. kf.split() returns the row indices for the train and test folds.
# With shuffle=False the splits are deterministic, so random_state is unnecessary (and newer sklearn rejects it).
kf = KFold(n_splits=3, shuffle=False)
# (changed from kf = KFold(titanic.shape[0], n_folds=3, random_state=1))
predictions = []
for train, test in kf.split(titanic):  # (changed from "for train, test in kf:")
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
import numpy as np
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
accuracy = len(predictions[predictions == titanic['Survived']]) / len(predictions)
# (changed from accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions),
# which summed the matched prediction values, so correct 0-predictions contributed nothing, instead of counting the matches)
print(accuracy)  # Output: 0.7833894500561167
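As an aside, sklearn can run the same fold loop in one call with cross_val_predict; a minimal sketch that should reproduce the manual loop above (for a regressor, an integer cv means an unshuffled 3-fold split):

from sklearn.model_selection import cross_val_predict
preds = cross_val_predict(LinearRegression(), titanic[predictors], titanic["Survived"], cv=3)
preds = (preds > .5).astype(int)              # same 0.5 threshold as above
print((preds == titanic["Survived"]).mean())  # fraction of correct predictions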

2. Prediction with logistic regression

from sklearn.model_selection import cross_val_score
# (changed from "from sklearn import cross_validation"; prefer the public cross_val_score over the private sklearn.model_selection._validation module)
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression(random_state=1, solver='liblinear')  # (solver='liblinear' added explicitly; newer sklearn changed the default solver)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())  # Output: 0.7878787878787877
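An optional variation, not in the original: liblinear is sensitive to feature scale, so standardizing inside a Pipeline (which fits the scaler on each training fold only) is worth trying; whether it improves this particular score is not guaranteed:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Scale the features, then fit the same logistic regression, all within each CV fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1, solver='liblinear'))
scores = cross_val_score(pipe, titanic[predictors].astype(float), titanic["Survived"], cv=3)
print(scores.mean())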

3. Improving the model with a random forest

titanic_test = pandas.read_csv("test.csv")  # the test set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())  # fill with the training-set median
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2


# Random forest
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, shuffle=False)  # (changed from kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1))
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# (changed from scores = cross_validation.cross_val_score(...); cross_val_score was imported from sklearn.model_selection above)
print(scores.mean())  # Output: 0.7856341189674523

# Tuning the random forest parameters
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())  # Output: 0.8148148148148148
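Instead of hand-picking values, GridSearchCV can search the parameter space systematically; a sketch with an illustrative grid (the candidate values below are assumptions, not from the original post):

from sklearn.model_selection import GridSearchCV
param_grid = {"n_estimators": [10, 50, 100],   # illustrative candidate values
              "min_samples_split": [2, 4, 8],
              "min_samples_leaf": [1, 2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=kf)
search.fit(titanic[predictors], titanic["Survived"])
print(search.best_params_, search.best_score_)  # best parameter combination and its CV score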

IV. Feature engineering examples

1. Constructing your own features (adding new features for the model to learn from)

# (1) FamilySize = SibSp + Parch
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]  # number of family members aboard
# (2) NameLength = len(Name)
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))  # length of the name

(3) Special titles extracted from the name, e.g. Miss / Dr (the status a title carries).

import re
def get_title(name):
    # Use a regular expression to search for a title. Titles consist of capital and lowercase letters and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())  # (the module-level pandas.value_counts(titles) is deprecated in newer pandas)

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
print(titles.value_counts())
titanic["Title"] = titles

Output:
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Mme           1
Capt          1
Jonkheer      1
Lady          1
Sir           1
Ms            1
Don           1
Countess      1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
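A sketch of an equivalent mapping using Series.map, which also makes unmapped titles easy to detect (they become NaN); this is an alternative to the loop above, not part of the original:

raw_titles = titanic["Name"].apply(get_title)
mapped = raw_titles.map(title_mapping)                             # any title not in the dict becomes NaN
assert not mapped.isnull().any(), "a title is missing from title_mapping"
titanic["Title"] = mapped.astype(int)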

2. Univariate feature selection with SelectKBest

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
#Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
#Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)
#Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
#Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
#Output shown in the figure below:

[Figure: bar chart of the -log10(p) scores for each feature; Pclass, Sex, Title, and Fare stand out]
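Note that these scores come from a univariate F-test (f_classif), not from the forest itself. If you want the random forest's own impurity-based importances, a minimal sketch (not in the original post):

all_features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]
forest = RandomForestClassifier(random_state=1, n_estimators=100)
forest.fit(titanic[all_features].astype(float), titanic["Survived"])
for name, imp in sorted(zip(all_features, forest.feature_importances_), key=lambda t: -t[1]):
    print(name, round(imp, 3))  # features sorted by importance, largest first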

3. Ensembling GradientBoostingClassifier and LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
#The algorithms we want to ensemble.
#We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1, solver='liblinear'), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]  # (GradientBoostingClassifier and LogisticRegression combined into an ensemble)
# Initialize the cross-validation folds
kf = KFold(n_splits=3, shuffle=False)  # (changed from kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1))
predictions = []
for train, test in kf.split(titanic):  # (changed from "for train, test in kf:")
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)
#Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
#Compute accuracy by comparing to the training data.
accuracy = len(predictions[predictions == titanic["Survived"]]) / len(predictions)
# (changed from accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions), which summed values instead of counting matches)
print(accuracy)  # Output: 0.8215488215488216
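One loose end: the preprocessed titanic_test from section III is never actually used above. A hedged sketch of how one might predict it and write a submission file; FamilySize and Title must be built the same way as for the training set, and the fallback for unseen titles and the output file name are assumptions:

titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
test_titles = titanic_test["Name"].apply(get_title).map(title_mapping)
titanic_test["Title"] = test_titles.fillna(1)  # assumption: treat unseen titles like "Mr"
feats = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]
model = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
model.fit(titanic[feats].astype(float), titanic["Survived"])
titanic_test["Survived"] = model.predict(titanic_test[feats].astype(float)).astype(int)
titanic_test[["PassengerId", "Survived"]].to_csv("submission.csv", index=False)  # file name is illustrative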