[Python] A Decision Tree for Second-Hand Home Purchases on Lianjia

Overall workflow:
1. Data scraping;
2. Data cleaning;
3. Modeling and optimization;
4. Business implications;
5. Reflections.

I. Data Scraping

Environment: Python 3.7

import csv
import time

import requests
from parsel import Selector

base_url = 'https://tj.lianjia.com/ershoufang/pg%s/'
rows = []
for i in range(1, 3):  # test on two pages first, then widen the range to scrape everything
    response = requests.get(base_url % i)
    time.sleep(2)  # pause 2 seconds so requests are not sent too frequently
    sel = Selector(text=response.text)
    for x in sel.css('.info.clear'):  # selector paths found with Chrome DevTools
        title = x.css('a::text').get()
        community = x.css('.houseInfo>a::text').get()
        # getall() returns a list of text nodes, so join them into one string
        address = ''.join(x.css('.address>::text').getall())
        flood = ''.join(x.css('.flood>::text').getall())
        total_price = x.css('.totalPrice>span::text').get()
        # '天津' is Tianjin; 'W' stands for 万 (10,000 yuan)
        rows.append([title, '天津%s' % community, address, flood, '%sW' % total_price])

# csv.writer quotes fields that contain commas, which listing titles often do
with open('tianjin_datas.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)
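
The scraper above waits a fixed 2 seconds between requests and stops at the first network error. A minimal sketch of a more defensive fetch with retries and a growing delay follows. The helper `fetch_with_retry`, its parameters, and the injected `get` callable are hypothetical names introduced for illustration, not part of the original script.

```python
import time


def fetch_with_retry(get, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url); on failure wait base_delay, 2*base_delay, ... and retry.

    `get` is any callable that returns a response or raises an exception,
    for example: lambda u: requests.get(u, timeout=10).
    Hypothetical helper, shown only as a sketch.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return get(url)
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))  # 2 s, then 4 s, ...
    raise last_error


# Demo with a fake `get` that fails twice before succeeding (no network needed).
calls = {"n": 0}
waits = []


def flaky_get(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "page for " + url


result = fetch_with_retry(flaky_get, "https://tj.lianjia.com/ershoufang/pg1/",
                          sleep=waits.append)
print(result)
print(waits)  # the delays that would have been slept
```

In the scraping loop, `requests.get(base_url % i)` could be wrapped this way so that one failed page does not abort the whole run.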

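II. Data Cleaning

The 48,359 scraped records are cleaned (duplicates, missing values, outliers, string cleanup, and numeric type conversion) to prepare them for modeling: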

Before cleaning:
[Figure: raw data before cleaning]

Cleaning code:

import pandas as pd
import numpy as np  # used for the integer dtype conversions below

# Load the raw data
lianjia = pd.read_csv('lianjia_raw_data.csv')

# Remove duplicates, unneeded columns, and rows with missing values
lianjia = lianjia.drop_duplicates()
lianjia = lianjia.drop(['location', 'community', 'attention', 'look_times', 'release_days', 'title'], axis=1)
print(lianjia.isnull().sum())  # count missing values per column
lianjia = lianjia.dropna(axis=0, how='any')
# lianjia.head()
# lianjia.shape

# Strip surrounding whitespace
for col in ['elevator', 'hall', 'oriented']:
    lianjia[col] = lianjia[col].str.strip()

# Replace Chinese values with English labels to prepare for modeling
lianjia['floor'] = lianjia['floor'].replace(['低', '中', '高'], ['low', 'middle', 'high'])
lianjia['elevator'] = lianjia['elevator'].replace(['有', '无'], ['yes', 'no'])
lianjia['decoration'] = lianjia['decoration'].replace(['毛坯', '简装', '精装', '其他'], ['blank', 'lite', 'fine', 'others'])
lianjia['hall'] = lianjia['hall'].replace(['1室0厅','1室1厅','1室2厅','2室0厅','2室1厅','2室2厅','2室3厅','3室1厅','3室2厅','3室3厅','4室1厅','4室2厅','4室3厅','5室0厅','5室1厅','5室2厅','5室3厅','6室3厅'],['one room zero hall','one room one hall','one room two hall','two room zero hall','two room one hall','two room two hall','two room three hall','three room one hall','three room two hall','three room three hall','four room one hall','four room two hall','four room three hall','five room zero hall','five room one hall','five room two hall','five room three hall','six room three hall'])
lianjia['tag'] = lianjia['tag'].replace(['近地铁', '房本满两年', '房本满五年', '随时看房'], ['near subway', 'house over two years', 'house over five years', 'visit any time'])
lianjia['oriented'] = lianjia['oriented'].str[:1]  # slice: keep only the first character
lianjia['oriented'] = lianjia['oriented'].replace(['南', '东', '西', '北'], ['south', 'east', 'west', 'north'])
print(lianjia.head(10))  # check the result of the replacements

# Convert numeric columns to integers
lianjia['total_price'] = lianjia['total_price'].astype(np.int64)
lianjia['avg_price'] = lianjia['avg_price'].astype(np.int64)
lianjia['layers'] = lianjia['layers'].astype(np.int64)
print(lianjia.head(10))  # check the result

# Save the data; index=False keeps the row index out of the file,
# so it does not come back as an extra feature column when reloaded
lianjia.to_csv('lianjia_tree_model.csv', index=False)
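
The cleaning steps listed for this section include outliers, but the code shown contains no outlier step. As a minimal sketch of one common rule (Tukey's 1.5 x IQR fences), and only as an assumption about what such a step could look like, the example below filters a made-up list of prices in plain Python. The helper names and the numbers are illustrative.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 1]) of an already sorted list."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    ordered = sorted(values)
    q1 = percentile(ordered, 0.25)
    q3 = percentile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up total prices; the last value is an obvious outlier.
prices = [120, 135, 150, 160, 175, 180, 200, 210, 230, 5000]
low, high = iqr_bounds(prices)
kept = [p for p in prices if low <= p <= high]
print(low, high)  # 70.0 290.0
print(kept)       # every price except 5000
```

With pandas, the same fences could be computed from `Series.quantile(0.25)` and `Series.quantile(0.75)` and applied as a boolean mask on `total_price`.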

After cleaning:
[Figure: data after cleaning]

Building the purchase-intention label (intention):
[Figure: data with the intention column]

III. Building the Decision Tree

Modules: pandas, scikit-learn;
Algorithm: CART;
Environment: IPython3.

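import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call

df = pd.read_csv('lianjia_tree_model.csv')
# df.head()
# df.shape

X = df.drop('intention', axis=1)
y = df.intention

# Encode every column as integers, keeping one LabelEncoder per column
d = defaultdict(LabelEncoder)
X_trans = X.apply(lambda x: d[x.name].fit_transform(x))
# X_trans.head()

# The original code used X_train / y_train without defining them; a hold-out
# split is assumed here (test_size and random_state are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(X_trans, y, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier(max_depth=4)  # this step needs to be run on its own
clf = clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))

# Export the fitted tree to a Graphviz .dot file
with open("house_buy.dot", 'w') as f:
    tree.export_graphviz(clf,
                         out_file=f,
                         max_depth=4,
                         impurity=True,
                         feature_names=list(X_train),
                         class_names=['not buy', 'buy'],  # assumes intention is coded 0 = not buy, 1 = buy
                         rounded=True,
                         filled=True)

# Render the .dot file to PNG (requires the Graphviz `dot` executable on PATH)
check_call(['dot', '-Tpng', 'house_buy.dot', '-o', 'house_buy.png'])

PImage("house_buy.png")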
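
The `LabelEncoder` step replaces each distinct value in a column with an integer code, assigned in sorted order of the values. The plain-Python sketch below mimics that mapping to make one side effect visible. The helper `label_encode` is hypothetical and written only for illustration.

```python
def label_encode(values):
    """Map each distinct value to its position in the sorted list of values."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


codes, classes = label_encode(['low', 'middle', 'high', 'low'])
print(classes)  # ['high', 'low', 'middle']
print(codes)    # [1, 2, 0, 1]
```

The codes follow alphabetical order rather than the natural low / middle / high order, so a split such as `floor <= 0.5` separates 'high' from the other two values. A tree can still use such codes, but the thresholds in the rendered tree should be read with this mapping in mind.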

The initial result is shown below; the model is then tuned step by step:
[Figure: initial decision tree]

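Before tuning:
1. Is the model overfitting?
(1) Judging by the branches, the splitting logic is sound;
(2) Judging by Gini impurity, the values at the bottom of the tree are very close to 0, so the leaves are nearly pure and the features separate the samples well. Whether adjusting the sample counts and the tree depth can improve the model further is examined next.

2. Model accuracy: 98.13%. Can it be made more accurate?
[Figure: accuracy before tuning]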
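
For reference, the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2): it is 0 for a node that holds a single class and 0.5 for an even two-class mix. A small sketch with made-up labels shows why values close to 0 at the leaves indicate nearly pure nodes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini(['buy'] * 50)                          # one class only
even = gini(['buy'] * 25 + ['not buy'] * 25)       # 50/50 mix
nearly_pure = gini(['buy'] * 49 + ['not buy'])     # 49 vs 1

print(pure)                    # 0.0
print(even)                    # 0.5
print(round(nearly_pure, 4))   # 0.0392
```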

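Tuning methods:
1. Balancing the number of samples;
2. Pre- and post-pruning (non-key fields were removed and the tree depth was adjusted, then the results were compared).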
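
A comparison of this kind can be sketched with scikit-learn as follows. This is not the author's tuning code: the data is synthetic (`make_classification` stands in for `X_trans` and `y`), and the candidate settings are illustrative assumptions. `max_depth` and `min_samples_leaf` act as pre-pruning, while `ccp_alpha` applies cost-complexity post-pruning.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; with the article's data this would be X_trans and y.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# Candidate settings to compare (illustrative values only).
candidates = {
    'max_depth=2': DecisionTreeClassifier(max_depth=2, random_state=0),
    'max_depth=4': DecisionTreeClassifier(max_depth=4, random_state=0),
    'max_depth=8': DecisionTreeClassifier(max_depth=8, random_state=0),
    'max_depth=4, min_samples_leaf=20': DecisionTreeClassifier(
        max_depth=4, min_samples_leaf=20, random_state=0),
    'ccp_alpha=0.01 (post-pruning)': DecisionTreeClassifier(
        ccp_alpha=0.01, random_state=0),
}

# 5-fold cross-validated accuracy for each setting.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f'{name}: {scores[name]:.3f}')
```

Cross-validation is used here instead of a single hold-out split so that small accuracy differences between settings are less dependent on one particular split.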

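After tuning:
1. Decision tree model:
(1) Logic: sound;
(2) Gini impurity: again very close to 0, so the features still separate the samples well.
[Figure: decision tree after tuning]

2. Accuracy: 98.71%, an improvement of 0.59% (relative change, not the difference in percentage points); the tuning worked reasonably well.
[Figure: accuracy after tuning]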
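
The distinction between the relative change and the difference in percentage points can be checked with the two accuracy figures reported above:

```python
before = 98.13  # accuracy before tuning, in percent
after = 98.71   # accuracy after tuning, in percent

absolute_gain = after - before                    # percentage points
relative_gain = (after - before) / before * 100   # percent of the original accuracy

print(round(absolute_gain, 2))  # 0.58
print(round(relative_gain, 2))  # 0.59
```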

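IV. Business Implications of the Model
1. Strategy: use the model to judge which kinds of listings customers favor and how to clear inventory fastest, avoiding blind acquisition of listings that then pile up unsold;
2. Sales and promotion: recommend listings to each customer group by following the model's branches level by level, meeting customer needs quickly;
3. Leaner staffing: faster decisions and sales reduce labor costs;
4. Business direction: the model is built on current data. Could the untapped customer market behave opposite to today's model and lead to wrong decisions later? This deserves reflection.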

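V. Reflections
1. Fitting:

Underfitting:
Pruning too aggressively, whether before or after training, keeps the branches from growing out and risks underfitting;

Overfitting:
(1) Sample size: how to screen the data sources, and which fields are the truly key ones;
(2) Pruning: whether to pre-prune or post-prune.

2. Complex cases:
For XOR-like relationships, is it necessary to switch to a neural network model?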
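
As one data point for this question, and without settling it, the plain-Python sketch below looks at the four-row XOR truth table. It shows that a single split on either input yields zero Gini gain, which is what makes XOR awkward for greedy tree building, while a hand-written two-level tree still classifies all four rows correctly. The function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# XOR truth table: the label is 1 when exactly one input is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def gini_gain(rows, feature):
    """Impurity decrease from splitting the rows on one binary feature."""
    left = [y for x, y in rows if x[feature] == 0]
    right = [y for x, y in rows if x[feature] == 1]
    parent = gini([y for _, y in rows])
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
    return parent - weighted


gains = [gini_gain(points, f) for f in (0, 1)]
print(gains)  # [0.0, 0.0]: neither first split looks useful on its own


def two_level_tree(x):
    """Hand-written depth-2 tree: split on x[0], then on x[1]."""
    if x[0] == 0:
        return 1 if x[1] == 1 else 0
    return 0 if x[1] == 1 else 1


all_correct = all(two_level_tree(x) == y for x, y in points)
print(all_correct)  # True
```

So a tree can represent XOR once it is two levels deep; whether a different model family is worth it depends on how many such interactions the real data contains.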

Source: CSDN

Author: MinotGeogry

Link: https://blog.csdn.net/weixin_43356605/article/details/83549609
