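Univariate analysis
We start with the basic information about the platform's customers, including location, credit standing, and the stated reason for the loan, to identify the general characteristics that target customers tend to share:
- Geographic distribution: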
library(ggplot2)
# Drop rows with an empty state code, then count borrowers per state
ggplot(data = subset(data, BorrowerState != ""),
       aes(x = BorrowerState)) + geom_bar(fill = "pink", color = "black") +
  theme(axis.text = element_text(size = 5))
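The company's customers are concentrated in California, New York, Florida, Texas, and Illinois, which lead all other states; the platform could step up marketing in the remaining states to win new customers. Prosper is headquartered in San Francisco, which may also help explain why California has the most users.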
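The ranking read off the bar chart can be double-checked with a frequency table. The following is a minimal sketch that assumes the same `BorrowerState` column; the small `toy` data frame is made up so the snippet runs on its own, and `top_states` is a hypothetical helper name.

```r
# Toy stand-in for the loan data; the real analysis would pass `data` instead.
toy <- data.frame(
  BorrowerState = c("CA", "CA", "CA", "NY", "NY", "TX", "", "FL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count borrowers per state, most frequent first.
top_states <- function(df, n = 5) {
  states <- df$BorrowerState[!is.na(df$BorrowerState) & df$BorrowerState != ""]
  counts <- sort(table(states), decreasing = TRUE)
  head(counts, n)
}

top_states(toy, n = 2)
```

Calling `top_states(data)` on the real data frame would list the five states with the most borrowers, together with their counts.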
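- Delinquency count (last 7 years):
# DelinquenciesLast7Years is numeric, so missing values are filtered with is.na()
ggplot(data = subset(data, !is.na(DelinquenciesLast7Years)),
       aes(x = DelinquenciesLast7Years)) + geom_bar(fill = "orange", color = "black") +
  theme(axis.text = element_text(size = 5)) + scale_x_continuous(limits = c(-1, 50))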
- Customer employment status:
ggplot(aes(EmploymentStatus),data = subset(data,!(data$EmploymentStatus==""))) +
geom_bar(color="black",fill=I("#B2DFEE"),width = 0.5) +
theme(axis.text.x=element_text(angle = 90,hjust = 1,vjust=0,size=8))
Most of the platform's customers are recorded as Employed or Full-time, that is, they hold a job and have a stable income.
- Number of credit inquiries:
# Note: despite its name, bar_plot() draws a histogram of the column named by varname
bar_plot <- function(varname, binwidth) {
  ggplot(data, aes(x = .data[[varname]])) +
    geom_histogram(binwidth = binwidth) + xlab(varname)
}
bar_plot('InquiriesLast6Months', 1) +
  coord_cartesian(xlim = c(0, quantile(data$InquiriesLast6Months, probs = 0.95,
                                       na.rm = TRUE))) +
  geom_vline(xintercept = quantile(data$InquiriesLast6Months,
                                   probs = 0.95, na.rm = TRUE),
             linetype = "dashed", color = "red") +
  theme(panel.background = element_rect(fill = "white"))
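The number of credit inquiries reflects how many times a borrower has recently applied for credit; the more inquiries, the tighter the borrower's finances are likely to be. The plot shows that 95% of customers had fewer than 5 inquiries in the last six months.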
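The percentile marked by the dashed line, and the share of borrowers under a given threshold, can also be computed directly rather than read off the plot. This is a small sketch on made-up inquiry counts; the vector `inquiries` stands in for `data$InquiriesLast6Months`.

```r
# Toy inquiry counts (made up for illustration); NA marks a missing value.
inquiries <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 12, NA)

# 95th percentile, the value the dashed red line marks in the plot.
p95 <- quantile(inquiries, probs = 0.95, na.rm = TRUE)

# Share of borrowers with fewer than 5 inquiries (missing values excluded).
share_below_5 <- mean(inquiries < 5, na.rm = TRUE)

p95
share_below_5
```

On the real column, comparing `share_below_5` with 0.95 shows how closely the "fewer than 5" reading matches the data.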
- Customer debt-to-income ratio:
bar_plot('DebtToIncomeRatio', 0.04) +
  coord_cartesian(xlim = c(0, quantile(data$DebtToIncomeRatio, probs = 0.95,
                                       na.rm = TRUE))) +
  geom_vline(xintercept = quantile(data$DebtToIncomeRatio,
                                   probs = 0.95, na.rm = TRUE),
             linetype = "dashed", color = "red") +
  theme(panel.background = element_rect(fill = "white"))
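The higher a borrower's debt-to-income ratio, the lower their capacity to repay. For 95% of the platform's customers the ratio is below 0.5, so overall the customers' debt-to-income ratios are fairly low.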
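To make the meaning of the ratio concrete, here is a minimal sketch that computes a debt-to-income ratio as monthly debt payments divided by monthly income. This is the common textbook definition and is only assumed to match how `DebtToIncomeRatio` was calculated in the data set; `debt_to_income` and the figures below are hypothetical.

```r
# Hypothetical helper: monthly debt payments divided by monthly income.
# Returns NA where income is missing or not positive, since the ratio
# is undefined there.
debt_to_income <- function(monthly_debt, monthly_income) {
  ifelse(!is.na(monthly_income) & monthly_income > 0,
         monthly_debt / monthly_income,
         NA_real_)
}

# Toy borrowers (made-up figures).
debt   <- c(500, 1200, 300, 900)
income <- c(4000, 3000, 0, 4500)

dti <- debt_to_income(debt, income)
dti < 0.5  # which borrowers fall under the 0.5 mark
```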
- Customer monthly income:
bar_plot('StatedMonthlyIncome',425)+
scale_x_continuous(limits = (c(0,15000)),breaks = seq(0,15000,500))+
geom_vline(xintercept = 5000, linetype = "dashed", color = "red")+
geom_vline(xintercept = 3000, linetype = "dashed", color = "red")+
theme(panel.background =element_rect(fill="white"))+
theme(axis.text.x=element_text(angle = 90,hjust = 1,vjust=0,size=8))
Most borrowers report a monthly income between USD 3,000 and USD 5,000.
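One way to quantify this reading is to bin incomes into bands and look at the share in each. The sketch below uses `cut()` on made-up incomes; `income` stands in for `data$StatedMonthlyIncome`, and the band labels are chosen here for illustration.

```r
# Toy monthly incomes in USD (made up for illustration).
income <- c(1500, 3200, 3800, 4100, 4900, 5200, 8000, 2500)

# Bands are closed on the left: [0, 3000), [3000, 5000), [5000, Inf).
bands <- cut(income,
             breaks = c(0, 3000, 5000, Inf),
             labels = c("under 3000", "3000 to 4999", "5000 and over"),
             right = FALSE)

band_counts <- table(bands)
band_share  <- prop.table(band_counts)

band_counts
band_share
```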
- Reason for the loan:
ggplot(data,aes(x=ListingCategory..numeric.))+
geom_bar(color="black",fill=I("#70DBDB"))+scale_x_continuous(breaks = c(0:20))+scale_y_sqrt()
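This analysis shows that loan purposes are concentrated in categories 1, 0, and 7. The data used here does not say what these codes mean, so the specific loan purposes remain unclear; they can be looked up in the full documentation.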
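Once the code meanings are known, they can be attached with a simple lookup. The sketch below assumes the labels from the variable definitions commonly distributed with the Prosper loan data (0 = Not Available, 1 = Debt Consolidation, 7 = Other); these labels are an assumption and should be verified against the official documentation before use.

```r
# Assumed labels for the three most frequent codes; verify against the
# official variable definitions before relying on them.
category_labels <- c("0" = "Not Available",
                     "1" = "Debt Consolidation",
                     "7" = "Other")

# Toy category codes (made up for illustration).
codes <- c(1, 1, 0, 7, 1, 7, 2)

# Look up each code by name; codes without an assumed label become NA.
labels <- unname(category_labels[as.character(codes)])

table(labels, useNA = "ifany")
```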
- Platform users' credit standing (rating / score):
library(gridExtra)
# Convert the three rating columns to ordered factors, best (AA) to worst (HR)
data$creditlevel <- factor(data$creditlevel, ordered = TRUE, levels = c("AA","A","B","C","D","E","HR"))
data$CreditGrade <- factor(data$CreditGrade, ordered = TRUE, levels = c("AA","A","B","C","D","E","HR"))
data$ProsperRating..Alpha. <- factor(data$ProsperRating..Alpha., ordered = TRUE,
                                     levels = c("AA","A","B","C","D","E","HR"))
p1 <- ggplot(data,aes(x=creditscore))+
geom_histogram(binwidth=20,color="black",fill=I("#DBDB70"))+
scale_x_continuous(limits = c(400,900))
p2 <- ggplot(data=subset(data,data$CreditGrade!=""& data$CreditGrade!="NC"),aes(x=CreditGrade))+
geom_bar(color="black",fill=I("#7093DB"))+
xlab("creditlevel(pre2009)")
p3 <- ggplot(data=subset(data,data$ProsperRating..Alpha.!=""),
aes(x=ProsperRating..Alpha.))+
geom_bar(color="black",fill=I("#E9C2A6"))+
xlab("creditlevel(after2009)")
p4 <- ggplot(data=subset(data,!is.na(data$creditlevel)),aes(x=creditlevel))+
geom_bar(color="black",fill=I("#EAADEA"))
grid.arrange(p1,p2,p3,p4,ncol = 1)
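Plotting customers' credit ratings and scores shows that they are roughly normally distributed: credit scores are concentrated between 650 and 750, and ratings are concentrated in B, C, and D. After 2009, the AA and A users at the top and the E and HR users at the tail are separated more clearly.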
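The concentration in the middle ratings can be checked numerically once the ratings are stored as an ordered factor. This is a minimal sketch on made-up ratings; `toy_ratings` stands in for a column such as `data$creditlevel`.

```r
# Rating scale from best to worst, as used in the plots above.
rating_levels <- c("AA", "A", "B", "C", "D", "E", "HR")

# Toy ratings (made up for illustration).
toy_ratings <- c("C", "B", "HR", "AA", "C", "D", "B", "C")
ratings <- factor(toy_ratings, levels = rating_levels, ordered = TRUE)

# Counts are reported in rating order, including empty levels.
rating_counts <- table(ratings)

# Share of customers rated B, C, or D.
middle_share <- mean(ratings %in% c("B", "C", "D"))

rating_counts
middle_share
```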
Source: CSDN
Author: 孔胖
Link: https://blog.csdn.net/xiuxiuxiu666/article/details/104246663