This was the final project for my second-year practical training course, and I figured I'd scrape job listings. I originally meant to use Python, but decided to try Java instead: I'd only been teaching myself Java for about a month and wanted to exercise object-oriented thinking and so on.
After looking around online, sites like Lagou generate their pages dynamically, while 51job serves static pages, which is much more convenient, so I settled on scraping 51job.
I used https://blog.csdn.net/qq_42982169/article/details/83155040 as a reference, changed a lot of it to make the code modular, and added a save feature.
Prerequisites:
Create a Maven project to make dependency management easier.
Use httpclient 3.1 and jsoup 1.8.3 for fetching pages and extracting the information; both versions are widely used.
mysql-connector-java 8.0.13 to load the data into the database; it supports MySQL 8.0+.
tablesaw for the analysis (optional; anything you can work with is fine).
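For reference, the Maven coordinates for these dependencies might look like the snippet below (the tablesaw artifact and version are my assumption; the others match the versions named above):

<dependencies>
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.13</version>
    </dependency>
    <!-- assumption: tablesaw-core is the artifact used below; pick a current version -->
    <dependency>
        <groupId>tech.tablesaw</groupId>
        <artifactId>tablesaw-core</artifactId>
        <version>0.32.6</version>
    </dependency>
</dependencies>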
The example below uses the search "big data + Shanghai" via the following URL; any URL of the same form will work:
https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
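Note that the keyword segment of the path (%25E5%25A4%25A7...) is URL-encoded twice: %25 is itself the encoding of %. So a similar URL for any other keyword can be built roughly like this (a sketch; 020000 appears to be the city code for Shanghai, and the long query string above can be appended unchanged):

import java.net.URLEncoder;

public class SearchURLDemo {
    public static void main(String[] args) throws Exception {
        String keyword = "大数据";
        // 51job double-encodes the keyword in the path
        String encoded = URLEncoder.encode(URLEncoder.encode(keyword, "UTF-8"), "UTF-8");
        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,"
                + encoded + ",2,1.html";
        System.out.println(strURL);
    }
}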
I sketched the overall design first and went through several revisions; this layout ended up the clearest, with the JobBean container acting as the medium between all the features.
First, crawling the pages and saving the results locally.
Create the JobBean class:
public class JobBean {
    private String jobName;
    private String company;
    private String address;
    private String salary;
    private String date;
    private String jobURL;

    public JobBean(String jobName, String company, String address, String salary, String date, String jobURL) {
        this.jobName = jobName;
        this.company = company;
        this.address = address;
        this.salary = salary;
        this.date = date;
        this.jobURL = jobURL;
    }

    @Override
    public String toString() {
        return "jobName=" + jobName + ", company=" + company + ", address=" + address + ", salary=" + salary
                + ", date=" + date + ", jobURL=" + jobURL;
    }

    public String getJobName() {
        return jobName;
    }
    public void setJobName(String jobName) {
        this.jobName = jobName;
    }
    public String getCompany() {
        return company;
    }
    public void setCompany(String company) {
        this.company = company;
    }
    public String getAddress() {
        return address;
    }
    public void setAddress(String address) {
        this.address = address;
    }
    public String getSalary() {
        return salary;
    }
    public void setSalary(String salary) {
        this.salary = salary;
    }
    public String getDate() {
        return date;
    }
    public void setDate(String date) {
        this.date = date;
    }
    public String getJobURL() {
        return jobURL;
    }
    public void setJobURL(String jobURL) {
        this.jobURL = jobURL;
    }
}
Next, a utility class for saving the container, so the container can be saved at any stage:
import java.io.*;
import java.util.*;

/**
 * 1. Save a JobBean container to a local file.
 * 2. Load a JobBean container back from the local file (with filtering).
 * @author PowerZZJ
 */
public class JobBeanUtils {
    /** Save a single JobBean to the local file.
     * @param job
     */
    public static void saveJobBean(JobBean job) {
        try(BufferedWriter bw =
                new BufferedWriter(
                    new FileWriter("JobInfo.txt", true))){
            String jobInfo = job.toString();
            bw.write(jobInfo);
            bw.newLine();
            bw.flush();
        }catch(Exception e) {
            System.out.println("Failed to save JobBean");
            e.printStackTrace();
        }
    }

    /** Save a whole JobBean container to the local file.
     * @param jobBeanList the JobBean container
     */
    public static void saveJobBeanList(List<JobBean> jobBeanList) {
        System.out.println("Backing up the container to a local file");
        for(JobBean jobBean : jobBeanList) {
            saveJobBean(jobBean);
        }
        System.out.println("Backup finished, " + jobBeanList.size() + " records in total");
    }

    /** Load the local file back into a JobBean container (with filtering).
     * @return the JobBean container
     */
    public static List<JobBean> loadJobBeanList(){
        List<JobBean> jobBeanList = new ArrayList<>();
        try(BufferedReader br =
                new BufferedReader(
                    new FileReader("JobInfo.txt"))){
            String str = null;
            while((str = br.readLine()) != null) {
                // Some company names contain "," themselves, which breaks the
                // format; those lines are simply skipped by the catch below.
                try {
                    String[] datas = str.split(",");
                    // substring strips the field prefix, e.g. "jobName=" is 8 chars
                    String jobName = datas[0].substring(8);
                    String company = datas[1].substring(9);
                    String address = datas[2].substring(9);
                    String salary = datas[3].substring(8);
                    String date = datas[4].substring(6);
                    String jobURL = datas[5].substring(8);
                    // Only build a JobBean if no field is empty, the salary is a
                    // range, and the URL starts with http
                    if (jobName.equals("") || company.equals("") || address.equals("") || salary.equals("")
                            || !(salary.contains("-")) || date.equals("") || !(jobURL.startsWith("http")))
                        continue;
                    JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobURL);
                    // add to the container
                    jobBeanList.add(jobBean);
                }catch(Exception e) {
                    System.out.println("Local load filter: skipping bad line: " + str);
                    continue;
                }
            }
            System.out.println("Load finished, " + jobBeanList.size() + " records read");
            return jobBeanList;
        }catch(Exception e) {
            System.out.println("Failed to load JobBeans");
            e.printStackTrace();
        }
        return jobBeanList;
    }
}
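For reference, each saved line is just JobBean.toString(), which is why the substring offsets above line up with the field prefixes; a line looks like this (the values here are made up):

jobName=数据分析师, company=某某科技有限公司, address=上海-浦东新区, salary=1-1.5万/月, date=05-21, jobURL=https://jobs.51job.com/shanghai/123456789.html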
Now for the crawling itself, the key part.
Each listing sits in an element with class "el", and the first .el on the page is the header row, which will need to be removed.
Each listing contains t1 through t5 elements, so the fields can be taken out in order.
Then inspect the "next page" element: it sits under a .bk element, and be careful here, there are two .bk elements. The first .bk is "previous page"; only the second one is "next page".
I crawled myself into an infinite loop over that one....
Finally, a spider() method pulls together scraping a page and stepping to the next one.
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/** Crawls the job listing pages.
 * @author PowerZZJ
 */
public class Spider {
    // track which page we're on
    private static int pageCount = 1;
    private String strURL;
    private String nextPageURL;
    private Document document;  // the whole page
    private List<JobBean> jobBeanList;

    public Spider(String strURL) {
        this.strURL = strURL;
        nextPageURL = strURL;  // start the iteration from the current page
        jobBeanList = new ArrayList<JobBean>();
    }

    /** Fetch the whole page.
     * @param strURL the page URL
     * @return the parsed page
     */
    public Document getDom(String strURL) {
        try {
            URL url = new URL(strURL);
            // parse with a 4-second timeout
            document = Jsoup.parse(url, 4000);
            return document;
        }catch(Exception e) {
            System.out.println("getDom failed");
            e.printStackTrace();
        }
        return null;
    }

    /** Extract the current page's listings into JobBeans and add them to the container.
     * @param document the whole page
     */
    public void getPageInfo(Document document) {
        // the CSS selector #resultList .el picks out the listing rows
        Elements elements = document.select("#resultList .el");
        // the first .el is the header row, drop it
        elements.remove(0);
        // pull out the fields
        for(Element element : elements) {
            Elements elementsSpan = element.select("span");
            String jobURL = elementsSpan.select("a").attr("href");
            String jobName = elementsSpan.get(0).select("a").attr("title");
            String company = elementsSpan.get(1).select("a").attr("title");
            String address = elementsSpan.get(2).text();
            String salary = elementsSpan.get(3).text();
            String date = elementsSpan.get(4).text();
            // build the JobBean
            JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobURL);
            // add to the container
            jobBeanList.add(jobBean);
        }
    }

    /** Find the URL of the next page.
     * @param document the whole page
     * @return the next page's URL, or "" when there is none
     */
    public String getNextPageURL(Document document) {
        try {
            Elements elements = document.select(".bk");
            // the second .bk is "next page"
            Element element = elements.get(1);
            // attr() returns "" (not null) when the link is missing,
            // which is what ends the loop in spider() on the last page
            nextPageURL = element.select("a").attr("href");
            if(!nextPageURL.equals("")) {
                System.out.println("---------" + (pageCount++) + "--------");
            }
            return nextPageURL;
        }catch(Exception e) {
            System.out.println("Failed to get the next page's URL");
            e.printStackTrace();
        }
        // return "" rather than null so the while condition in spider() can't NPE
        return "";
    }

    /** Run the crawl. */
    public void spider() {
        while(!nextPageURL.equals("")) {
            // fetch the whole page
            document = getDom(nextPageURL);
            // extract the listings into the container
            getPageInfo(document);
            // find the next page's URL
            nextPageURL = getNextPageURL(document);
        }
    }

    // expose the JobBean container
    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}
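A side note: Jsoup.parse(URL, timeout) works, but Jsoup.connect is the more common fetching idiom and lets you set headers such as User-Agent, which some sites require. A drop-in sketch for getDom under that assumption:

public Document getDom(String strURL) {
    try {
        // connect() allows request headers; keep the 4-second timeout
        document = Jsoup.connect(strURL)
                .userAgent("Mozilla/5.0")
                .timeout(4000)
                .get();
        return document;
    } catch (Exception e) {
        System.out.println("getDom failed");
        e.printStackTrace();
    }
    return null;
}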
Then test crawling and saving:
import java.util.ArrayList;
import java.util.List;

public class Test1 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // big data + Shanghai
        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // test Spider and saving
        Spider spider = new Spider(strURL);
        spider.spider();
        // the JobBean container after the crawl
        jobBeanList = spider.getJobBeanList();
        // use the JobBean utility class to save the container locally
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // use the JobBean utility class to load it back (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
    }
}
That produces JobInfo.txt locally.
Next, load the JobBean container into MySQL. My database is named 51job and the table is named jobInfo; every column is a string... strings will do for now.
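The schema isn't shown here; a minimal table matching the insert order used below might look like this (column names and lengths are my assumption):

CREATE TABLE jobInfo (
    jobName VARCHAR(100),
    company VARCHAR(100),
    address VARCHAR(50),
    salary  VARCHAR(50),
    date    VARCHAR(20),
    jobURL  VARCHAR(200)
);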
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectMySQL {
    // database connection settings
    private static final String DBaddress = "jdbc:mysql://localhost/51job?serverTimezone=UTC";
    private static final String userName = "root";
    private static final String password = "Woshishabi2813";
    private Connection conn;

    // load the driver and connect to the database
    public ConnectMySQL() {
        loadDriver();
        // connect
        try {
            conn = DriverManager.getConnection(DBaddress, userName, password);
        } catch (SQLException e) {
            System.out.println("Database connection failed");
        }
    }

    // load the driver (optional with JDBC 4+ drivers, which register themselves)
    private void loadDriver() {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            System.out.println("Driver loaded");
        } catch (Exception e) {
            System.out.println("Failed to load the driver");
        }
    }

    // expose the connection
    public Connection getConn() {
        return conn;
    }
}
Next comes the utility class for the database operations.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class DBUtils {
    /** Store a JobBean container in the database (with filtering).
     * @param conn the database connection
     * @param jobBeanList the JobBean container
     */
    public static void insert(Connection conn, List<JobBean> jobBeanList) {
        System.out.println("Inserting data");
        PreparedStatement ps;
        for(JobBean j : jobBeanList) {
            // build the statement; rows whose fields break the SQL (e.g. contain
            // a quote) fail on executeUpdate and are skipped by the catch
            String command = String.format("insert into jobInfo values('%s','%s','%s','%s','%s','%s')",
                    j.getJobName(),
                    j.getCompany(),
                    j.getAddress(),
                    j.getSalary(),
                    j.getDate(),
                    j.getJobURL());
            try {
                ps = conn.prepareStatement(command);
                ps.executeUpdate();
            } catch (Exception e) {
                System.out.println("Insert filter: skipping bad row: " + j.getJobName());
            }
        }
        System.out.println("Insert finished");
    }

    /** Read the JobBean container back out of the database.
     * @param conn the database connection
     * @return the JobBean container
     */
    public static List<JobBean> select(Connection conn){
        PreparedStatement ps;
        ResultSet rs;
        List<JobBean> jobBeanList = new ArrayList<JobBean>();
        String command = "select * from jobInfo";
        try {
            ps = conn.prepareStatement(command);
            rs = ps.executeQuery();
            while(rs.next()) {
                JobBean jobBean = new JobBean(rs.getString(1),
                        rs.getString(2),
                        rs.getString(3),
                        rs.getString(4),
                        rs.getString(5),
                        rs.getString(6));
                jobBeanList.add(jobBean);
            }
            return jobBeanList;
        } catch (Exception e) {
            System.out.println("Database query failed");
        }
        return null;
    }
}
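One caveat with building the SQL via String.format: any field containing a single quote breaks the statement (and it's unsafe in general); the catch above just skips such rows. A parameterized variant, as a sketch against the same table, avoids that entirely:

public static void insertSafe(Connection conn, List<JobBean> jobBeanList) {
    String command = "insert into jobInfo values(?,?,?,?,?,?)";
    try (PreparedStatement ps = conn.prepareStatement(command)) {
        for (JobBean j : jobBeanList) {
            ps.setString(1, j.getJobName());
            ps.setString(2, j.getCompany());
            ps.setString(3, j.getAddress());
            ps.setString(4, j.getSalary());
            ps.setString(5, j.getDate());
            ps.setString(6, j.getJobURL());
            ps.executeUpdate();  // quotes in fields are handled by the driver
        }
    } catch (Exception e) {
        System.out.println("Insert failed");
        e.printStackTrace();
    }
}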
Then test it:
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test2 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        jobBeanList = JobBeanUtils.loadJobBeanList();
        // database test
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        // insert test
        DBUtils.insert(conn, jobBeanList);
        // select test
        jobBeanList = DBUtils.select(conn);
        for(JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
As the screenshot above shows, even though the search was "big data + Shanghai", unrelated listings such as ops engineer still slip in; they get filtered out later. For now everything goes into the database.
First, an end-to-end test of all the features: delete JobInfo.txt and recreate the database.
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class TestMain {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // big data + Shanghai
        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
//        // Java + Shanghai
//        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // test of all the features together
        // the crawler
        Spider jobSpider = new Spider(strURL);
        jobSpider.spider();
        // the JobBean container after the crawl
        jobBeanList = jobSpider.getJobBeanList();
        // use the JobBean utility class to save the container locally
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // use the JobBean utility class to load it back (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
        // connect to the database and get the connection
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        // use the database utility class to store the container
        DBUtils.insert(conn, jobBeanList);
//        // use the database utility class to read it back into a container
//        jobBeanList = DBUtils.select(conn);
//
//        for(JobBean j : jobBeanList) {
//            System.out.println(j);
//        }
    }
}
All of these pieces can be used on their own; you don't have to run through them in exactly this order.
Next: read from the database, apply some simple filtering, and then analyze.
Here's the mind map first.
First up, filtering by keyword and by date.
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class BaseFilter {
    private List<JobBean> jobBeanList;
    // You can't remove from a list while iterating it with for-each
    // (it fails fast with ConcurrentModificationException), so collect
    // the elements to delete and removeAll them afterwards.
    private List<JobBean> removeList;

    public BaseFilter(List<JobBean> jobBeanList) {
        removeList = new ArrayList<JobBean>();
        // alias the caller's list, so calling getJobBeanList() is optional
        this.jobBeanList = jobBeanList;
        printNum();
    }

    // print how many records are in the container
    public void printNum() {
        System.out.println("Currently " + jobBeanList.size() + " records");
    }

    /** Filter by job title.
     * @param containJobName keep titles containing this keyword
     */
    public void filterJobName(String containJobName) {
        for(JobBean j : jobBeanList) {
            if(!j.getJobName().contains(containJobName)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    /** Filter by date: keep only listings posted today. */
    public void filterDate() {
        Calendar now = Calendar.getInstance();
        int nowMonth = now.get(Calendar.MONTH) + 1;
        int nowDay = now.get(Calendar.DATE);
        for(JobBean j : jobBeanList) {
            // dates look like "05-21" (month-day)
            String[] date = j.getDate().split("-");
            int jobMonth = Integer.valueOf(date[0]);
            int jobDay = Integer.valueOf(date[1]);
            if(!(jobMonth == nowMonth && jobDay == nowDay)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    public List<JobBean> getJobBeanList(){
        return jobBeanList;
    }
}
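Since Java 8, the collect-then-removeAll pattern above can also be written with removeIf, which does the safe removal internally; a minimal sketch of filterJobName in that style:

public void filterJobName(String containJobName) {
    jobBeanList.removeIf(j -> !j.getJobName().contains(containJobName));
    printNum();
}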
Test the filtering:
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test3 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // read the JobBean container from the database
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        BaseFilter bf = new BaseFilter(jobBeanList);
        // filter by date
        bf.filterDate();
        // filter by keyword
        bf.filterJobName("数据");
        bf.filterJobName("分析");
        for(JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
Everything up to this point is generic. The analysis that follows depends on the job type and on what you want to know, although the approach is much the same.
What I'm analyzing here is "big data + Shanghai". To keep the sample a bit larger, any title containing "数据" (data) counts, which gives 247 records.
I used the tablesaw package, which I'd seen recommended, but whenever I ran into a problem I could find almost nothing about it on Baidu, only the official docs, which I read over and over. On top of that it can't draw charts by itself, it needs extra dependencies, so I'll settle for tables.... I've given up on the visualization side (why didn't I just use Python...).
There's not much object-oriented code left to write for the analysis; it's basically one long main method. See the official docs for the exact API; just skim the results to get the idea.
Salaries are normalized to 万/月 (ten thousand RMB per month).
import static tech.tablesaw.aggregate.AggregateFunctions.*;
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.*;

public class Analayze {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        BaseFilter bf = new BaseFilter(jobBeanList);
        bf.filterDate();
        bf.filterJobName("数据");
        int nums = jobBeanList.size();
        // analysis: sort by salary
        String[] jobNames = new String[nums];
        String[] companys = new String[nums];
        String[] addresss = new String[nums];
        double[] salarys = new double[nums];
        String[] jobURLs = new String[nums];
        for(int i = 0; i < nums; i++) {
            JobBean j = jobBeanList.get(i);
            String jobName = j.getJobName();
            String company = j.getCompany();
            // pull the district name out of the address
            String address;
            if(j.getAddress().contains("-")) {
                address = j.getAddress().split("-")[1];
            }else{
                address = j.getAddress();
            }
            // normalize the salary unit (lower bound of the range, in 万/月)
            String sSalary = j.getSalary();
            double dSalary;
            if(sSalary.contains("万/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]);
            }else if(sSalary.contains("千/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 10;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            }else if(sSalary.contains("万/年")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 12;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            }else {
                dSalary = 0;
                System.out.println("Salary conversion failed");
                continue;  // note: this row's array slots stay null/0
            }
            String jobURL = j.getJobURL();
            jobNames[i] = jobName;
            companys[i] = company;
            addresss[i] = address;
            salarys[i] = dSalary;
            jobURLs[i] = jobURL;
        }
        Table jobInfo = Table.create("Job Info")
                .addColumns(
                        StringColumn.create("jobName", jobNames),
                        StringColumn.create("company", companys),
                        StringColumn.create("address", addresss),
                        DoubleColumn.create("salary", salarys),
                        StringColumn.create("jobURL", jobURLs)
                );
//        System.out.println("All of Shanghai");
//        System.out.println(salaryInfo(jobInfo));
        List<Table> addressJobInfo = new ArrayList<>();
        // split by district
        Table ShanghaiJobInfo = chooseByAddress(jobInfo, "上海");
        Table jingAnJobInfo = chooseByAddress(jobInfo, "静安区");
        Table puDongJobInfo = chooseByAddress(jobInfo, "浦东新区");
        Table changNingJobInfo = chooseByAddress(jobInfo, "长宁区");
        Table minHangJobInfo = chooseByAddress(jobInfo, "闵行区");
        Table xuHuiJobInfo = chooseByAddress(jobInfo, "徐汇区");
        // too few listings in these
//        Table songJiangJobInfo = chooseByAddress(jobInfo, "松江区");
//        Table yangPuJobInfo = chooseByAddress(jobInfo, "杨浦区");
//        Table hongKouJobInfo = chooseByAddress(jobInfo, "虹口区");
//        Table OtherInfo = chooseByAddress(jobInfo, "异地招聘");
//        Table puTuoJobInfo = chooseByAddress(jobInfo, "普陀区");
        addressJobInfo.add(jobInfo);
        // listings whose address is just "上海"
        addressJobInfo.add(ShanghaiJobInfo);
        addressJobInfo.add(jingAnJobInfo);
        addressJobInfo.add(puDongJobInfo);
        addressJobInfo.add(changNingJobInfo);
        addressJobInfo.add(minHangJobInfo);
        addressJobInfo.add(xuHuiJobInfo);
//        addressJobInfo.add(songJiangJobInfo);
//        addressJobInfo.add(yangPuJobInfo);
//        addressJobInfo.add(hongKouJobInfo);
//        addressJobInfo.add(puTuoJobInfo);
//        addressJobInfo.add(OtherInfo);
        for(Table t : addressJobInfo) {
            System.out.println(salaryInfo(t));
        }
        for(Table t : addressJobInfo) {
            System.out.println(sortBySalary(t).first(10));
        }
    }

    // salary mean, standard deviation, median, max, min
    public static Table salaryInfo(Table t) {
        return t.summarize("salary", mean, stdDev, median, max, min).apply();
    }

    // sort by salary, descending
    public static Table sortBySalary(Table t) {
        return t.sortDescendingOn("salary");
    }

    // pick the rows for one district
    public static Table chooseByAddress(Table t, String address) {
        Table t2 = Table.create(address)
                .addColumns(
                        StringColumn.create("jobName"),
                        StringColumn.create("company"),
                        StringColumn.create("address"),
                        DoubleColumn.create("salary"),
                        StringColumn.create("jobURL"));
        for(Row r : t) {
            // column 2 is the address column
            if(r.getString(2).equals(address)) {
                t2.addRow(r);
            }
        }
        return t2;
    }
}
The first half of the output is the statistics for each district.
The second half is the top 10 salaries in each district; as you can see, tablesaw's tables are about as ugly as tables get....
The jobURL values can be opened directly in a browser.
Now a test with a different URL.
I'm after a Java development job, so swap the strURL in TestMain for "Java + Shanghai":
https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Delete JobInfo.txt and recreate the database.
Run it; it crawled 270-odd pages. The local JobInfo.txt:
The database:
Then in Analayze change bf.filterJobName("数据") to "Java", add another filter for "开发" (development), and run.
All the information comes out. As for analysis, I'll just read a few things off the table for now....
A future extension would be to follow each jobURL and compile statistics on the job requirements. I haven't done that yet; I'll probably take a crack at it over the summer if I'm still interested.
The database could also be redesigned to split the salary into separate minimum and maximum columns stored as doubles, which would make later analysis easier too.
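As a starting point for that schema change, here's a hedged sketch of splitting a 51job salary string into a (min, max) pair in 万/月, covering the same three formats handled in Analayze (the helper itself is my invention):

// Hypothetical helper: parse "1.5-2万/月", "8-10千/月" or "10-20万/年"
// into a {min, max} pair in units of 万/月; returns null on unknown formats.
public static double[] parseSalaryRange(String salary) {
    double unit;
    if (salary.contains("万/月")) {
        unit = 1.0;
    } else if (salary.contains("千/月")) {
        unit = 0.1;
    } else if (salary.contains("万/年")) {
        unit = 1.0 / 12;
    } else {
        return null;  // unknown format, let the caller skip the row
    }
    String range = salary.substring(0, salary.length() - 3);  // strip e.g. "万/月"
    String[] parts = range.split("-");
    double min = Double.parseDouble(parts[0]) * unit;
    double max = Double.parseDouble(parts[parts.length - 1]) * unit;
    return new double[] { min, max };
}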
Original post: https://www.cnblogs.com/powerzzjcode/p/10922261.html