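Python Chapter 3
I. Course Plan
- Web crawlers (review):
  Fetching pages: request a URL and receive HTML in response
    HttpClient
  Parsing pages:
    Jsoup
  Crawler framework:
    webmagic:
      Downloader: downloads pages
      PageProcessor: business logic for parsing pages
      Pipeline: data persistence
      Scheduler: the URL queue
Plan for this chapter:
Advanced crawler techniques:
1) Timers
2) Using proxies
3) selenium + a headless browser
4) Comprehensive case study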
II. Timers
- Timer
- Quartz: a scheduling framework
  Powerful, but cumbersome to use.
package cn.sgwks.crawler;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
// Enable scheduled tasks
@EnableScheduling
public class ChromeApplication {
    public static void main(String[] args) {
        SpringApplication.run(ChromeApplication.class, args);
    }
}
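- Using timers in Spring
  Spring can integrate Quartz, but the @Scheduled annotation used here is backed by Spring's own TaskScheduler abstraction.
  Using timers in Spring Boot:
  1) @Scheduled
     Add this annotation to each method that should run periodically.
  2) @EnableScheduling
     Add this annotation to the Spring Boot boot class.
  Creating a Spring Boot project:
  1) The project is a Maven project.
  2) The project inherits from spring-boot-starter-parent.
  3) Add the starter dependencies.
  4) Add application.yml (or application.properties).
  5) Write the boot class, which contains the main method.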
package cn.sgwks.crawler.test;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;

@Component
public class SchedulerTest {

    /**
     * fixedDelay: fixed delay (long, milliseconds) between the end of one run and the start of the next.
     * fixedDelayString: same as fixedDelay, but the value is a String.
     * fixedRate: fixed period (long, milliseconds) between the starts of successive runs.
     * fixedRateString: same as fixedRate, but the value is a String.
     */
    @Scheduled(
            //fixedDelay = 1000
            //fixedRate = 3000
            //cron = "0/5 * * * * ?"
            cron = "0,2,5 * * * * ?"
    )
    public void printTime() {
        System.out.println(new Date().toLocaleString());
    }
}
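- Requirement
  Print the current time to the console at regular intervals.
  @Scheduled attributes:
  fixedDelay: fixed delay (long, milliseconds) between the end of one run and the start of the next.
  fixedDelayString: same as fixedDelay, but the value is a String.
  fixedRate: fixed period (long, milliseconds) between the starts of successive runs.
  fixedRateString: same as fixedRate, but the value is a String.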
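To make the difference between fixedDelay and fixedRate concrete, here is a minimal plain-Java sketch (no Spring involved) that computes when each run would start if every run took a fixed amount of time. The class ScheduleModel and its methods are hypothetical, and the model assumes runs never overlap, as with a single-threaded scheduler.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of run start times (in ms), assuming runs never overlap. */
final class ScheduleModel {

    /** fixedDelay: the next run starts `delay` ms after the previous run has finished. */
    static List<Long> fixedDelayStarts(long delay, long taskDuration, int runs) {
        List<Long> starts = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < runs; i++) {
            starts.add(start);
            start = start + taskDuration + delay;
        }
        return starts;
    }

    /** fixedRate: run n is due at n * period, but waits if the previous run is still going. */
    static List<Long> fixedRateStarts(long period, long taskDuration, int runs) {
        List<Long> starts = new ArrayList<>();
        long previousEnd = 0;
        for (int i = 0; i < runs; i++) {
            long start = Math.max((long) i * period, previousEnd);
            starts.add(start);
            previousEnd = start + taskDuration;
        }
        return starts;
    }

    public static void main(String[] args) {
        // A task that takes 400 ms, scheduled with a value of 1000 ms
        System.out.println(fixedDelayStarts(1000, 400, 4)); // [0, 1400, 2800, 4200]
        System.out.println(fixedRateStarts(1000, 400, 4));  // [0, 1000, 2000, 3000]
    }
}
```

With fixedDelay the task duration pushes every later run back, while with fixedRate the starts stay on the period grid as long as each run finishes in time.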
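  For more complex schedules, use a cron expression:
  The value of the cron attribute is a cron expression, which is just a string.
  This annotation does not support a year field, so the expression must have exactly 6 fields (second, minute, hour, day of month, month, day of week).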
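As an illustration of how the first (seconds) field of the two expressions used above is read, the following sketch expands a seconds field into the seconds it matches. It is a simplified, hypothetical helper for study purposes, not Spring's actual cron parser, and it only handles the forms "*", "a/b", "a-b" and comma-separated values.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical, simplified expander for the seconds field of a cron expression. */
final class CronSecondsField {

    /** Returns the seconds (0-59) matched by a field such as "*", "0/5", "0,2,5" or "10-12". */
    static SortedSet<Integer> expand(String field) {
        SortedSet<Integer> seconds = new TreeSet<>();
        for (String part : field.trim().split(",")) {
            if (part.contains("/")) {
                // start/step: begin at start, then every step seconds
                String[] startAndStep = part.split("/");
                int start = startAndStep[0].equals("*") ? 0 : Integer.parseInt(startAndStep[0]);
                int step = Integer.parseInt(startAndStep[1]);
                if (step <= 0) {
                    throw new IllegalArgumentException("step must be positive: " + part);
                }
                for (int s = start; s < 60; s += step) {
                    seconds.add(s);
                }
            } else if (part.equals("*")) {
                // every second
                for (int s = 0; s < 60; s++) {
                    seconds.add(s);
                }
            } else if (part.contains("-")) {
                // from-to: an inclusive range
                String[] bounds = part.split("-");
                int from = Integer.parseInt(bounds[0]);
                int to = Integer.parseInt(bounds[1]);
                for (int s = from; s <= to; s++) {
                    seconds.add(s);
                }
            } else {
                // a single value
                seconds.add(Integer.parseInt(part));
            }
        }
        return seconds;
    }

    public static void main(String[] args) {
        System.out.println(expand("0,2,5")); // [0, 2, 5]
        System.out.println(expand("0/5"));   // every 5 seconds starting at second 0
    }
}
```

Read this way, "0/5 * * * * ?" fires every 5 seconds, and "0,2,5 * * * * ?" fires at seconds 0, 2 and 5 of every minute.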
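III. Using Proxies
- Use case
  Make it harder for the server to identify the crawler.
- Obtaining proxy servers
  Free proxy servers can be found online, for example:
  Mimvp proxy (米扑代理)
  https://proxy.mimvp.com/free.php
  Xici free proxy IPs (西刺免费代理IP)
  http://www.xicidaili.com/
- Usage
  To use a proxy with the webmagic framework,
  create a Downloader object and configure the proxy server on it.
  1. Create a PageProcessor object.
  2. Create a Downloader object; HttpClientDownloader can be used.
  3. Configure the proxy server on the Downloader object.
  4. Assemble the crawler with the Spider class.
  5. Run the crawler.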
package cn.sgwks.crawler.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class MyPageProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        String html = page.getHtml().get();
        page.putField("html", html);
    }

    @Override
    public Site getSite() {
        return Site.me();
    }

    public static void main(String[] args) {
        // Create a Downloader component
        HttpClientDownloader downloader = new HttpClientDownloader();
        // Configure the proxy server on the Downloader
        ProxyProvider proxyProvider = SimpleProxyProvider.from(
                new Proxy("182.61.179.157", 8888)
        );
        downloader.setProxyProvider(proxyProvider);
        Spider.create(new MyPageProcessor())
                // Use the custom Downloader component
                .setDownloader(downloader)
                .addUrl("https://www.jd.com/")
                .start();
    }
}
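When several proxies are configured, a common strategy is to hand them out in turn so that requests are spread across them. The sketch below shows that round-robin idea with JDK classes only. RoundRobinProxyPool is a hypothetical class, not part of webmagic, and the second address (127.0.0.1:8080) is a placeholder rather than a real proxy.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin pool of HTTP proxies, built on java.net.Proxy. */
final class RoundRobinProxyPool {
    private final List<Proxy> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinProxyPool(Proxy... proxies) {
        if (proxies.length == 0) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = Arrays.asList(proxies.clone());
    }

    /** Builds an HTTP proxy without resolving the host name up front. */
    static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Returns the next proxy in turn; safe to call from several crawler threads. */
    Proxy next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinProxyPool pool = new RoundRobinProxyPool(
                http("182.61.179.157", 8888),
                http("127.0.0.1", 8080)); // placeholder address
        for (int i = 0; i < 3; i++) {
            System.out.println(pool.next());
        }
    }
}
```

Free proxies tend to be short-lived, so in practice the pool would also need some way to drop entries that stop responding.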
IV. Using selenium + a Headless Browser
- selenium
  A front-end testing framework with bindings for:
  java
  .net
  python
  node.js
  It controls a browser from code.
package cn.sgwks.crawler.test;

import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

@Component
public class ChromeTest {
    public static void main(String[] args) {
        // Create the configuration options
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Headless mode (not required while testing)
        //chromeOptions.addArguments("--headless");
        // Browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (declared as the parent type, i.e. polymorphism)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://www.jd.com/");
        String title = webDriver.getTitle();
        String h1Name = webDriver.findElementByCssSelector("#logo > h1 > a").getText();
        System.out.println(title);  // 京东(JD.COM)-正品低价、品质保障、配送及时、轻松购物!
        System.out.println(h1Name); // 京东
        // Sleep for 2 seconds, then close the browser
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // quit() closes the window and also ends the chromedriver session
        webDriver.quit();
    }
}
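- Headless browsers
  Browsers without a graphical interface.
  phantomjs: a headless browser that is no longer updated and will be phased out (for reference only).
  Headless mode of regular browsers:
  chrome (recommended)
  Firefox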
package cn.sgwks.crawler.test;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ChromeTest2 {
    public static void main(String[] args) {
        // Create the configuration options
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Headless mode (not required while testing)
        //chromeOptions.addArguments("--headless");
        // Browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (declared as the parent type, i.e. polymorphism)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8"
                + "&suggest=1.his.0.0&wq=&pvid=86555b140ee64a70a9cdf1d5c7b836f4");
        // Scroll down so that the lazily loaded second half of the list is requested
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
        // Wait 2 seconds for the newly loaded items to render
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Select the product list
        List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
        System.out.println(list.size());
        // quit() closes the window and also ends the chromedriver session
        webDriver.quit();
    }
}
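- chrome
  1) Install the chrome browser first.
  2) Then install the selenium driver for chrome (chromedriver); it should be placed in the directory where chrome is installed.
  3) Write the code:
  1. Add the jar to the project; the selenium jar is enough.
  2. Create the browser configuration options.
  3. Create a WebDriver object, which represents the browser.
  4. Use the WebDriver object to control the browser.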
package cn.sgwks.crawler.test;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ChromeTest3 {
    public static void main(String[] args) {
        // Create the configuration options
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Headless mode (not required while testing)
        //chromeOptions.addArguments("--headless");
        // Browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (declared as the parent type, i.e. polymorphism)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://www.jd.com/");
        // Type "手机" (mobile phone) into the search box
        webDriver.findElementByCssSelector("#key").sendKeys("手机");
        sleep(2000);
        // Click the search button
        webDriver.findElementByCssSelector("#search > div > div.form > button").click();
        sleep(2000);
        // Scroll down so that the lazily loaded second half of the list is requested
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
        // Wait 2 seconds for the newly loaded items to render
        sleep(2000);
        // Select the product list
        List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
        System.out.println(list.size());
        /*String html = webDriver.getPageSource();
        System.out.println(html);*/
        // quit() closes the window and also ends the chromedriver session
        webDriver.quit();
    }

    // Pause the current thread for the given number of milliseconds
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
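The examples above pause with fixed Thread.sleep calls, which either wait longer than needed or not long enough on a slow connection. A common alternative is to poll for a condition until it holds or a timeout expires. The sketch below shows the idea with JDK classes only; Polling.waitUntil is a hypothetical helper, and Selenium itself ships a WebDriverWait class for the same purpose.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper: wait for a condition instead of sleeping a fixed time. */
final class Polling {

    /**
     * Checks the condition every intervalMillis until it is true or timeoutMillis has elapsed.
     * Returns true if the condition became true, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long begin = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start
        boolean ok = waitUntil(() -> System.currentTimeMillis() - begin >= 300, 2000, 50);
        System.out.println(ok);
    }
}
```

In the JD example the condition could be, for instance, `() -> webDriver.findElementsByCssSelector("li.gl-item").size() >= 60`, so the crawler continues as soon as all items have rendered.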
- Extension: crawl the hero images from the LOL hero list page and download them to local disk
- Utility class HttpsUtils
package cn.sgwks.crawler.test;

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;

    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate; no identity verification is performed
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            sslsf = new SSLConnectionSocketFactory(builder.build(),
                    new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200); // maximum number of pooled connections
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Returns a client that shares the pooled connection manager
    public static CloseableHttpClient getHttpClient() throws Exception {
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        return httpClient;
    }
}
- Crawler class ChromeTestLol
package cn.sgwks.crawler.test;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.io.FileOutputStream;
import java.util.List;

@Component
public class ChromeTestLol {

    public static void chromeShow() {
        // Create the configuration options
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Headless mode (not required while testing)
        chromeOptions.addArguments("--headless");
        // Browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (declared as the parent type, i.e. polymorphism)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://lol.qq.com/data/info-heros.shtml");
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 200)");
        List<WebElement> webElements = webDriver.findElementsByCssSelector("#jSearchHeroDiv > li > a > img");
        System.out.println(webElements.size());
        for (WebElement webElement : webElements) {
            String image = webElement.getAttribute("src");
            String imageName = webElement.getAttribute("alt");
            downPic(image, imageName);
        }
        // quit() closes the window and also ends the chromedriver session
        webDriver.quit();
    }

    public static void downPic(String image, String imageName) {
        try {
            // This client uses the shared connection pool
            CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
            // Default initialization instead:
            //CloseableHttpClient httpClient = HttpClients.createDefault();
            HttpGet get = new HttpGet(image);
            String extName = image.substring(image.lastIndexOf("."));
            String fileName = imageName + extName;
            // try-with-resources closes both the response and the output stream
            try (CloseableHttpResponse response = httpClient.execute(get);
                 FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\lol\\" + fileName)) {
                HttpEntity entity = response.getEntity();
                entity.writeTo(fos);
            }
            // With the shared pool the client is not closed here; otherwise a new one would have to be created each time
            //httpClient.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        chromeShow();
    }
}
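The download code derives the file name from the image's alt text and takes the extension with `image.substring(image.lastIndexOf("."))`. That works for plain URLs but can misbehave when the URL carries a query string or the alt text contains characters that are not allowed in Windows file names. The sketch below is one hedged way to make both steps more robust; ImageFileNames and its methods are hypothetical names, and the example URLs are placeholders.

```java
/** Hypothetical helpers for turning an image URL and a title into a safe file name. */
final class ImageFileNames {

    /** Returns the extension including the dot (e.g. ".jpg"), or "" if the path has none. */
    static String extensionOf(String url) {
        int queryStart = url.indexOf('?');
        String path = queryStart >= 0 ? url.substring(0, queryStart) : url;
        int lastSlash = path.lastIndexOf('/');
        int lastDot = path.lastIndexOf('.');
        // Only a dot after the last slash belongs to the file name
        return lastDot > lastSlash ? path.substring(lastDot) : "";
    }

    /** Removes characters that Windows forbids in file names, plus whitespace, and caps the length. */
    static String safeName(String title, int maxLength) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|\\s]", "");
        return cleaned.length() > maxLength ? cleaned.substring(0, maxLength) : cleaned;
    }

    public static void main(String[] args) {
        String url = "https://example.com/img/Annie.jpg?v=2"; // placeholder URL
        System.out.println(safeName("Annie: the Dark Child", 15) + extensionOf(url));
    }
}
```

Capping the length with a bounds check, rather than an unconditional `substring(0, n)`, avoids a StringIndexOutOfBoundsException for short titles.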
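V. Comprehensive Case Study
- Crawl product data from the JD mall. Each page should yield 60 items.
- Analysis
  1) Visit the search URL.
  2) After the page has loaded, scroll the page so that the last 30 items are loaded.
  3) Extract spu and sku from the list page and save them to the database.
  4) Extract the detail-page URLs from the list page and add them to the queue.
  5) Pagination:
     http://nextpage.com?url=<URL of the previous page>
  6) For a detail page, parse the product information and update the database row identified by sku.
  7) Data persistence:
     A custom pipeline saves the data.
     Springboot + SpringDataJpa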
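The pagination step relies on a marker URL: a made-up address starting with http://nextpage.com that carries the URL of the previous page, so the downloader can recognise it and click "next page" instead of fetching it. The sketch below shows how such a marker can be built and taken apart with JDK classes. NextPageUrls is a hypothetical helper, and URL-encoding the embedded address is an addition made here for safety; the case-study code passes the previous URL through the Request extras instead.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper for the "http://nextpage.com?url=..." marker URLs. */
final class NextPageUrls {
    static final String PREFIX = "http://nextpage.com?url=";

    /** Wraps the previous page's URL in a marker URL. */
    static String wrap(String previousPageUrl) {
        try {
            return PREFIX + URLEncoder.encode(previousPageUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** True if the URL is a pagination marker rather than a real page. */
    static boolean isNextPage(String url) {
        return url.startsWith(PREFIX);
    }

    /** Recovers the previous page's URL from a marker URL. */
    static String unwrap(String markerUrl) {
        if (!isNextPage(markerUrl)) {
            throw new IllegalArgumentException("not a next-page marker: " + markerUrl);
        }
        try {
            return URLDecoder.decode(markerUrl.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String listPage = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8";
        String marker = wrap(listPage);
        System.out.println(marker);
        System.out.println(unwrap(marker).equals(listPage)); // true
    }
}
```

Because each marker embeds a different previous-page URL, every pagination request has a distinct address in the URL queue.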
- Project setup
  A Spring Boot project.
  Add the web and JPA starter dependencies.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.sgwks</groupId>
    <artifactId>crawlerchrome-jd</artifactId>
    <version>1.0-SNAPSHOT</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- WebMagic core -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- WebMagic extension -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <!-- Utilities -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <!-- Spring MVC -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Spring Data JPA -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <!-- Unit testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
        </dependency>
        <!-- MySQL driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!-- selenium -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.13.0</version>
        </dependency>
    </dependencies>
</project>
- Create the entity class and the DAO.
package cn.sgwks.crawlerjd.entity;

import javax.persistence.*;
import java.util.Date;

@Entity
@Table(name = "jd_item")
public class Item {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private Long spu;
    private Long sku;
    private String title;
    private Float price;
    private String pic;
    private String url;
    private Date created;
    private Date updated;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }

    public Long getSpu() { return spu; }
    public void setSpu(Long spu) { this.spu = spu; }

    public Long getSku() { return sku; }
    public void setSku(Long sku) { this.sku = sku; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public Float getPrice() { return price; }
    public void setPrice(Float price) { this.price = price; }

    public String getPic() { return pic; }
    public void setPic(String pic) { this.pic = pic; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public Date getCreated() { return created; }
    public void setCreated(Date created) { this.created = created; }

    public Date getUpdated() { return updated; }
    public void setUpdated(Date updated) { this.updated = updated; }
}
package cn.sgwks.crawlerjd.dao;

import cn.sgwks.crawlerjd.entity.Item;
import org.springframework.data.jpa.repository.JpaRepository;

public interface ItemDao extends JpaRepository<Item, Long> {
    Item findBySku(Long sku);
}
- Write the configuration file and the boot class.
#DB Configuration:
spring:
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler-sgw?useUnicode=true&characterEncoding=utf8
    username: root
    password: root
  #JPA Configuration:
  jpa:
    database: mysql
    show-sql: true
    generate-ddl: true
    hibernate:
      ddl-auto: update
package cn.sgwks.crawlerjd;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling
public class SGWApplication {
    public static void main(String[] args) {
        SpringApplication.run(SGWApplication.class, args);
    }
}
- Crawler entry point JdSpider
package cn.sgwks.crawlerjd.component;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

@Component
public class JdSpider {
    private String startUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8"
            + "&pvid=b618725e7d6846fd98c41d3b55dbf38c";
    //private String startUrl = "https://www.jd.com/";

    @Autowired
    private PageProcessor pageProcessor;
    @Autowired
    private Pipeline pipeline;
    @Autowired
    private JdHttpClientDownloader downloader;

    /**
     * Run the crawl once every 24 hours.
     */
    @Scheduled(fixedRate = 1000 * 60 * 60 * 24)
    public void start() {
        // Configure the proxy servers on the Downloader component.
        // The masked entries from the original listing (e.g. "39.137.69.*") are not
        // valid host names and are omitted here.
        ProxyProvider proxyProvider = SimpleProxyProvider.from(
                new Proxy("39.137.69.6", 80),
                new Proxy("39.137.69.7", 8080),
                new Proxy("150.138.253.73", 808),
                new Proxy("182.92.113.148", 8118)
        );
        downloader.setProxyProvider(proxyProvider);
        Spider.create(pageProcessor)
                .setDownloader(downloader)
                .addPipeline(pipeline)
                .addUrl(startUrl)
                .start();
    }
}
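- Downloader
- Initial URL
  Visit it directly
  Scroll the page
  Take the rendered HTML
  Wrap it in a Page object
- Product detail URL
  Visit it directly
  Take the rendered HTML
  Wrap it in a Page object and return it
- Pagination URL
  Check whether the URL starts with "http://nextpage.com"
  Take the URL of the previous page and visit it
  Click the "next page" button
  Scroll to the bottom of the page
  Take the rendered HTML
  Wrap it in a Page object and return it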
package cn.sgwks.crawlerjd.component;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.selector.PlainText;

import java.util.List;

@Component
public class JdHttpClientDownloader extends HttpClientDownloader {
    private RemoteWebDriver webDriver;

    public JdHttpClientDownloader() {
        // Create the configuration options
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Headless mode (not required while testing)
        chromeOptions.addArguments("--headless");
        // Browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object
        webDriver = new ChromeDriver(chromeOptions);
    }

    @Override
    public Page download(Request request, Task task) {
        try {
            // Get the URL
            String url = request.getUrl();
            // Check whether this is a pagination URL
            if (!url.contains("http://nextpage.com")) {
                // 1. Initial URL (or detail URL): visit it directly
                webDriver.get(url);
                List<WebElement> webElementList = webDriver.findElementsByCssSelector("li.gl-item");
                // Check whether this is a list page
                if (webElementList.size() > 0) {
                    // Scroll the page
                    webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                    Thread.sleep(1000);
                    // Take the rendered HTML
                    String htmlStr = webDriver.getPageSource();
                    // Wrap it in a Page object
                    return createPage(htmlStr, url);
                } else {
                    // 2. Product detail URL (already visited above)
                    // Take the rendered HTML
                    String htmlStr = webDriver.getPageSource();
                    // Wrap it in a Page object and return it
                    return createPage(htmlStr, url);
                }
            } else {
                // 3. Pagination URL (starts with "http://nextpage.com")
                // Take the URL of the previous page and visit it
                String prePageUrl = (String) request.getExtra("url");
                webDriver.get(prePageUrl);
                // Click the "next page" button
                webDriver.findElementByCssSelector("#J_topPage > a.fp-next").click();
                Thread.sleep(1000);
                // Scroll to the bottom of the page
                webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                Thread.sleep(1000);
                // Take the rendered HTML
                String htmlStr = webDriver.getPageSource();
                // Wrap it in a Page object; the second argument is the URL the browser is on after the click
                return createPage(htmlStr, webDriver.getCurrentUrl());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Page.fail();
    }

    @Override
    public void setThread(int thread) {
        // Intentionally empty: a single browser instance is shared
    }

    /**
     * Wraps HTML in a Page object.
     *
     * @param html the rendered HTML
     * @param url  the URL the HTML was loaded from
     * @return the Page to hand to the PageProcessor
     */
    private Page createPage(String html, String url) {
        Page page = new Page();
        // Set the HTML on the page
        page.setRawText(html);
        // Set the URL
        page.setUrl(new PlainText(url));
        // Set the request object
        page.setRequest(new Request(url));
        // Mark the download as successful
        page.setDownloadSuccess(true);
        return page;
    }
}
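- PageProcessor
  1) Check whether the page is a list page.
  2) If it is a list page:
  3) Extract spu and sku from the list page, collect them in a list, and pass it to the pipeline.
  4) Extract the detail-page URLs and add them to the queue.
  5) Build the pagination URL, wrap it in a Request object, and add it to the queue.
  6) If it is a detail page:
  7) Extract the product details, wrap them in an Item object, and pass it to the pipeline.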
package cn.sgwks.crawlerjd.component;

import cn.sgwks.crawlerjd.entity.Item;
import cn.sgwks.crawlerjd.utils.HttpsUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/**
 * Parses the downloaded pages.
 */
@Component
public class JdPageProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        // Get the Html object
        Html html = page.getHtml();
        List<Selectable> nodes = html.css("li.gl-item").nodes();
        // 1) Check whether this is a list page
        if (nodes.size() > 0) {
            // 2) It is a list page
            // 3) Extract spu and sku, collect them in a list, and pass it to the pipeline
            ArrayList<Item> itemList = new ArrayList<>();
            for (Selectable node : nodes) {
                String spu = node.css("li", "data-spu").get();
                String sku = node.css("li", "data-sku").get();
                // Wrap them in an Item object
                Item item = new Item();
                item.setSpu(Long.parseLong(spu));
                item.setSku(Long.parseLong(sku));
                // Add it to the list
                itemList.add(item);
            }
            // Hand the list over to the pipeline
            page.putField("itemList", itemList);
            // 4) Extract the detail-page URLs and add them to the queue
            List<String> urlList = html.css("li.gl-item div.p-img").links().all();
            page.addTargetRequests(urlList);
            // 5) Build the pagination URL, wrap it in a Request object, and add it to the queue
            String nextPageUrl = "http://nextpage.com?url=" + page.getUrl().get();
            Request request = new Request(nextPageUrl);
            request.putExtra("url", page.getUrl().get());
            page.addTargetRequest(request);
        } else {
            // 6) It is a detail page
            // sku
            String sku = html.css("div.preview-info a.follow.J-follow", "data-id").get();
            // Product title
            String title = html.css("div.itemInfo-wrap div.sku-name", "text").get();
            // Product price
            String price = html.css("div.dd span.p-price span.price", "text").get();
            // Product image
            String picUrl = html.css("#spec-img", "src").get();
            String picTitle = html.css("#spec-img", "alt").get();
            downloadImage(picUrl, picTitle);
            // Product URL
            String itemUrl = page.getUrl().get();
            // 7) Wrap the product details in an Item object and pass it to the pipeline
            Item item = new Item();
            item.setSku(Long.parseLong(sku));
            item.setTitle(title);
            item.setPrice(Float.parseFloat(price));
            item.setPic(picUrl);
            item.setUrl(itemUrl);
            // Pass it to the pipeline
            page.putField("item", item);
        }
    }

    @Override
    public Site getSite() {
        return Site.me();
    }

    /**
     * Downloads a product image.
     *
     * @param imgUrl protocol-relative image URL taken from the page
     * @param title  image title, used as the file-name prefix
     */
    private static void downloadImage(String imgUrl, String title) {
        try {
            // Get an HttpClient object
            CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
            // Create an HttpGet object
            HttpGet get = new HttpGet("https:" + imgUrl);
            // Extract the file extension
            String extName = imgUrl.substring(imgUrl.lastIndexOf("."));
            // Build the file name: title prefix with special characters removed
            // (here: everything that is not a letter or digit), plus a short UUID fragment
            String uuid = UUID.randomUUID().toString().substring(0, 5);
            String prefix = title.replaceAll("[^\\p{L}\\p{N}]", "");
            // Cap the prefix at 15 characters without failing on shorter titles
            String fileName = prefix.substring(0, Math.min(15, prefix.length())) + uuid + extName;
            // Target directory: C:\Users\acer\Desktop\jdPhone
            // Send the request and write the response body to disk;
            // try-with-resources closes the stream and the response
            try (CloseableHttpResponse response = httpClient.execute(get);
                 FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\jdPhone\\" + fileName)) {
                HttpEntity entity = response.getEntity();
                entity.writeTo(fos);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
- Utility class HttpsUtils
package cn.sgwks.crawlerjd.utils;

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;

    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate; no identity verification is performed
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            sslsf = new SSLConnectionSocketFactory(builder.build(),
                    new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200); // maximum number of pooled connections
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Returns a client that shares the pooled connection manager
    public static CloseableHttpClient getHttpClient() throws Exception {
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        return httpClient;
    }
}
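- pipeline
  Read the data from the resultItems object:
  1. Get the list data.
  2. If the list data is not null,
  3. insert the list data into the database.
  4. Get the product data.
  5. If the product data is not null,
  6. look up the existing row by sku,
  7. update its fields,
  8. and save it to the database.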
package cn.sgwks.crawlerjd.component;

import cn.sgwks.crawlerjd.dao.ItemDao;
import cn.sgwks.crawlerjd.entity.Item;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Date;
import java.util.List;

/**
 * Data persistence component.
 */
@Component
public class JdPipeline implements Pipeline {
    @Autowired
    private ItemDao itemDao;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Read the data from the resultItems object
        // 1. Get the list data
        List<Item> itemList = resultItems.get("itemList");
        // 2. If the list data is not null
        if (itemList != null) {
            // 3. Insert the list data into the database
            for (Item item : itemList) {
                item.setCreated(new Date());
                item.setUpdated(new Date());
                itemDao.save(item);
            }
        }
        // 4. Get the product data
        Item item = resultItems.get("item");
        // 5. If the product data is not null
        if (item != null) {
            // 6. Look up the existing row by sku
            Item item1 = itemDao.findBySku(item.getSku());
            if (item1 == null) {
                // No row for this sku yet: insert the detail data as a new row
                item1 = item;
                item1.setCreated(new Date());
            } else {
                // 7. Update the existing row
                item1.setTitle(item.getTitle());
                item1.setPrice(item.getPrice());
                item1.setPic(item.getPic());
                item1.setUrl(item.getUrl());
            }
            item1.setUpdated(new Date());
            // 8. Save to the database
            itemDao.save(item1);
        }
    }
}
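The pipeline logic amounts to an upsert keyed by sku: the list page creates a bare row, and the detail page later fills in the remaining columns. The sketch below models that flow with an in-memory map instead of a database, which makes the two steps easy to try out in isolation. InMemoryItemStore and Row are hypothetical stand-ins for the jd_item table and the Item entity, and the sample values are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the jd_item table, keyed by sku. */
final class InMemoryItemStore {

    /** Hypothetical, trimmed-down version of the Item entity. */
    static final class Row {
        final long sku;
        String title;
        Float price;

        Row(long sku) {
            this.sku = sku;
        }
    }

    private final Map<Long, Row> rowsBySku = new HashMap<>();

    /** List-page step: create the row only if this sku has not been seen yet. */
    boolean insertIfAbsent(long sku) {
        if (rowsBySku.containsKey(sku)) {
            return false;
        }
        rowsBySku.put(sku, new Row(sku));
        return true;
    }

    /** Detail-page step: update the row for this sku, creating it if the list step missed it. */
    Row upsertDetail(long sku, String title, Float price) {
        Row row = rowsBySku.computeIfAbsent(sku, Row::new);
        row.title = title;
        row.price = price;
        return row;
    }

    int size() {
        return rowsBySku.size();
    }

    public static void main(String[] args) {
        InMemoryItemStore store = new InMemoryItemStore();
        store.insertIfAbsent(100L);                 // from the list page
        store.insertIfAbsent(100L);                 // a repeated crawl does not duplicate the row
        store.upsertDetail(100L, "Phone A", 1999f); // from the detail page
        System.out.println(store.size());           // 1
    }
}
```

Checking for an existing sku before inserting matters here because the crawl is scheduled to repeat, so the same products are seen again on every run.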
Source: CSDN
Author: sgwks
Link: https://blog.csdn.net/qq_41821006/article/details/104716494