Python Chapter 3


一、Course Plan

  • Web crawler:
    Fetching pages: request the URL, receive the HTML response
    HttpClient
    Parsing pages:
    using Jsoup
    Crawler framework:
    webmagic:
    Downloader: the download component
    PageProcess: the page-parsing business logic
    Pipeline: data persistence
    Scheduler: the URL queue
    Plan for this chapter:
    advanced crawler techniques:
    1) Scheduled tasks
    2) Using proxies
    3) Selenium + a headless browser
    4) A comprehensive case study

二、Timers

  1. java.util.Timer: the JDK's built-in scheduler; a minimal sketch follows.
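
    • A minimal java.util.Timer sketch (illustrative only; the class name, package and the one-second period are assumptions, not part of the original notes):

      package cn.sgwks.crawler.test;

      import java.util.Date;
      import java.util.Timer;
      import java.util.TimerTask;

      public class TimerTest {
          public static void main(String[] args) {
              Timer timer = new Timer();
              //run the task after a 0 ms initial delay, then repeat every 1000 ms
              timer.schedule(new TimerTask() {
                  @Override
                  public void run() {
                      System.out.println(new Date());
                  }
              }, 0, 1000);
          }
      }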

  2. Quartz: a scheduling framework.
    Powerful, but cumbersome to use.

    • package cn.sgwks.crawler;
      
      import org.springframework.boot.SpringApplication;
      import org.springframework.boot.autoconfigure.SpringBootApplication;
      import org.springframework.scheduling.annotation.EnableScheduling;
      
      @SpringBootApplication
      //enable scheduled-task support
      @EnableScheduling
      public class ChromeApplication {
          public static void main(String[] args) {
              SpringApplication.run(ChromeApplication.class, args);
          }
      }
      
  3. Scheduling in Spring
    Spring ships its own scheduling support (Quartz can also be integrated for more complex needs).
    Using scheduled tasks in a Spring Boot project:
    1) @Scheduled
    Add this annotation to the method that should run periodically.
    2) On the Spring Boot bootstrap class, add
    @EnableScheduling
    Creating the Spring Boot project:
    1) The project must be a Maven project.
    2) The project must inherit from spring-boot-starter-parent.
    3) Add the starter dependencies.
    4) Add application.yml (or application.properties).
    5) Add the bootstrap class containing the main method.

    • package cn.sgwks.crawler.test;
      
      import org.springframework.scheduling.annotation.Scheduled;
      import org.springframework.stereotype.Component;
      
      import java.util.Date;
      
      @Component
      public class SchedulerTest {
          @Scheduled(
                  /**
                   * fixedDelay: fixed delay (a long) between the end of one run and the start of the next.
                   * fixedDelayString: same as fixedDelay, but the value is given as a String.
                   * fixedRate: run at a fixed rate (the period is measured between start times).
                   * fixedRateString: String form of fixedRate, used the same way.
                  */
                  //fixedDelay = 1000
                  //fixedRate = 3000
                  //cron = "0/5 * * * * ? "
                  cron = "0,2,5 * * * * ? "
          )
          public void printTime() {
              System.out.println(new Date().toLocaleString()); //print the current time (Date.toLocaleString() is deprecated; kept here for brevity)
          }
      }
      
  4. Requirement
    Print the current time to the console on a schedule.
    @Scheduled
    fixedDelay: fixed delay (a long) between the end of one run and the start of the next.
    fixedDelayString: same as fixedDelay, but the value is given as a String.
    fixedRate: run at a fixed rate.
    fixedRateString: String form of fixedRate, used the same way.
    Complex schedules should use a cron expression:
    the cron attribute takes the cron expression as a plain String.
    The annotation does not support a year field, so the expression has exactly six fields; see the annotated example below.
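
    • An annotated cron example (a sketch; the class name and the specific expression are illustrative and not taken from the original notes; it only runs when @EnableScheduling is present on the bootstrap class, as shown above). Spring's @Scheduled cron expression has six space-separated fields and no year field:

      package cn.sgwks.crawler.test;

      import java.util.Date;

      import org.springframework.scheduling.annotation.Scheduled;
      import org.springframework.stereotype.Component;

      @Component
      public class CronExampleTest {
          //field order: second minute hour day-of-month month day-of-week
          //"0/5" -> every 5 seconds, starting at second 0
          //"*"   -> every minute / hour / day-of-month / month
          //"?"   -> no specific day of week
          @Scheduled(cron = "0/5 * * * * ?")
          public void everyFiveSeconds() {
              System.out.println(new Date());
          }
      }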

三、Using Proxies

  1. Use case
    Keep the target server from identifying the client as a crawler (requests appear to come from the proxy's IP address).

  2. Obtaining a proxy server
    Free proxy servers can be found online, for example:
    mimvp proxy
    https://proxy.mimvp.com/free.php
    Xici free proxy IPs
    http://www.xicidaili.com/

  3. How to use
    Use the proxy inside the webmagic framework:
    create a Downloader object and configure the proxy server on it.
    1) Create a PageProcessor object.
    2) Create a Downloader object; HttpClientDownloader can be used.
    3) Configure the proxy server on the Downloader object.
    4) Assemble the crawler with the Spider class.
    5) Run the crawler.

    • package cn.sgwks.crawler.test;
      
      import us.codecraft.webmagic.Page;
      import us.codecraft.webmagic.Site;
      import us.codecraft.webmagic.Spider;
      import us.codecraft.webmagic.downloader.HttpClientDownloader;
      import us.codecraft.webmagic.processor.PageProcessor;
      import us.codecraft.webmagic.proxy.Proxy;
      import us.codecraft.webmagic.proxy.ProxyProvider;
      import us.codecraft.webmagic.proxy.SimpleProxyProvider;
      
      public class MyPageProcessor implements PageProcessor {
      
          @Override
          public void process(Page page) {
              String html = page.getHtml().get();
              page.putField("html", html);
          }
      
          @Override
          public Site getSite() {
              return Site.me();
          }
      
          public static void main(String[] args) {
              //create a Downloader component
              HttpClientDownloader downloader = new HttpClientDownloader();
              ProxyProvider proxyProvider = SimpleProxyProvider.from(
                      new Proxy("182.61.179.157",8888)
              );
              downloader.setProxyProvider(proxyProvider);
              Spider.create(new MyPageProcessor())
                      //set the custom Downloader component
                      .setDownloader(downloader)
                      .addUrl("https://www.jd.com/")
                      .start();
          }
      }
      

四、Selenium + Headless Browser

  1. Selenium
    A front-end / browser automation testing framework with bindings for
    Java
    .NET
    Python
    Node.js

    It controls a real browser from code.

    • package cn.sgwks.crawler.test;
      
      import org.openqa.selenium.chrome.ChromeDriver;
      import org.openqa.selenium.chrome.ChromeOptions;
      import org.openqa.selenium.remote.RemoteWebDriver;
      import org.springframework.stereotype.Component;
      
      @Component
      public class ChromeTest {
          public static void main(String[] args) {
              //set up the driver configuration
              System.setProperty("webdriver.chrome.driver",
                      "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
              ChromeOptions chromeOptions = new ChromeOptions();
              //enable headless mode (not required for this test)
              //chromeOptions.addArguments("--headless");
              //set the browser window size (optional)
              chromeOptions.addArguments("--window-size=1024,768");
              //create the WebDriver object (assigned polymorphically)
              RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
              //drive the browser through the WebDriver
              webDriver.get("https://www.jd.com/");
              String title = webDriver.getTitle();
              String h1Name = webDriver.findElementByCssSelector("#logo > h1 > a").getText();
              System.out.println(title);//京东(JD.COM)-正品低价、品质保障、配送及时、轻松购物!
              System.out.println(h1Name);//京东
              //sleep for 2 seconds, then close the browser
              try {
                  Thread.sleep(2000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              webDriver.close();
          }
      }
      
  2. Headless browsers
    Browsers without a graphical interface.
    PhantomJS: a dedicated headless browser that is no longer maintained and will eventually be obsolete (for awareness only).
    Headless mode of ordinary browsers:
    Chrome (recommended)
    Firefox

    • package cn.sgwks.crawler.test;
      
      import org.openqa.selenium.WebElement;
      import org.openqa.selenium.chrome.ChromeDriver;
      import org.openqa.selenium.chrome.ChromeOptions;
      import org.openqa.selenium.remote.RemoteWebDriver;
      import org.springframework.stereotype.Component;
      
      import java.util.List;
      
      @Component
      public class ChromeTest2 {
          public static void main(String[] args) {
              //set up the driver configuration
              System.setProperty("webdriver.chrome.driver",
                      "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
              ChromeOptions chromeOptions = new ChromeOptions();
              //enable headless mode (not required for this test)
              //chromeOptions.addArguments("--headless");
              //set the browser window size (optional)
              chromeOptions.addArguments("--window-size=1024,768");
              //create the WebDriver object (assigned polymorphically)
              RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
              //drive the browser through the WebDriver
              webDriver.get("https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8" +
                      "&suggest=1.his.0.0&wq=&pvid=86555b140ee64a70a9cdf1d5c7b836f4");
              webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
              //sleep for 2 seconds, then close the browser
              try {
                  Thread.sleep(2000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              //select the product list items
              List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
              System.out.println(list.size());
              webDriver.close();
          }
      }
      
  3. Chrome
    1) Install the Chrome browser.
    2) Install the matching chromedriver (Selenium's driver for Chrome); place it in the directory where Chrome is installed, and point webdriver.chrome.driver at it.
    3) Write the code:
    1. Add the Selenium jar to the project.
    2. Create the browser configuration options.
    3. Create a WebDriver object, which represents the browser.
    4. Control the browser through the WebDriver object.

    • package cn.sgwks.crawler.test;
      
      import org.openqa.selenium.WebElement;
      import org.openqa.selenium.chrome.ChromeDriver;
      import org.openqa.selenium.chrome.ChromeOptions;
      import org.openqa.selenium.remote.RemoteWebDriver;
      import org.springframework.stereotype.Component;
      
      import java.util.List;
      
      @Component
      public class ChromeTest3 {
          public static void main(String[] args) {
              //set up the driver configuration
              System.setProperty("webdriver.chrome.driver",
                      "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
              ChromeOptions chromeOptions = new ChromeOptions();
              //enable headless mode (not required for this test)
              //chromeOptions.addArguments("--headless");
              //set the browser window size (optional)
              chromeOptions.addArguments("--window-size=1024,768");
              //create the WebDriver object (assigned polymorphically)
              RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
              //drive the browser through the WebDriver
              webDriver.get("https://www.jd.com/");
              //type the keyword 手机 (mobile phone) into the search box
              webDriver.findElementByCssSelector("#key").sendKeys("手机");
              try {
                  Thread.sleep(2000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              //click the search button
              webDriver.findElementByCssSelector("#search > div > div.form > button").click();
              try {
                  Thread.sleep(2000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
              //sleep for 2 seconds, then close the browser
              try {
                  Thread.sleep(2000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              //select the product list items
              List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
              System.out.println(list.size());
              /*String html = webDriver.getPageSource();
              System.out.println(html);*/
              webDriver.close();
          }
      }
      
  4. Extension: crawl the LOL champion images and download them to the local disk

    • Utility class HttpsUtils

      package cn.sgwks.crawler.test;
      
      import org.apache.http.config.Registry;
      import org.apache.http.config.RegistryBuilder;
      import org.apache.http.conn.socket.ConnectionSocketFactory;
      import org.apache.http.conn.socket.PlainConnectionSocketFactory;
      import org.apache.http.conn.ssl.NoopHostnameVerifier;
      import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
      import org.apache.http.conn.ssl.TrustStrategy;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.apache.http.impl.client.HttpClients;
      import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
      import org.apache.http.ssl.SSLContextBuilder;
      
      import java.security.cert.CertificateException;
      import java.security.cert.X509Certificate;
      
      public class HttpsUtils {
          private static final String HTTP = "http";
          private static final String HTTPS = "https";
          private static SSLConnectionSocketFactory sslsf = null;
          private static PoolingHttpClientConnectionManager cm = null;
          private static SSLContextBuilder builder = null;
          static {
              try {
                  builder = new SSLContextBuilder();
                  // trust all certificates; skip identity verification
                  builder.loadTrustMaterial(null, new TrustStrategy() {
                      @Override
                      public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                          return true;
                      }
                  });
                  sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
                  Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                          .register(HTTP, new PlainConnectionSocketFactory())
                          .register(HTTPS, sslsf)
                          .build();
                  cm = new PoolingHttpClientConnectionManager(registry);
                  cm.setMaxTotal(200);//max connection
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      
          public static CloseableHttpClient getHttpClient() throws Exception {
              CloseableHttpClient httpClient = HttpClients.custom()
                      .setSSLSocketFactory(sslsf)
                      .setConnectionManager(cm)
                      .setConnectionManagerShared(true)
                      .build();
              return httpClient;
          }
      
      }
      
    • package cn.sgwks.crawler.test;
      
      import org.apache.http.HttpEntity;
      import org.apache.http.client.methods.CloseableHttpResponse;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.openqa.selenium.WebElement;
      import org.openqa.selenium.chrome.ChromeDriver;
      import org.openqa.selenium.chrome.ChromeOptions;
      import org.openqa.selenium.remote.RemoteWebDriver;
      import org.springframework.stereotype.Component;
      
      import java.io.FileOutputStream;
      import java.util.List;
      
      @Component
      public class ChromeTestLol{
      
          public static void chromeShow() {
              //set up the driver configuration
              System.setProperty("webdriver.chrome.driver",
                      "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
              ChromeOptions chromeOptions = new ChromeOptions();
              //enable headless mode (not required when testing interactively)
              chromeOptions.addArguments("--headless");
              //set the browser window size (optional)
              chromeOptions.addArguments("--window-size=1024,768");
              //create the WebDriver object (assigned polymorphically)
              RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
              //drive the browser through the WebDriver
              webDriver.get("https://lol.qq.com/data/info-heros.shtml");
              try {
                  Thread.sleep(1000);
              } catch (InterruptedException e) {
                  e.printStackTrace();
              }
              webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 200)");
              List<WebElement> webElements = webDriver.findElementsByCssSelector("#jSearchHeroDiv > li > a > img");
              System.out.println(webElements.size());
              for (WebElement webElement : webElements) {
                  String image = webElement.getAttribute("src");
                  String imageName = webElement.getAttribute("alt");
                  downPic(image, imageName);
              }
              webDriver.close();
          }
      
          public static void downPic(String image, String imageName) {
              try {
                  //this HttpClient uses the pooled connection manager
                  CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
                  //default (non-pooled) initialization:
                  //CloseableHttpClient httpClient = HttpClients.createDefault();
                  HttpGet get = new HttpGet(image);
                  CloseableHttpResponse response = httpClient.execute(get);
                  HttpEntity entity = response.getEntity();
                  String extName = image.substring(image.lastIndexOf("."));
                  String fileName  = imageName + extName;
                  FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\lol\\"+fileName);
                  entity.writeTo(fos);
                  response.close();
                  //with the pooled connection manager the client is not closed here; otherwise a new one would have to be created each time
                  //httpClient.close();
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
          public static void main(String[] args) {
              chromeShow();
          }
      }
      

五、Comprehensive Case Study

  1. Crawl product data from JD.com; each search result page must yield all 60 items.

  2. Analysis
    1) Visit the search URL.
    2) After the page loads, scroll the page so the lazily loaded items render, then take the remaining 30 items (60 per page in total).
    3) Extract the spu and sku from the list page and save them to the database.
    4) Collect the detail-page URLs from the list page and add them to the crawl queue.
    5) Pagination: enqueue a synthetic URL of the form
    http://nextpage.com?url=<URL of the previous page>
    (see the sketch after this list).
    6) For a detail page, parse the product information and update the database record, matched by sku.
    7) Persistence:
    a custom pipeline saves the data, built on
    Springboot+SpringDataJpa
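
    • A minimal sketch of the synthetic pagination scheme from step 5 (the class name NextPageUrlDemo is made up for illustration; Request is webmagic's class and is used the same way by the components later in this section):

      package cn.sgwks.crawlerjd.component;

      import us.codecraft.webmagic.Request;

      public class NextPageUrlDemo {
          static final String NEXT_PAGE_PREFIX = "http://nextpage.com";

          //build the fake "next page" request and remember the real previous-page URL as an extra
          static Request buildNextPageRequest(String currentListPageUrl) {
              Request request = new Request(NEXT_PAGE_PREFIX + "?url=" + currentListPageUrl);
              request.putExtra("url", currentListPageUrl);
              return request;
          }

          //the downloader recognises the fake URL by its prefix and re-opens the remembered page
          static boolean isNextPageRequest(Request request) {
              return request.getUrl().startsWith(NEXT_PAGE_PREFIX);
          }

          public static void main(String[] args) {
              Request r = buildNextPageRequest("https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA");
              System.out.println(r.getUrl());           //http://nextpage.com?url=https://search.jd.com/...
              System.out.println(isNextPageRequest(r)); //true
              System.out.println(r.getExtra("url"));    //the real previous-page URL
          }
      }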

  3. Project setup
    A Spring Boot project.
    Add the web and jpa starter dependencies.

    • <?xml version="1.0" encoding="UTF-8"?>
      <project xmlns="http://maven.apache.org/POM/4.0.0"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
          <modelVersion>4.0.0</modelVersion>
      
          <groupId>cn.sgwks</groupId>
          <artifactId>crawlerchrome-jd</artifactId>
          <version>1.0-SNAPSHOT</version>
          <parent>
              <groupId>org.springframework.boot</groupId>
              <artifactId>spring-boot-starter-parent</artifactId>
              <version>2.0.2.RELEASE</version>
          </parent>
      
          <properties>
              <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
              <maven.compiler.source>1.8</maven.compiler.source>
              <maven.compiler.target>1.8</maven.compiler.target>
          </properties>
          <dependencies>
              <!--WebMagic core-->
              <dependency>
                  <groupId>us.codecraft</groupId>
                  <artifactId>webmagic-core</artifactId>
                  <version>0.7.3</version>
                  <exclusions>
                      <exclusion>
                          <groupId>org.slf4j</groupId>
                          <artifactId>slf4j-log4j12</artifactId>
                      </exclusion>
                  </exclusions>
              </dependency>
              <!--WebMagic extension-->
              <dependency>
                  <groupId>us.codecraft</groupId>
                  <artifactId>webmagic-extension</artifactId>
                  <version>0.7.3</version>
              </dependency>
      
              <!--utility library-->
              <dependency>
                  <groupId>org.apache.commons</groupId>
                  <artifactId>commons-lang3</artifactId>
              </dependency>
              <!--Spring MVC-->
              <dependency>
                  <groupId>org.springframework.boot</groupId>
                  <artifactId>spring-boot-starter-web</artifactId>
              </dependency>
      
              <!--Spring Data JPA-->
              <dependency>
                  <groupId>org.springframework.boot</groupId>
                  <artifactId>spring-boot-starter-data-jpa</artifactId>
              </dependency>
      
              <!--unit testing-->
              <dependency>
                  <groupId>org.springframework.boot</groupId>
                  <artifactId>spring-boot-starter-test</artifactId>
              </dependency>
      
              <!--MySQL connector-->
              <dependency>
                  <groupId>mysql</groupId>
                  <artifactId>mysql-connector-java</artifactId>
              </dependency>
      
              <dependency>
                  <groupId>org.seleniumhq.selenium</groupId>
                  <artifactId>selenium-java</artifactId>
                  <version>3.13.0</version>
              </dependency>
      
          </dependencies>
      
      </project>
      

    Create the entity class and DAO.

    • package cn.sgwks.crawlerjd.entity;
      
      import javax.persistence.*;
      import java.util.Date;
      
      @Entity
      @Table(name = "jd_item")
      public class Item {
          @Id
          @GeneratedValue(strategy = GenerationType.IDENTITY)
          private Long id;
          private Long spu;
          private Long sku;
          private String title;
          private Float price;
          private String pic;
          private String url;
          private Date created;
          private Date updated;
      
          public Long getId() {
              return id;
          }
      
          public void setId(Long id) {
              this.id = id;
          }
      
          public Long getSpu() {
              return spu;
          }
      
          public void setSpu(Long spu) {
              this.spu = spu;
          }
      
          public Long getSku() {
              return sku;
          }
      
          public void setSku(Long sku) {
              this.sku = sku;
          }
      
          public String getTitle() {
              return title;
          }
      
          public void setTitle(String title) {
              this.title = title;
          }
      
          public Float getPrice() {
              return price;
          }
      
          public void setPrice(Float price) {
              this.price = price;
          }
      
          public String getPic() {
              return pic;
          }
      
          public void setPic(String pic) {
              this.pic = pic;
          }
      
          public String getUrl() {
              return url;
          }
      
          public void setUrl(String url) {
              this.url = url;
          }
      
          public Date getCreated() {
              return created;
          }
      
          public void setCreated(Date created) {
              this.created = created;
          }
      
          public Date getUpdated() {
              return updated;
          }
      
          public void setUpdated(Date updated) {
              this.updated = updated;
          }
      }
      
      
    • package cn.sgwks.crawlerjd.dao;
      
      import cn.sgwks.crawlerjd.entity.Item;
      import org.springframework.data.jpa.repository.JpaRepository;
      
      public interface ItemDao extends JpaRepository<Item,Long> {
          Item findBySku(Long sku);
      }
      
      

    Write the configuration file and the bootstrap class.

    • #DB Configuration:
      spring:
        datasource:
          driver-class-name: com.mysql.jdbc.Driver
          url: jdbc:mysql://127.0.0.1:3306/crawler-sgw?useUnicode=true&characterEncoding=utf8
          username: root
          password: root
        #JPA Configuration:
        jpa:
          database: mysql
          show-sql: true
          generate-ddl: true
          hibernate:
            ddl-auto: update
      
    • package cn.sgwks.crawlerjd;
      
      import org.springframework.boot.SpringApplication;
      import org.springframework.boot.autoconfigure.SpringBootApplication;
      import org.springframework.scheduling.annotation.EnableScheduling;
      
      @SpringBootApplication
      @EnableScheduling
      public class SGWApplication {
          public static void main(String[] args) {
              SpringApplication.run(SGWApplication.class, args);
          }
      }
      
    • package cn.sgwks.crawlerjd.component;
      
      import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.scheduling.annotation.Scheduled;
      import org.springframework.stereotype.Component;
      import us.codecraft.webmagic.Spider;
      import us.codecraft.webmagic.pipeline.Pipeline;
      import us.codecraft.webmagic.processor.PageProcessor;
      import us.codecraft.webmagic.proxy.Proxy;
      import us.codecraft.webmagic.proxy.ProxyProvider;
      import us.codecraft.webmagic.proxy.SimpleProxyProvider;
      
      @Component
      public class JdSpider {
          private String startUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8" +
                  "&pvid=b618725e7d6846fd98c41d3b55dbf38c";
          //private String startUrl = "https://www.jd.com/";
      
          @Autowired
          private PageProcessor pageProcessor;
          @Autowired
          private Pipeline pipeline;
          @Autowired
          private JdHttpClientDownloader downloader;
          /**
           * run the crawl once every 24 hours
           */
          @Scheduled(fixedRate = 1000 * 60 * 60 * 24)
          public void start() {
              //configure the proxy provider used by the Downloader component
              ProxyProvider proxyProvider = SimpleProxyProvider.from(
                      new Proxy("39.137.69.6",80),
                      new Proxy("39.137.69.7",8080),
                      new Proxy("150.138.253.73",808),
                      new Proxy("182.92.113.148",8118),
                      new Proxy("39.137.69.*",80),
                      new Proxy("39.137.69.*",8080),
                      new Proxy("150.138.253.**",808),
                      new Proxy("182.92.113.***",8118)
              );
              downloader.setProxyProvider(proxyProvider);
              Spider.create(pageProcessor)
                      .setDownloader(downloader)
                      .addPipeline(pipeline)
                      .addUrl(startUrl)
                      .start();
          }
      }
      
  4. Downloader

    • Initial (search) URL
      visit it directly
      scroll the page
      take the rendered HTML
      wrap it into a Page object

    • Product detail-page URL
      visit it directly
      take the rendered HTML
      wrap it into a Page object and return it

    • Pagination URL
      check whether the URL starts with "http://nextpage.com"
      take the previous page's URL and open it
      click the "next page" button
      scroll toward the bottom of the page
      take the rendered HTML
      wrap it into a Page object and return it

    • package cn.sgwks.crawlerjd.component;
      
      import org.openqa.selenium.WebElement;
      import org.openqa.selenium.chrome.ChromeDriver;
      import org.openqa.selenium.chrome.ChromeOptions;
      import org.openqa.selenium.remote.RemoteWebDriver;
      import org.springframework.stereotype.Component;
      import us.codecraft.webmagic.Page;
      import us.codecraft.webmagic.Request;
      import us.codecraft.webmagic.Task;
      import us.codecraft.webmagic.downloader.HttpClientDownloader;
      import us.codecraft.webmagic.selector.PlainText;
      
      import java.util.List;
      
      @Component
      public class JdHttpClientDownloader extends HttpClientDownloader {
      
          private RemoteWebDriver webDriver;
      
          public JdHttpClientDownloader(){
              //set up the driver configuration
              System.setProperty("webdriver.chrome.driver",
                      "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
              ChromeOptions chromeOptions = new ChromeOptions();
              //enable headless mode (not required when testing interactively)
              chromeOptions.addArguments("--headless");
              //set the browser window size (optional)
              chromeOptions.addArguments("--window-size=1024,768");
              //create the WebDriver object
              webDriver = new ChromeDriver(chromeOptions);
          }
          public Page download(Request request, Task task) {
              try {
                  //get the URL
                  String url = request.getUrl();
                  //check whether this is the pagination URL
                  if (!url.contains("http://nextpage.com")) {
                      //1. initial (search) URL
                      //   visit it directly
                      webDriver.get(url);
                      List<WebElement> webElementList = webDriver.findElementsByCssSelector("li.gl-item");
                      //check whether this is a list page
                      if (webElementList.size() > 0) {
                          //scroll the page
                          webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                          Thread.sleep(1000);
                          //take the rendered html
                          String htmlStr = webDriver.getPageSource();
                          //wrap it into a Page object
                          return createPage(htmlStr, url);
                      } else {
                          //2. product detail-page URL
                          //   visit it directly
                          //   take the rendered html
                          String htmlStr = webDriver.getPageSource();
                          //   wrap it into a Page object and return it
                          return createPage(htmlStr, url);
                      }
                  } else {
                      //3. pagination URL
                      //   check whether the URL starts with "http://nextpage.com"
                      //   take the previous page's URL and open it
                      String prePageUrl = (String) request.getExtra("url");
                      webDriver.get(prePageUrl);
                      //   click the "next page" button
                      webDriver.findElementByCssSelector("#J_topPage > a.fp-next").click();
                      Thread.sleep(1000);
                      //   scroll toward the bottom of the page
                      webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                      Thread.sleep(1000);
                      //   take the rendered html
                      String htmlStr = webDriver.getPageSource();
                      //   wrap it into a Page object and return it; the second argument is the URL the browser is now on
                      return createPage(htmlStr, webDriver.getCurrentUrl());
                  }
              } catch (Exception e) {
                  e.printStackTrace();
              }
              return Page.fail();
          }
          @Override
          public void setThread(int thread) { super.setThread(thread); } //keep the parent's thread-pool handling
      
          /**
           * wrap the html into a Page object
           * @param html
           * @param url
           * @return
           */
          private Page createPage(String html, String url) {
              Page page = new Page();
              //set the raw html on the page
              page.setRawText(html);
              //set the url
              page.setUrl(new PlainText(url));
              //set the request object
              page.setRequest(new Request(url));
              //mark the download as successful
              page.setDownloadSuccess(true);
              return page;
          }
      }
      
  5. PageProcess
    1) Check whether the page is a list page.
    2) If it is a list page:
    3) extract the spu and sku from the list page, put them in a list and pass it to the pipeline;
    4) collect the detail-page URLs and add them to the crawl queue;
    5) build the pagination URL, wrap it in a Request object and add it to the queue.
    6) If it is a detail page:
    7) extract the product details, fill an Item object and pass it to the pipeline.

    • package cn.sgwks.crawlerjd.component;
      
      import cn.sgwks.crawlerjd.entity.Item;
      import cn.sgwks.crawlerjd.utils.HttpsUtils;
      import org.apache.http.HttpEntity;
      import org.apache.http.client.methods.CloseableHttpResponse;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.springframework.stereotype.Component;
      import us.codecraft.webmagic.Page;
      import us.codecraft.webmagic.Request;
      import us.codecraft.webmagic.Site;
      import us.codecraft.webmagic.processor.PageProcessor;
      import us.codecraft.webmagic.selector.Html;
      import us.codecraft.webmagic.selector.Selectable;
      
      import java.io.FileOutputStream;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.UUID;
      
      /**
       * parses the crawled pages
       */
      @Component
      public class JdPageProcessor implements PageProcessor {
          @Override
          public void process(Page page) {
              //get the Html object
              Html html = page.getHtml();
              List<Selectable> nodes = html.css("li.gl-item").nodes();
              //1) check whether this is a list page
              if (nodes.size() > 0) {
                  //2) it is a list page
                  //3) extract spu and sku from the list page, collect them into a list and pass it to the pipeline
                  ArrayList<Item> itemList = new ArrayList<>();
                  for (Selectable node : nodes) {
                      String spu = node.css("li", "data-spu").get();
                      String sku = node.css("li", "data-sku").get();
                      //fill an Item object
                      Item item = new Item();
                      item.setSpu(Long.parseLong(spu));
                      item.setSku(Long.parseLong(sku));
                      //add it to the list
                      itemList.add(item);
                  }
                  //put the list into the result items so the pipeline receives it
                  page.putField("itemList", itemList);
                  //4) collect the detail-page URLs and add them to the crawl queue
                  List<String> urlList = html.css("li.gl-item div.p-img").links().all();
                  page.addTargetRequests(urlList);
                  //5) build the pagination URL, wrap it in a Request and add it to the queue
                  String nextPageUrl = "http://nextpage.com?url=" + page.getUrl().get();
                  Request request = new Request(nextPageUrl);
                  request.putExtra("url", page.getUrl().get());
                  page.addTargetRequest(request);
              } else {
                  //6) this is a detail page
                  //sku
                  String sku = html.css("div.preview-info a.follow.J-follow", "data-id").get();
                  //product title
                  String title = html.css("div.itemInfo-wrap div.sku-name", "text").get();
                  //product price
                  String price = html.css("div.dd span.p-price span.price", "text").get();
                  //product image
                  String picUrl = html.css("#spec-img", "src").get();
                  String picTitle = html.css("#spec-img", "alt").get();
                  downloadImage(picUrl, picTitle);
                  //product url
                  String itemUrl = page.getUrl().get();
                  //7) take the product details, fill an Item object and pass it to the pipeline
                  Item item = new Item();
                  item.setSku(Long.parseLong(sku));
                  item.setTitle(title);
                  item.setPrice(Float.parseFloat(price));
                  item.setPic(picUrl);
                  item.setUrl(itemUrl);
                  //pass it to the pipeline
                  page.putField("item", item);
              }
          }
      
          @Override
          public Site getSite() {
              return Site.me();
          }
      
          /**
           * download an image
           * @param imgUrl
           * @param title
           */
          private static void downloadImage(String imgUrl, String title) {
              try {
                  //create an HttpClient object
                  CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
                  //create an HttpGet object
                  HttpGet get = new HttpGet("https:"+imgUrl);
                  //send the request
                  CloseableHttpResponse response = httpClient.execute(get);
                  //receive the response body
                  HttpEntity entity = response.getEntity();
                  //extract the file extension
                  String extName = imgUrl.substring(imgUrl.lastIndexOf("."));
                  //build a safe file name: strip special characters from the title
                  String regEx = "[\n`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~!@#¥%……&*()——+|{}【】‘;:”“’。, 、?]";
                  //append a short uuid to keep file names unique
                  String uuid = UUID.randomUUID().toString().substring(0, 5);
                  String prefix = title.replaceAll(regEx, "");
                  String fileName = prefix.substring(0, Math.min(15, prefix.length())) + uuid + extName;
                  //target directory: C:\Users\acer\Desktop\jdPhone
                  //create a file output stream to save the file to disk
                  FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\jdPhone\\" + fileName);
                  //write the entity content to disk
                  entity.writeTo(fos);
                  //close the stream
                  fos.close();
                  //close the Response object
                  response.close();
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      }
      
    • Utility class HttpsUtils

      package cn.sgwks.crawlerjd.utils;
      
      import org.apache.http.config.Registry;
      import org.apache.http.config.RegistryBuilder;
      import org.apache.http.conn.socket.ConnectionSocketFactory;
      import org.apache.http.conn.socket.PlainConnectionSocketFactory;
      import org.apache.http.conn.ssl.NoopHostnameVerifier;
      import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
      import org.apache.http.conn.ssl.TrustStrategy;
      import org.apache.http.impl.client.CloseableHttpClient;
      import org.apache.http.impl.client.HttpClients;
      import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
      import org.apache.http.ssl.SSLContextBuilder;
      
      import java.security.cert.CertificateException;
      import java.security.cert.X509Certificate;
      
      public class HttpsUtils {
          private static final String HTTP = "http";
          private static final String HTTPS = "https";
          private static SSLConnectionSocketFactory sslsf = null;
          private static PoolingHttpClientConnectionManager cm = null;
          private static SSLContextBuilder builder = null;
          static {
              try {
                  builder = new SSLContextBuilder();
                  // trust all certificates; skip identity verification
                  builder.loadTrustMaterial(null, new TrustStrategy() {
                      @Override
                      public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                          return true;
                      }
                  });
                  sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
                  Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                          .register(HTTP, new PlainConnectionSocketFactory())
                          .register(HTTPS, sslsf)
                          .build();
                  cm = new PoolingHttpClientConnectionManager(registry);
                  cm.setMaxTotal(200);//max connection
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      
          public static CloseableHttpClient getHttpClient() throws Exception {
              CloseableHttpClient httpClient = HttpClients.custom()
                      .setSSLSocketFactory(sslsf)
                      .setConnectionManager(cm)
                      .setConnectionManagerShared(true)
                      .build();
              return httpClient;
          }
      }
      
  6. Pipeline
    Take the data out of the ResultItems object:
    1. get the list data;
    2. if the list data is not null,
    3. insert the list records into the database;
    4. get the single product data;
    5. if the product data is not null,
    6. look up the existing record by sku,
    7. update it with the new values,
    8. and save it to the database.

    • package cn.sgwks.crawlerjd.component;
      
      import cn.sgwks.crawlerjd.dao.ItemDao;
      import cn.sgwks.crawlerjd.entity.Item;
      import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.stereotype.Component;
      import us.codecraft.webmagic.ResultItems;
      import us.codecraft.webmagic.Task;
      import us.codecraft.webmagic.pipeline.Pipeline;
      
      import java.util.Date;
      import java.util.List;
      
      /**
       * data persistence component
       */
      @Component
      public class JdPipeline implements Pipeline {
      
          @Autowired
          private ItemDao itemDao;
      
          @Override
          public void process(ResultItems resultItems, Task task) {
              //take the data out of the ResultItems object
              //1. get the list data
              List<Item> itemList = resultItems.get("itemList");
              //2. if the list data is not null
              if (itemList != null) {
                  //3. insert the list records into the database
                  for (Item item : itemList) {
                      item.setCreated(new Date());
                      item.setUpdated(new Date());
                      itemDao.save(item);
                  }
              }
              //4. get the product data
              Item item = resultItems.get("item");
              //5. if the product data is not null
              if (item != null) {
                  //6. look up the existing record by sku
                  Item item1 = itemDao.findBySku(item.getSku());
                  //7. update it with the new values
                  item1.setTitle(item.getTitle());
                  item1.setPrice(item.getPrice());
                  item1.setPic(item.getPic());
                  item1.setUrl(item.getUrl());
                  item1.setUpdated(new Date());
                  //8. save it to the database
                  itemDao.save(item1);
              }
          }
      }
      