Contents

Python Chapter 3

1. Course plan
2. Timers
3. Using proxies
4. Using Selenium with a headless browser
5. Comprehensive case study

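Python Chapter 3

1. Course plan

Web crawler recap:
Fetching pages: request a URL, receive HTML in the response
HttpClient
Parsing pages:
Jsoup
Crawler framework:
webmagic:
Downloader: downloads pages
PageProcessor: business logic for parsing pages
Pipeline: data persistence
Scheduler: URL queue
Course plan:
Advanced crawler techniques:
1) Timers
2) Using proxies
3) Selenium + headless browser
4) Comprehensive case study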

2. Timers

Timer

Quartz: a scheduling framework
Powerful, but cumbersome to use.
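
For comparison with the Spring approach used in the rest of this chapter, here is a minimal sketch of the JDK's built-in java.util.Timer. The class name TimerDemo, the interval, and the run count are illustrative assumptions, not part of the course code.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TimerDemo {

    /**
     * Runs a task every periodMillis until it has executed `runs` times,
     * then cancels the timer and returns the number of executions.
     */
    public static int runTimes(int runs, long periodMillis) {
        AtomicInteger counter = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        Timer timer = new Timer("demo-timer", true); // daemon thread
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                // Guard so that a late tick cannot exceed the requested count
                if (done.getCount() > 0) {
                    System.out.println("tick " + counter.incrementAndGet());
                    done.countDown();
                }
            }
        }, 0, periodMillis);
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        timer.cancel();
        return counter.get();
    }

    public static void main(String[] args) {
        runTimes(3, 100);
    }
}
```

A Timer runs all of its tasks on a single thread, so a slow task delays the following ones. That is one reason a framework-level scheduler is often preferred for crawlers.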

package cn.sgwks.crawler;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
// Enable scheduled tasks
@EnableScheduling
public class ChromeApplication {
    public static void main(String[] args) {
        SpringApplication.run(ChromeApplication.class, args);
    }
}

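Scheduling in Spring
Spring can integrate Quartz, but the @Scheduled annotation used here is handled by Spring's own task scheduling support, so no Quartz dependency is needed.
Using timers in Spring Boot:
1) @Scheduled
Add this annotation to each method that should run periodically.
2) On the Spring Boot bootstrap class, add
@EnableScheduling
Creating a Spring Boot project (as done in this course):
1) The project is a Maven project
2) The project inherits from the spring-boot-starter-parent project
3) Add the starter dependencies
4) application.yml (or application.properties)
5) A bootstrap class containing the main method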

package cn.sgwks.crawler.test;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.Date;

@Component
public class SchedulerTest {
    @Scheduled(
            /**
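             * fixedDelay: fixed delay in milliseconds between the end of one run and the start of the next (long)
             * fixedDelayString: same as fixedDelay, but the value is a String
             * fixedRate: fixed period in milliseconds between the starts of successive runs (long)
             * fixedRateString: same as fixedRate, but the value is a String
            */
            //fixedDelay = 1000
            //fixedRate = 3000
            // Every 5 seconds:
            //cron = "0/5 * * * * ? "
            // At seconds 0, 2 and 5 of every minute:
            cron = "0,2,5 * * * * ? "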
    )
    public void printTime() {
        System.out.println(new Date().toLocaleString());
    }
}

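Requirement
Print the current time to the console on a schedule.
@Scheduled
fixedDelay: fixed delay in milliseconds, measured from the end of one run to the start of the next. Type long.
fixedDelayString: same as fixedDelay, but the value is a String.
fixedRate: fixed period in milliseconds, measured between the starts of successive runs.
fixedRateString: same as fixedRate, but the value is a String.
More complex schedules should use a cron expression:
The value of the cron attribute is a cron expression, which is simply a string.
This annotation does not support a year field; the expression must have exactly 6 fields.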
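
To make the cron values used in the example concrete, the sketch below expands the seconds field (the first of the 6 fields) into the seconds on which it fires. It is a simplified illustration, not Spring's parser: the class CronSecondsField and its methods are hypothetical, and only `*`, comma lists, and `start/step` are handled, with valid input assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class CronSecondsField {

    /** Expands a seconds field such as "0/5", "0,2,5" or "*" into sorted second values. */
    public static List<Integer> expand(String field) {
        TreeSet<Integer> seconds = new TreeSet<>();
        for (String part : field.trim().split(",")) {
            if (part.equals("*")) {
                addEvery(seconds, 0, 1);
            } else if (part.contains("/")) {
                // "start/step": begin at start, then every step seconds
                String[] startAndStep = part.split("/");
                int start = startAndStep[0].equals("*") ? 0 : Integer.parseInt(startAndStep[0]);
                addEvery(seconds, start, Integer.parseInt(startAndStep[1]));
            } else {
                seconds.add(Integer.parseInt(part));
            }
        }
        return new ArrayList<>(seconds);
    }

    private static void addEvery(TreeSet<Integer> target, int start, int step) {
        for (int s = start; s < 60; s += step) {
            target.add(s);
        }
    }

    public static void main(String[] args) {
        // Split a full expression into its 6 whitespace-separated fields
        String[] fields = "0/5 * * * * ?".trim().split("\\s+");
        System.out.println(fields.length + " fields, seconds = " + expand(fields[0]));
        System.out.println(expand("0,2,5"));
    }
}
```

Running main prints `6 fields, seconds = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]` followed by `[0, 2, 5]`, which matches the two cron values in the SchedulerTest example.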

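3. Using proxies

Use case
Reduce the chance that the target server identifies the crawler and blocks its IP address.
Obtaining proxy servers
Free proxy servers can be found online, for example:
Mimvp proxy
https://proxy.mimvp.com/free.php
Xici free proxy IPs
http://www.xicidaili.com/

Usage
Using a proxy in the webmagic framework.
Create a Downloader object and configure the proxy server on it.
1. Create a PageProcessor object.
2. Create a Downloader object; HttpClientDownloader can be used.
3. Configure the proxy server on the Downloader object.
4. Assemble the crawler with the Spider class.
5. Run the crawler.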

package cn.sgwks.crawler.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class MyPageProcessor implements PageProcessor {

    @Override
    public void process(Page page) {
        String html = page.getHtml().get();
        page.putField("html", html);
    }

    @Override
    public Site getSite() {
        return Site.me();
    }

    public static void main(String[] args) {
        // Create a Downloader component
        HttpClientDownloader downloader = new HttpClientDownloader();
        ProxyProvider proxyProvider = SimpleProxyProvider.from(
                new Proxy("182.61.179.157",8888)
        );
        downloader.setProxyProvider(proxyProvider);
        Spider.create(new MyPageProcessor())
                // Use the custom Downloader component
                .setDownloader(downloader)
                .addUrl("https://www.jd.com/")
                .start();
    }
}
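
SimpleProxyProvider.from(...) also accepts several proxies, as the case study later in this chapter shows. The sketch below illustrates the general idea of handing out proxies in rotation using only the JDK. ProxyRotation and its methods are hypothetical names, the addresses are documentation placeholders, and this is not webmagic's implementation.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ProxyRotation {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger pointer = new AtomicInteger();

    /** Registers an HTTP proxy; call this before the pool is used. */
    public ProxyRotation add(String host, int port) {
        // createUnresolved avoids a DNS lookup while the pool is being built
        proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        return this;
    }

    /** Returns the next proxy in round-robin order; the pool must not be empty. */
    public Proxy next() {
        int index = Math.floorMod(pointer.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static String describe(Proxy proxy) {
        InetSocketAddress address = (InetSocketAddress) proxy.address();
        return address.getHostString() + ":" + address.getPort();
    }

    public static void main(String[] args) {
        ProxyRotation pool = new ProxyRotation()
                .add("192.0.2.10", 8888)
                .add("192.0.2.11", 8080);
        for (int i = 0; i < 3; i++) {
            System.out.println(describe(pool.next()));
        }
    }
}
```

With two proxies, the third call wraps around to the first one again. Free proxies fail often, so a real pool would probably also need to drop entries that stop responding.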

4. Using Selenium with a headless browser

selenium
A front-end testing framework, available for:
java
.net
python
node.js

It controls a browser through code.

package cn.sgwks.crawler.test;

import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

@Component
public class ChromeTest {
    public static void main(String[] args) {
        // Point Selenium at the chromedriver executable
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Enable headless mode (not needed while testing)
        //chromeOptions.addArguments("--headless");
        // Set the browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (ChromeDriver used through its RemoteWebDriver supertype)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://www.jd.com/");
        String title = webDriver.getTitle();
        String h1Name = webDriver.findElementByCssSelector("#logo > h1 > a").getText();
        System.out.println(title);// 京东(JD.COM)-正品低价、品质保障、配送及时、轻松购物！
        System.out.println(h1Name);// 京东
        // Sleep for 2 seconds, then close the browser
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        webDriver.close();
    }
}

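Headless browsers
A browser without a graphical interface.
phantomjs: a headless browser that is no longer updated and will be phased out (for awareness only).
Headless mode of regular browsers:
chrome (recommended)
Firefox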

package cn.sgwks.crawler.test;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ChromeTest2 {
    public static void main(String[] args) {
        // Point Selenium at the chromedriver executable
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Enable headless mode (not needed while testing)
        //chromeOptions.addArguments("--headless");
        // Set the browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (ChromeDriver used through its RemoteWebDriver supertype)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8" +
                "&suggest=1.his.0.0&wq=&pvid=86555b140ee64a70a9cdf1d5c7b836f4");
        // Scroll near the bottom so that the lazily loaded items are requested
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
        // Wait 2 seconds for the remaining items to load
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Select the product list items
        List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
        System.out.println(list.size());
        webDriver.close();
    }
}

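chrome
1) Install the Chrome browser first.
2) Then install the Selenium driver for Chrome (chromedriver). In this course it is placed in the Chrome installation directory; any location works as long as the webdriver.chrome.driver property points to it.
3) Write the code:
1. Add the Selenium jar to the project.
2. Create the browser options.
3. Create a WebDriver object, which represents the browser.
4. Use the WebDriver object to control the browser.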

package cn.sgwks.crawler.test;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ChromeTest3 {
    public static void main(String[] args) {
        // Point Selenium at the chromedriver executable
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Enable headless mode (not needed while testing)
        //chromeOptions.addArguments("--headless");
        // Set the browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (ChromeDriver used through its RemoteWebDriver supertype)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://www.jd.com/");
        // Type "手机" (mobile phone) into the search box
        webDriver.findElementByCssSelector("#key").sendKeys("手机");
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Click the search button
        webDriver.findElementByCssSelector("#search > div > div.form > button").click();
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Scroll near the bottom so that the lazily loaded items are requested
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
        // Wait 2 seconds for the remaining items to load
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Select the product list items
        List<WebElement> list = webDriver.findElementsByCssSelector("li.gl-item");
        System.out.println(list.size());
        /*String html = webDriver.getPageSource();
        System.out.println(html);*/
        webDriver.close();
    }
}
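
The examples above pause with fixed Thread.sleep calls, which can wait longer than necessary or not long enough. One alternative is to poll for a condition until a timeout expires; Selenium ships its own explicit-wait support for this purpose. The following is a minimal JDK-only sketch of the idea. PollingWait and waitUntil are hypothetical names, and in a crawler the condition might be something like "the li.gl-item list is no longer empty".

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition every intervalMillis until it holds or timeoutMillis elapses.
     * Returns true if the condition was met in time, false otherwise.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated "page finished rendering" moment, 300 ms from now
        long readyAt = System.currentTimeMillis() + 300;
        boolean ready = waitUntil(() -> System.currentTimeMillis() >= readyAt, 2000, 50);
        System.out.println("ready = " + ready);
    }
}
```

The helper returns as soon as the condition holds, so a fast page costs little time while a slow page still gets the full timeout.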

Extension: crawl the images from the LOL champion list and download them to the local disk

Utility class HttpsUtils

package cn.sgwks.crawler.test;

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;
    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate without verifying identity (insecure; acceptable only for demos)
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200);//max connection
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static CloseableHttpClient getHttpClient() throws Exception {
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        return httpClient;
    }

}

package cn.sgwks.crawler.test;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;

import java.io.FileOutputStream;
import java.util.List;

@Component
public class ChromeTestLol{

    public static void chromeShow() {
        // Point Selenium at the chromedriver executable
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Enable headless mode
        chromeOptions.addArguments("--headless");
        // Set the browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object (ChromeDriver used through its RemoteWebDriver supertype)
        RemoteWebDriver webDriver = new ChromeDriver(chromeOptions);
        // Control the browser through the WebDriver
        webDriver.get("https://lol.qq.com/data/info-heros.shtml");
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 200)");
        List<WebElement> webElements = webDriver.findElementsByCssSelector("#jSearchHeroDiv > li > a > img");
        System.out.println(webElements.size());
        for (WebElement webElement : webElements) {
            String image = webElement.getAttribute("src");
            String imageName = webElement.getAttribute("alt");
            downPic(image, imageName);
        }
        webDriver.close();
    }

    public static void downPic(String image, String imageName) {
        try {
            // Obtain a client backed by the shared connection pool
            CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
            // Alternative: a default client without the pool
            //CloseableHttpClient httpClient = HttpClients.createDefault();
            HttpGet get = new HttpGet(image);
            CloseableHttpResponse response = httpClient.execute(get);
            HttpEntity entity = response.getEntity();
            String extName = image.substring(image.lastIndexOf("."));
            String fileName  = imageName + extName;
            FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\lol\\"+fileName);
            entity.writeTo(fos);
            // Close the file stream so the data is flushed and the handle is released
            fos.close();
            response.close();
            // With a shared connection pool the client is left open for reuse
            //httpClient.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        chromeShow();
    }
}

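5. Comprehensive case study

Crawl product data from JD.com. Each page must yield 60 items.
Analysis
1) Visit the search URL.
2) After the page has loaded, scroll the page so that the last 30 items are loaded.
3) From the list page, extract spu and sku and save them to the database.
4) From the list page, extract the detail-page URLs and add them to the queue.
5) Handle pagination
http://nextpage.com?url=<URL of the previous page>
6) For a detail page, parse the product information and update the database record matched by sku.
7) Data persistence
A custom pipeline saves the data.
Springboot+SpringDataJpa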
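
Step 5 relies on a marker URL. The next page is reached by clicking a button rather than by following a real link, so a placeholder URL beginning with http://nextpage.com carries the previous page's URL to the downloader. A minimal sketch of building and recognising such a marker follows. The NextPageMarker class and its methods are hypothetical, and URL-encoding the embedded address is an added assumption to keep the marker well-formed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class NextPageMarker {
    private static final String PREFIX = "http://nextpage.com?url=";

    /** Builds the placeholder URL that stands for "the page after previousPageUrl". */
    public static String build(String previousPageUrl) {
        try {
            return PREFIX + URLEncoder.encode(previousPageUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** True if the URL is a pagination placeholder rather than a real page. */
    public static boolean isNextPage(String url) {
        return url.startsWith(PREFIX);
    }

    /** Recovers the previous page's URL from a placeholder built by build(). */
    public static String previousPageUrl(String markerUrl) {
        try {
            return URLDecoder.decode(markerUrl.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String listPage = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8";
        String marker = build(listPage);
        System.out.println(isNextPage(marker));                        // true
        System.out.println(isNextPage(listPage));                      // false
        System.out.println(previousPageUrl(marker).equals(listPage));  // true
    }
}
```

Because each list page produces a different marker, the crawler's URL de-duplication treats every "next page" request as new.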

Project setup
A Spring Boot project
Add the web and JPA starter dependencies.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>cn.sgwks</groupId>
    <artifactId>crawlerchrome-jd</artifactId>
    <version>1.0-SNAPSHOT</version>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <!--WebMagic core-->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!--WebMagic extension-->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>

        <!--Utility library-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!--Unit testing-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
        </dependency>

        <!--MySQL connector-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>

        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.13.0</version>
        </dependency>

    </dependencies>

</project>

Create the entity class and the DAO

package cn.sgwks.crawlerjd.entity;

import javax.persistence.*;
import java.util.Date;

@Entity
@Table(name = "jd_item")
public class Item {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private Long spu;
    private Long sku;
    private String title;
    private Float price;
    private String pic;
    private String url;
    private Date created;
    private Date updated;

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    public Long getSpu() {
        return spu;
    }

    public void setSpu(Long spu) {
        this.spu = spu;
    }

    public Long getSku() {
        return sku;
    }

    public void setSku(Long sku) {
        this.sku = sku;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Float getPrice() {
        return price;
    }

    public void setPrice(Float price) {
        this.price = price;
    }

    public String getPic() {
        return pic;
    }

    public void setPic(String pic) {
        this.pic = pic;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Date getUpdated() {
        return updated;
    }

    public void setUpdated(Date updated) {
        this.updated = updated;
    }
}

package cn.sgwks.crawlerjd.dao;

import cn.sgwks.crawlerjd.entity.Item;
import org.springframework.data.jpa.repository.JpaRepository;

public interface ItemDao extends JpaRepository<Item,Long> {
    Item findBySku(Long sku);
}

Write the configuration file and the bootstrap class.

#DB Configuration:
spring:
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler-sgw?useUnicode=true&characterEncoding=utf8
    username: root
    password: root
  #JPA Configuration:
  jpa:
    database: mysql
    show-sql: true
    generate-ddl: true
    hibernate:
      ddl-auto: update

package cn.sgwks.crawlerjd;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling
public class SGWApplication {
    public static void main(String[] args) {
        SpringApplication.run(SGWApplication.class, args);
    }
}

package cn.sgwks.crawlerjd.component;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

@Component
public class JdSpider {
    private String startUrl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8" +
            "&pvid=b618725e7d6846fd98c41d3b55dbf38c";
    //private String startUrl = "https://www.jd.com/";

    @Autowired
    private PageProcessor pageProcessor;
    @Autowired
    private Pipeline pipeline;
    @Autowired
    private JdHttpClientDownloader downloader;
    /**
     * Crawl once every 24 hours
     */
    @Scheduled(fixedRate = 1000 * 60 * 60 * 24)
    public void start() {
        // Configure the proxies used by the Downloader component
        ProxyProvider proxyProvider = SimpleProxyProvider.from(
                new Proxy("39.137.69.6",80),
                new Proxy("39.137.69.7",8080),
                new Proxy("150.138.253.73",808),
                new Proxy("182.92.113.148",8118),
                // The hosts below are masked in the original post; replace them with real addresses before running
                new Proxy("39.137.69.*",80),
                new Proxy("39.137.69.*",8080),
                new Proxy("150.138.253.**",808),
                new Proxy("182.92.113.***",8118)
        );
        downloader.setProxyProvider(proxyProvider);
        Spider.create(pageProcessor)
                .setDownloader(downloader)
                .addPipeline(pipeline)
                .addUrl(startUrl)
                .start();
    }
}

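Downloader

1) Initial (list page) URL
Visit it directly
Scroll the page
Take the rendered HTML
Wrap it in a Page object
2) Product detail page URL
Visit it directly
Take the rendered HTML
Wrap it in a Page object and return it
3) Pagination URL
Check whether the URL starts with "http://nextpage.com"
Take the previous page's URL and visit it
Click the "next page" button
Scroll to the bottom of the page
Take the rendered HTML
Wrap it in a Page object and return it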

package cn.sgwks.crawlerjd.component;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.selector.PlainText;

import java.util.List;

@Component
public class JdHttpClientDownloader extends HttpClientDownloader {

    private RemoteWebDriver webDriver;

    public JdHttpClientDownloader(){
        // Point Selenium at the chromedriver executable
        System.setProperty("webdriver.chrome.driver",
                "C:\\Users\\acer\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        // Enable headless mode
        chromeOptions.addArguments("--headless");
        // Set the browser window size (optional)
        chromeOptions.addArguments("--window-size=1024,768");
        // Create the WebDriver object
        webDriver = new ChromeDriver(chromeOptions);
    }
    public Page download(Request request, Task task) {
        try {
            // Get the URL
            String url = request.getUrl();
            // Check whether this is a pagination URL
            if (!url.contains("http://nextpage.com")) {
                // 1. Initial URL
                //    Visit it directly
                webDriver.get(url);
                List<WebElement> webElementList = webDriver.findElementsByCssSelector("li.gl-item");
                // Check whether this is a list page
                if (webElementList.size() > 0) {
                    // Scroll the page
                    webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                    Thread.sleep(1000);
                    // Take the rendered HTML
                    String htmlStr = webDriver.getPageSource();
                    // Wrap it in a Page object
                    return createPage(htmlStr, url);
                } else {
                    // 2. Product detail page URL
                    //    Already visited above
                    //    Take the rendered HTML
                    String htmlStr = webDriver.getPageSource();
                    //    Wrap it in a Page object and return it
                    return createPage(htmlStr, url);
                }
            } else {
                // 3. Pagination URL (it starts with "http://nextpage.com")
                //    Take the previous page's URL and visit it
                String prePageUrl = (String) request.getExtra("url");
                webDriver.get(prePageUrl);
                //    Click the "next page" button
                webDriver.findElementByCssSelector("#J_topPage > a.fp-next").click();
                Thread.sleep(1000);
                //    Scroll to the bottom of the page
                webDriver.executeScript("window.scrollTo(0, document.body.scrollHeight - 300)");
                Thread.sleep(1000);
                //    Take the rendered HTML
                String htmlStr = webDriver.getPageSource();
                //    Wrap it in a Page object and return it; the second argument is the URL of the page now shown
                return createPage(htmlStr, webDriver.getCurrentUrl());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Page.fail();
    }
    // Intentionally empty: every request goes through the single shared browser instance
    public void setThread(int thread) { }

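    /**
     * Wraps rendered HTML in a Page object
     * @param html the rendered HTML
     * @param url the URL the HTML was loaded from
     * @return a Page marked as successfully downloaded
     */
    private Page createPage(String html, String url) {
        Page page = new Page();
        // Set the page's HTML
        page.setRawText(html);
        // Set the URL
        page.setUrl(new PlainText(url));
        // Set the Request object
        page.setRequest(new Request(url));
        // Mark the download as successful
        page.setDownloadSuccess(true);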
        return page;
    }
}

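PageProcessor
1) Check whether the page is a list page
2) If it is a list page:
3) Extract spu and sku from the list page, collect them in a list, and pass the list to the pipeline
4) Extract the detail-page URLs and add them to the request queue
5) Build the pagination URL, wrap it in a Request object, and add it to the queue
6) If it is a detail page:
7) Extract the product details, put them in an Item object, and pass it to the pipeline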

package cn.sgwks.crawlerjd.component;

import cn.sgwks.crawlerjd.entity.Item;
import cn.sgwks.crawlerjd.utils.HttpsUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.Selectable;

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/**
 * Parses the downloaded pages
 */
@Component
public class JdPageProcessor implements PageProcessor {
    @Override
    public void process(Page page) {
        // Get the Html object
        Html html = page.getHtml();
        List<Selectable> nodes = html.css("li.gl-item").nodes();
        // 1) Check whether this is a list page
        if (nodes.size() > 0) {
            // 2) It is a list page
            // 3) Extract spu and sku from the list page, collect them, and pass them to the pipeline
            ArrayList<Item> itemList = new ArrayList<>();
            for (Selectable node : nodes) {
                String spu = node.css("li", "data-spu").get();
                String sku = node.css("li", "data-sku").get();
                // Put the values in an Item object
                Item item = new Item();
                item.setSpu(Long.parseLong(spu));
                item.setSku(Long.parseLong(sku));
                // Add it to the list
                itemList.add(item);
            }
            // Hand the list to the pipeline through the result fields
            page.putField("itemList", itemList);
            // 4) Extract the detail-page URLs and add them to the request queue
            List<String> urlList = html.css("li.gl-item div.p-img").links().all();
            page.addTargetRequests(urlList);
            // 5) Build the pagination URL, wrap it in a Request object, and add it to the queue
            String nextPageUrl = "http://nextpage.com?url=" + page.getUrl().get();
            Request request = new Request(nextPageUrl);
            request.putExtra("url", page.getUrl().get());
            page.addTargetRequest(request);
        } else {
            // 6) It is a detail page
            // sku
            String sku = html.css("div.preview-info a.follow.J-follow", "data-id").get();
            // Product title
            String title = html.css("div.itemInfo-wrap div.sku-name", "text").get();
            // Product price
            String price = html.css("div.dd span.p-price span.price", "text").get();
            // Product image
            String picUrl = html.css("#spec-img", "src").get();
            String picTitle = html.css("#spec-img", "alt").get();
            downloadImage(picUrl, picTitle);
            // Product URL
            String itemUrl = page.getUrl().get();
            // 7) Put the product details in an Item object and pass it to the pipeline
            Item item = new Item();
            item.setSku(Long.parseLong(sku));
            item.setTitle(title);
            item.setPrice(Float.parseFloat(price));
            item.setPic(picUrl);
            item.setUrl(itemUrl);
            // Pass it to the pipeline
            page.putField("item", item);
        }
    }

    @Override
    public Site getSite() {
        return Site.me();
    }

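    /**
     * Downloads a product image to the local disk
     * @param imgUrl image URL as found in the page; the code assumes it is protocol-relative (starts with "//")
     * @param title product title, used to build the file name
     */
    private static void downloadImage(String imgUrl, String title) {
        try {
            // Create an HttpClient object
            CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
            // Create an HttpGet object
            HttpGet get = new HttpGet("https:"+imgUrl);
            // Send the request
            CloseableHttpResponse response = httpClient.execute(get);
            // Receive the content returned by the server
            HttpEntity entity = response.getEntity();
            // Extract the file extension
            String extName = imgUrl.substring(imgUrl.lastIndexOf("."));
            // Characters to strip from the title before it is used in a file name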
            String regEx = "[\n`~!@#$%^&*()+=|{}':;',\\[\\].<>/?~！@#￥%……&*（）——+|{}【】‘；：”“’。， 、？]";
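            // Build the file name from the cleaned title plus a short UUID fragment to avoid collisions
            String uuid = UUID.randomUUID().toString().substring(0, 5);
            String prefix = title.replaceAll(regEx, "");
            // Keep at most 15 characters; a shorter title is used whole
            String fileName = prefix.substring(0, Math.min(15, prefix.length())) + uuid + extName;
            // Target directory: C:\Users\acer\Desktop\jdPhone
            // Create a file output stream to save the file to disk
            FileOutputStream fos = new FileOutputStream("C:\\Users\\acer\\Desktop\\jdPhone\\" + fileName);
            // Write the response content to disk
            entity.writeTo(fos);
            // Close the stream
            fos.close();
            // Close the Response object
            response.close();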
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Utility class HttpsUtils

package cn.sgwks.crawlerjd.utils;

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;
    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate without verifying identity (insecure; acceptable only for demos)
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200);//max connection
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static CloseableHttpClient getHttpClient() throws Exception {
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        return httpClient;
    }
}

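pipeline
Read the data from the resultItems object
1. Get the list data
2. If the list data is not null
3. Insert the list data into the database
4. Get the product data
5. If the product data is not null
6. Look up the record by sku
7. Update the record
8. Save it to the database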

package cn.sgwks.crawlerjd.component;

import cn.sgwks.crawlerjd.dao.ItemDao;
import cn.sgwks.crawlerjd.entity.Item;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Date;
import java.util.List;

/**
 * Data persistence component
 */
@Component
public class JdPipeline implements Pipeline {

    @Autowired
    private ItemDao itemDao;

    @Override
    public void process(ResultItems resultItems, Task task) {
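        // Read the data from the resultItems object
        // 1. Get the list data
        List<Item> itemList = resultItems.get("itemList");
        // 2. If the list data is not null
        if (itemList != null) {
            // 3. Insert the list data into the database
            for (Item item : itemList) {
                item.setCreated(new Date());
                item.setUpdated(new Date());
                itemDao.save(item);
            }
        }
        // 4. Get the product data
        Item item = resultItems.get("item");
        // 5. If the product data is not null
        if (item != null) {
            // 6. Look up the stored record by sku
            Item item1 = itemDao.findBySku(item.getSku());
            // If the list page has not stored this sku yet, insert the detail record itself
            if (item1 == null) {
                item1 = item;
                item1.setCreated(new Date());
            }
            // 7. Update the record
            item1.setTitle(item.getTitle());
            item1.setPrice(item.getPrice());
            item1.setPic(item.getPic());
            item1.setUrl(item.getUrl());
            item1.setUpdated(new Date());
            // 8. Save it to the database
            itemDao.save(item1);
        }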
    }
}
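
The pipeline's two branches amount to an insert followed by an update keyed by sku. The sketch below shows that logic against an in-memory map instead of a database. It also handles a sku that is already stored (no duplicate row is created) and a detail page for a sku that has not been seen before (the row is created). SkuStore, Row, and their methods are hypothetical names, and only title and price are modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkuStore {

    /** Simplified stand-in for the Item entity. */
    public static class Row {
        public final long sku;
        public String title;
        public Float price;

        Row(long sku) {
            this.sku = sku;
        }
    }

    private final Map<Long, Row> bySku = new LinkedHashMap<>();

    /** List page: create a row only if the sku is not stored yet. */
    public Row saveFromList(long sku) {
        return bySku.computeIfAbsent(sku, Row::new);
    }

    /** Detail page: fill in the details, creating the row if it is missing. */
    public Row updateFromDetail(long sku, String title, Float price) {
        Row row = saveFromList(sku);
        row.title = title;
        row.price = price;
        return row;
    }

    public Row find(long sku) {
        return bySku.get(sku);
    }

    public int size() {
        return bySku.size();
    }

    public static void main(String[] args) {
        SkuStore store = new SkuStore();
        store.saveFromList(100L);
        store.saveFromList(100L);                         // second crawl, same sku
        store.updateFromDetail(100L, "Phone A", 1999f);
        store.updateFromDetail(200L, "Phone B", 2999f);   // detail page seen first
        System.out.println(store.size() + " rows, sku 100 title = " + store.find(100L).title);
    }
}
```

With Spring Data JPA the same effect would need a findBySku lookup before each save in the list branch, or a unique constraint on the sku column; which one fits better depends on how often the scheduled crawl re-reads the same products.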

Source: CSDN

Author: sgwks

Link: https://blog.csdn.net/qq_41821006/article/details/104716494
