Node.js crawlers

Introduction to crawlers

Crawling APIs
- Use axios
- Suitable for crawling API-type endpoints
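
As a rough illustration of crawling an API with axios, the sketch below fetches a JSON endpoint and picks one field out of each record. The URL, the `{ items: [{ title }] }` payload shape, and the `extractTitles` and `crawlApi` names are assumptions made up for this example, not part of any real API.

```javascript
// Hypothetical helper: pull the `title` field out of a payload shaped like
// { items: [{ title: '...' }, ...] }. The shape is an assumption for this sketch.
function extractTitles(payload) {
  const items = Array.isArray(payload && payload.items) ? payload.items : []
  return items
    .map((item) => item.title)
    .filter((title) => typeof title === 'string')
}

// Fetch a JSON endpoint with axios and return the extracted titles.
async function crawlApi(url) {
  // Loaded here so that extractTitles can be reused without axios installed
  const axios = require('axios')
  const response = await axios.get(url, {
    timeout: 10000, // give up after 10 seconds
    headers: { Accept: 'application/json' }
  })
  // axios parses JSON response bodies into response.data
  return extractTitles(response.data)
}

// Usage (placeholder URL):
// crawlApi('https://example.com/api/list').then(console.log).catch(console.error)
```

Keeping the extraction step separate from the HTTP call makes it easy to check against a saved response before pointing the crawler at a live endpoint.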

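Crawling pages

Using request + cheerio

Suitable for server-side rendered sites that return the HTML page directly

cheerio's API is similar to jQuery's
Documentation

About encoding conversion with request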

const cheerio = require('cheerio')
const request = require('request')

// Used to decode response bodies that are not UTF-8
const iconv = require('iconv-lite')

const url = 'http://top.baidu.com/category?c=10&fr=topindex'
const options = {
  url,
  encoding: null // tell request not to convert the Buffer into a string
}

request(options, (err, response, body) => {
  if (err) {
    console.error(err)
    return
  }
  // console.log(body.toString())	// would decode as utf8 by default

  // Work out which encoding the response uses:
  // prefer the charset declared in the Content-Type header,
  // then fall back to a charset declaration inside the HTML itself
  const contentType = response.headers['content-type'] || ''
  const res =
    contentType.match(/charset=["']?([\w-]+)/i) ||
    body.toString().match(/charset=["']?([\w-]+)/i)
  let encoding = res ? res[1] : 'utf8'
  if (!iconv.encodingExists(encoding)) {
    encoding = 'utf8' // unknown charset name, fall back to utf8
  }

  body = iconv.decode(body, encoding)
  const $ = cheerio.load(body)
  const result = []
  $('a.list-title').each((index, item) => {
    result.push($(item).text())
  })
  console.log(result)
})

   >  DEMO

   ```javascript
   const cheerio = require('cheerio')
   const request = require('request')
   
   const url = 'https://juejin.im/timeline'
   
   request(url, (err, response, body) => {
     if (err) return console.error(err)
     // console.log(body);
     const $ = cheerio.load(body) // load() is synchronous, so no await is needed
     const results = []
   
     // Read the text of each matched element individually; calling .text()
     // on the whole selection would concatenate the text of every match
     $('.sale').each((index, item) => {
       results.push({
         text: $(item).text(),
       })
     })
     console.log(results);
   })
   
   /**
   Expected output, given the three .sale elements the page had at the time:
   [
     { text: '最高可省 15 元' },
     { text: '最高可省 15 元' },
     { text: '最高可省 15 元' }
   ]
   **/

```

Pages rendered on the client in a decoupled front-end/back-end setup, e.g. Vue or React
- Use puppeteer
- puppeteer documentation
- puppeteer documentation (Chinese)
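
For client-rendered pages, a minimal puppeteer sketch might look like the following. It opens the page in a headless browser, waits for network activity to settle so the front-end framework has a chance to render, and then reads the text of the matched elements. The `crawlRenderedPage` and `collectText` names, the URL, and the selector are placeholders chosen for this example.

```javascript
// Runs inside the browser page: turn the matched elements into trimmed text.
function collectText(nodes) {
  return nodes.map((node) => node.textContent.trim())
}

// Open the page in headless Chrome and extract text from `selector`.
async function crawlRenderedPage(url, selector) {
  // Loaded here so that collectText can be reused without puppeteer installed
  const puppeteer = require('puppeteer')
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // 'networkidle2' waits until there are at most 2 open network connections
    // for 500 ms, which gives client-side rendering time to finish
    await page.goto(url, { waitUntil: 'networkidle2' })
    // $$eval runs collectText in the page with an array of matching elements
    return await page.$$eval(selector, collectText)
  } finally {
    // Always close the browser, even if navigation or extraction fails
    await browser.close()
  }
}

// Usage (placeholder URL and selector):
// crawlRenderedPage('https://example.com', '.title').then(console.log)
```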

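Crawler steps

- Send an HTTP request to fetch the page content
- Use jQuery-like syntax to traverse the page and extract the data you need
- Save the data to a database so it can be queried
- Set up a server to display the data
- Optionally crawl on a schedule
- Keep the program running reliably
- Convert character encodings where needed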
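
The steps above can be sketched as a small pipeline. The fetch, extract, and save steps are passed in as functions, so the sketch does not depend on any particular HTTP client, parser, or database. `runCrawl`, `schedule`, and the step names are hypothetical names used only for this illustration.

```javascript
// One crawl: fetch the page, extract records, persist them.
// fetchHtml, extract and save are supplied by the caller.
async function runCrawl({ fetchHtml, extract, save }, url) {
  const html = await fetchHtml(url) // 1. HTTP request
  const records = extract(html) // 2. jQuery-like extraction, e.g. with cheerio
  await save(records) // 3. write to a database
  return records.length
}

// Run a task every `intervalMs` milliseconds. Errors are logged instead of
// crashing the process, and a run is skipped if the previous one is still busy.
function schedule(task, intervalMs) {
  let running = false
  const tick = async () => {
    if (running) return
    running = true
    try {
      await task()
    } catch (err) {
      console.error('crawl failed:', err.message)
    } finally {
      running = false
    }
  }
  const timer = setInterval(tick, intervalMs)
  // Return a function that stops the schedule
  return () => clearInterval(timer)
}

// Usage (all three step functions are placeholders you would write yourself):
// const stop = schedule(() => runCrawl({ fetchHtml, extract, save }, url), 60 * 60 * 1000)
```

Catching errors inside the scheduled tick is one simple way to keep a long-running crawler alive when a single request fails.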

Core libraries

request

Sending email

nodemailer: a simple Node.js module for sending email
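
A minimal sketch of mailing crawl results with nodemailer could look like this. The SMTP host, the addresses, the `SMTP_USER` and `SMTP_PASS` environment variables, and the `buildMailOptions` and `sendReport` names are all placeholders assumed for the example.

```javascript
// Build the message from a list of crawled titles (addresses are placeholders).
function buildMailOptions(titles) {
  return {
    from: 'crawler@example.com',
    to: 'me@example.com',
    subject: `Crawl finished: ${titles.length} item(s)`,
    text: titles.join('\n')
  }
}

// Send the report through an SMTP server.
async function sendReport(titles) {
  // Loaded here so that buildMailOptions can be reused without nodemailer installed
  const nodemailer = require('nodemailer')
  const transporter = nodemailer.createTransport({
    host: 'smtp.example.com', // placeholder SMTP server
    port: 465,
    secure: true, // use TLS from the start of the connection
    auth: {
      user: process.env.SMTP_USER,
      pass: process.env.SMTP_PASS
    }
  })
  return transporter.sendMail(buildMailOptions(titles))
}

// Usage:
// sendReport(['first title', 'second title']).catch(console.error)
```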

Fixing garbled text

Check the charset declared in content-type
Check whether the response is compressed (content-encoding)
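
Both checks can be done by looking at the response headers. The helper below is a small sketch; `inspectHeaders` is a name invented for this example, and it assumes the lower-cased header keys that Node.js exposes on `response.headers`.

```javascript
// Report the declared charset and compression of a response.
// `headers` is expected to use lower-case keys, as Node.js responses do.
function inspectHeaders(headers) {
  const contentType = headers['content-type'] || ''
  // Matches e.g. "text/html; charset=GBK" or 'charset="utf-8"'
  const match = contentType.match(/charset=["']?([\w-]+)/i)
  return {
    // null means the header does not declare a charset
    charset: match ? match[1].toLowerCase() : null,
    // 'identity' means the body is not compressed
    compression: (headers['content-encoding'] || 'identity').toLowerCase()
  }
}

// Usage inside a request callback:
// console.log(inspectHeaders(response.headers))
```

If `charset` is something other than utf-8, the body needs decoding with a library such as iconv-lite; if `compression` is not `identity`, the body has to be decompressed first.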

Handling compressed responses

how to unzip gzip response in request
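
With the request library, passing `gzip: true` in the options asks request to decode compressed responses for you. When the raw Buffer has to be handled by hand, Node's built-in zlib module can do the same job. The sketch below assumes the body is a Buffer (for example from `encoding: null`); `decompressBody` is a name chosen for this example.

```javascript
const zlib = require('zlib')

// Decompress a response body according to its Content-Encoding header.
// Unknown or missing encodings return the buffer unchanged.
function decompressBody(buffer, contentEncoding) {
  switch ((contentEncoding || '').toLowerCase()) {
    case 'gzip':
      return zlib.gunzipSync(buffer)
    case 'deflate':
      // Assumes the zlib-wrapped form of deflate that HTTP specifies
      return zlib.inflateSync(buffer)
    case 'br':
      return zlib.brotliDecompressSync(buffer)
    default:
      return buffer
  }
}

// Usage inside a request callback (with encoding: null):
// const raw = decompressBody(body, response.headers['content-encoding'])
```

Decompression has to happen before any charset conversion, because the charset applies to the uncompressed bytes.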

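If a request comes back with a gateway error such as 502 Bad Gateway or 504 Gateway Timeout, the target site may be blocking this kind of access. Adding headers that make the request look like it comes from a browser is usually enough to get through.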

// Add browser-like information to the request headers
const headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
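
One possible way to apply such headers to every request is a small wrapper that merges them into the options object. `BROWSER_HEADERS` and `withBrowserHeaders` are hypothetical names used for this sketch, and the URL in the usage comment is a placeholder.

```javascript
// Browser-like defaults; the User-Agent string is the one quoted in this article
const BROWSER_HEADERS = {
  'User-Agent':
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

// Return a copy of `options` with the browser-like headers added.
// Headers already set by the caller take precedence over the defaults.
function withBrowserHeaders(options) {
  return {
    ...options,
    headers: { ...BROWSER_HEADERS, ...(options.headers || {}) }
  }
}

// Usage with the request library:
// request(withBrowserHeaders({ url: 'http://example.com', encoding: null }), callback)
```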

Source: CSDN

Author: lijie627239856

Link: https://blog.csdn.net/lijie627239856/article/details/103598697
