Web Scraping with Scrapy

I have recently been using Scrapy for web scraping. It is very convenient for Python developers, and the full documentation is here: http://doc.scrapy.org/en/0.14/index.html

To scrape pages with Scrapy, first create a new project: scrapy startproject myproject

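Once the project has been created, there is a myproject/myproject subdirectory. It contains items.py (which defines what you want to scrape), pipelines.py (which processes the scraped data, for example by saving it to a database), and a spiders folder where you write the spider scripts.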

As an example, we will scrape book information from a website.

items.py looks like this:


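from scrapy.item import Item, Field

class BookItem(Item):
    # define the fields for your item here like:
    name = Field()
    author = Field()        # the spider assigns item['author'], so it must be declared
    publisher = Field()
    publish_date = Field()
    price = Field()

Everything we want to scrape is now declared on the item: name, author, publisher, publication date, and price.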
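Items behave like dictionaries that only accept the fields declared on the class, so every key the spider assigns needs a matching Field(). The sketch below imitates that rule in plain Python. StrictItem, BookRecord, and the fields tuple are hypothetical names used only for illustration; this is a stand-in and not Scrapy's actual implementation.

```python
class StrictItem(dict):
    """Hypothetical stand-in: a dict that rejects undeclared keys."""

    fields = ()  # subclasses list the allowed field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class BookRecord(StrictItem):
    # Deliberately leaves out "author" to show what happens.
    fields = ("name", "publisher", "publish_date", "price")


record = BookRecord()
record["name"] = "Example Book"      # declared, so this is accepted

try:
    record["author"] = "Someone"     # not declared above, so this is rejected
    rejected = None
except KeyError as exc:
    rejected = str(exc)
```

If a spider fails with a KeyError about an unsupported field, the first thing to check is whether that field is declared in items.py.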

Next, write the spider that crawls the site and extracts the information.

spiders/book.py looks like this:


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from myproject.items import BookItem

class BookSpider(CrawlSpider):
    name = 'bookspider'
    # All addresses below are fictitious; replace them with the real site.
    # allowed_domains must match the domain of the URLs being crawled.
    allowed_domains = ['test_url.com']
    start_urls = [
        "http://test_url.com",   # page where the crawl starts
    ]
    rules = (
        # Pages matching this pattern are not parsed; only their links are followed.
        # "?" and "." are regex metacharacters, so they are escaped.
        Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?page_index=\d+',))),
        # Pages matching this pattern are passed to parse_item for extraction.
        Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?product_id=\d+',)), callback="parse_item"),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        # extract() returns a list; fall back to '' when nothing matched.
        name = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()
        item['name'] = name and name[0] or ''
        # The author is kept as a list because a book may have several.
        item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
        publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
        item['publisher'] = publisher and publisher[0] or ''
        # Capture the date that follows a label ending in a full-width colon.
        publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fffh]+\uff1a([\d-]+)")
        item['publish_date'] = publish_date and publish_date[0] or ''
        # Require at least one digit so the first match is never an empty string.
        prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+(?:\.\d+)?)")
        item['price'] = prices and prices[0] or ''
        return item
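Two regex details in a spider like this are easy to get wrong: a literal "?" in a URL pattern has to be escaped, and a pattern made only of optional parts can match the empty string before it reaches the number. The following is a small sketch using only the standard re module. The URL and the price text are made-up sample values, and first() is a hypothetical helper equivalent to the "x and x[0] or ''" idiom.

```python
import re

# Fictitious URL in the same style as the placeholder addresses in the spider.
url = "http://test_url.com/test?page_index=3"

# Unescaped "?" makes the preceding "t" optional instead of matching a literal "?".
unescaped = re.compile(r"http://test_url\.com/test?page_index=\d+")
escaped = re.compile(r"http://test_url\.com/test\?page_index=\d+")

unescaped_matches = bool(unescaped.search(url))
escaped_matches = bool(escaped.search(url))

# Made-up price text: a currency sign followed by the number.
price_text = u"\u00a525.00"

# Every part of this pattern is optional, so it matches '' at the currency sign.
loose = re.findall(r"(\d*\.*\d*)", price_text)
# Requiring at least one digit skips straight to the number.
strict = re.findall(r"(\d+(?:\.\d+)?)", price_text)


def first(values, default=""):
    """Return the first element of a list, or a default when it is empty."""
    return values[0] if values else default


loose_price = first(loose)
strict_price = first(strict)
missing_price = first([])
```

Here unescaped_matches is False and escaped_matches is True, while loose_price is an empty string and strict_price is "25.00".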

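Once the information has been scraped it needs to be saved, which is the job of pipelines.py. (Scrapy is built on Twisted, so for the details of the database operations see the Twisted documentation; this is only a brief introduction to saving items to a database.)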


from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class MySQLStorePipeline(object):

    def __init__(self):
        # adbapi runs the blocking MySQLdb calls in a thread pool.
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db = 'test',
                user = 'user',
                passwd = '******',
                cursorclass = MySQLdb.cursors.DictCursor,
                charset = 'utf8',
                use_unicode = False
        )

    def process_item(self, item, spider):
        # Schedule the insert; runInteraction returns a Deferred.
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # Only store items that have a name.
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) "
                "values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'],
                 item['price'])
            )

    def handle_error(self, failure):
        # Log database failures instead of letting them pass silently.
        log.err(failure)
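The conditional-insert logic can be tried out without a MySQL server. The sketch below swaps in the standard library's sqlite3 module with an in-memory database, so it is not the MySQLdb/adbapi setup used in the pipeline. The table schema (all text columns) and the sample values are assumptions, and sqlite3 uses "?" placeholders where MySQLdb uses "%s".

```python
import sqlite3


def conditional_insert(cursor, item):
    """Insert the book only when it has a non-empty name; return True if stored."""
    if not item.get("name"):
        return False
    # Values are passed separately from the SQL so the driver escapes them.
    cursor.execute(
        "insert into book (name, publisher, publish_date, price) "
        "values (?, ?, ?, ?)",
        (item["name"], item["publisher"], item["publish_date"], item["price"]),
    )
    return True


# Assumed schema: the original post does not show the book table definition.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table book (name text, publisher text, publish_date text, price text)"
)
cur = conn.cursor()

stored = conditional_insert(cur, {
    "name": "Example Book",
    "publisher": "Example Press",
    "publish_date": "2012-01-01",
    "price": "25.00",
})
skipped = conditional_insert(cur, {
    "name": "",
    "publisher": "",
    "publish_date": "",
    "price": "",
})
conn.commit()

rows = conn.execute("select name, price from book").fetchall()
conn.close()
```

Only the first item reaches the table; the item with an empty name is skipped, which mirrors the item.get('name') check in the pipeline.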

When that is done, register the pipeline in settings.py:


ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']  

Finally, run scrapy crawl bookspider to start crawling.

Article URL: http://www.chengxuyuans.com/Python/39302.html

Source: CSDN

Author: playStudy

Link: https://blog.csdn.net/playStudy/article/details/17304649
