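I have recently been using Scrapy for web scraping, and it is very convenient for Python developers. The full documentation is here: http://doc.scrapy.org/en/0.14/index.html
To scrape pages with Scrapy, first create a project: scrapy startproject myproject
Once the project is created there is a myproject/myproject subdirectory containing items.py (which defines what you want to scrape), pipelines.py (which processes the scraped data, for example by saving it to a database), and a spiders folder where the spider scripts go.
This post uses scraping book information from a website as the example.
items.py looks like this: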

    from scrapy.item import Item, Field


    class BookItem(Item):
        # Define the fields for your item here:
        name = Field()
        author = Field()  # assigned by the spider's parse_item
        publisher = Field()
        publish_date = Field()
        price = Field()

Everything we want to scrape is now defined above: name, author, publisher, publication date, and price.
Next, write the spider that scrapes this information from the website.
spiders/book.py looks like this:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import BookItem


    class BookSpider(CrawlSpider):
        name = 'bookspider'
        # All URLs here are fictitious; replace them with the real site when you use this.
        allowed_domains = ['test_url.com']
        start_urls = [
            "http://test_url.com",  # the page where crawling starts
        ]
        rules = (
            # Pages matching this rule are not scraped; only their links are extracted.
            # The "?" is escaped so that it matches a literal question mark.
            Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?page_index=\d+',))),
            # Pages matching this rule are scraped by parse_item.
            Rule(SgmlLinkExtractor(allow=(r'http://test_url\.com/test\?product_id=\d+',)), callback="parse_item"),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            item = BookItem()
            item['name'] = hxs.select('//div[@class="h1_title book_head"]/h1/text()').extract()[0]
            # BookItem must declare an `author` field, otherwise this assignment raises KeyError.
            item['author'] = hxs.select('//div[@class="book_detailed"]/p[1]/a/text()').extract()
            publisher = hxs.select('//div[@class="book_detailed"]/p[2]/a/text()').extract()
            # Take the first match, or fall back to an empty string when nothing matched.
            item['publisher'] = publisher[0] if publisher else ''
            publish_date = hxs.select('//div[@class="book_detailed"]/p[3]/text()').re(u"[\u2e80-\u9fffh]+\uff1a([\d-]+)")
            item['publish_date'] = publish_date[0] if publish_date else ''
            # Require at least one digit so the first match is never an empty string.
            prices = hxs.select('//p[@class="price_m"]/text()').re(r"(\d+(?:\.\d+)?)")
            item['price'] = prices[0] if prices else ''
            return item

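The link patterns and the price pattern in the spider are ordinary regular expressions, so they can be checked with the standard `re` module before starting a crawl. The sketch below is only an illustration: the two URLs sit on the fictitious test_url.com host, the sample price text is made up, and `first_price` is a hypothetical helper rather than part of Scrapy. It shows that an unescaped `?` makes the preceding character optional instead of matching the query-string separator, and that a pattern whose parts are all optional can return an empty first match.

```python
import re

# Hypothetical URLs on the fictitious host used in this post.
LIST_URL = "http://test_url.com/test?page_index=3"
DETAIL_URL = "http://test_url.com/test?product_id=42"

# Unescaped "?" makes the preceding "t" optional; it does not match a literal "?".
unescaped = re.compile(r"http://test_url\.com/test?product_id=\d+")
# Escaped "\?" matches the literal query-string separator.
escaped = re.compile(r"http://test_url\.com/test\?product_id=\d+")

unescaped_hit = unescaped.search(DETAIL_URL) is not None
escaped_hit = escaped.search(DETAIL_URL) is not None
list_hit = escaped.search(LIST_URL) is not None

# Made-up price text, standing in for the text node selected by the XPath.
SAMPLE = "Price: 25.00 yuan"

# Every part of this pattern is optional, so it also matches the empty string
# at the very start of the text.
loose_matches = re.findall(r"(\d*\.*\d*)", SAMPLE)


def first_price(text):
    """Return the first decimal number found in text, or '' if there is none."""
    found = re.findall(r"\d+(?:\.\d+)?", text)
    return found[0] if found else ""


price = first_price(SAMPLE)
```

With the escaped pattern only the product page matches, and `first_price` skips the leading label instead of returning an empty string.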
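Once the information has been scraped it needs to be saved, which is the job of pipelines.py. (Scrapy is built on Twisted, so see the Twisted documentation for the details of the database calls; this is only a brief introduction to saving items into a database.)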

    from scrapy import log
    from twisted.enterprise import adbapi
    import MySQLdb
    import MySQLdb.cursors


    class MySQLStorePipeline(object):

        def __init__(self):
            # adbapi runs the blocking MySQLdb calls in a thread pool.
            self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db='test',
                user='user',
                passwd='******',
                cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8',
                use_unicode=False
            )

        def process_item(self, item, spider):
            query = self.dbpool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            # Only store items that have a name.
            if item.get('name'):
                tx.execute(
                    "insert into book (name, publisher, publish_date, price) "
                    "values (%s, %s, %s, %s)",
                    (item['name'], item['publisher'], item['publish_date'],
                     item['price'])
                )

        def handle_error(self, failure):
            # Log database errors instead of losing them silently.
            log.err(failure)

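The conditional insert can be tried out without a MySQL server. The sketch below is a stand-in under several assumptions, not the pipeline itself: it uses the standard library's `sqlite3` with an in-memory database instead of MySQLdb and adbapi, the `book` table schema is invented for the example, the sample items are made up, and sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def conditional_insert(cursor, item):
    """Insert the item only if it has a non-empty name.

    Returns True when a row was written, False when the item was skipped.
    """
    if not item.get("name"):
        return False
    cursor.execute(
        "insert into book (name, publisher, publish_date, price) "
        "values (?, ?, ?, ?)",  # sqlite3 placeholder style; MySQLdb uses %s
        (item["name"], item["publisher"], item["publish_date"], item["price"]),
    )
    return True


# In-memory database with an assumed schema; every column is stored as text.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "create table book (name text, publisher text, publish_date text, price text)"
)

# Made-up items: one complete record and one without a name.
items = [
    {"name": "Example Book", "publisher": "Example Press",
     "publish_date": "2012-01-01", "price": "25.00"},
    {"name": "", "publisher": "", "publish_date": "", "price": ""},
]

written = [conditional_insert(cur, it) for it in items]
conn.commit()

row_count = cur.execute("select count(*) from book").fetchone()[0]
```

Passing the values as a separate tuple lets the database driver do the quoting, which is the same reason the pipeline passes a tuple to `tx.execute` rather than formatting the SQL string itself.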
After that, register the pipeline in settings.py:

    ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']

Finally, run scrapy crawl bookspider to start the crawl.

Source: CSDN
Author: playStudy
Link: https://blog.csdn.net/playStudy/article/details/17304649