Scrapy getting data from links within table

喜欢而已 submitted on 2019-12-23 02:46:13

Question


I am trying to scrape data from an HTML table, Texas Death Row.

I am able to pull the existing data from the table using the spider script below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from texasdeath.items import DeathItem

class DeathSpider(BaseSpider):
   name = "death"
   allowed_domains = ["tdcj.state.tx.us"]
   start_urls = [
       "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
   ]



   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//table/tbody/tr')
       for site in sites:
           item = DeathItem()
           item['firstName'] = site.select('td[5]/text()').extract()
           item['lastName'] = site.select('td[4]/text()').extract()
           item['Age'] = site.select('td[7]/text()').extract()
           item['Date'] = site.select('td[8]/text()').extract()
           item['Race'] = site.select('td[9]/text()').extract()
           item['County'] = site.select('td[10]/text()').extract()
           yield item

The problem is that the table also contains links. I want to follow them and append the data from the linked pages to my items.

The Scrapy Tutorial seems to have a guide on how to pull data from within a directory, but I am having trouble figuring out how to get the data from the main page and also return data from the links within the table.


Answer 1:


Instead of yielding an item, yield a Request and pass the item inside meta. This pattern is covered in the Scrapy documentation on passing additional data to callback functions.

Sample implementation of a spider that follows the "Offender Information" links when they lead to the offender "details" page (sometimes a link leads to an image instead, in which case the spider outputs what it has so far):

from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

import scrapy


class DeathItem(scrapy.Item):
    firstName = scrapy.Field()
    lastName = scrapy.Field()
    Age = scrapy.Field()
    Date = scrapy.Field()
    Race = scrapy.Field()
    County = scrapy.Field()
    Gender = scrapy.Field()


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            # extract_first() returns None when the cell has no link
            href = site.xpath("td[2]/a/@href").extract_first()
            url = urljoin(response.url, href) if href else ""
            if url.endswith("html"):
                yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
            else:
                yield item

    def parse_details(self, response):
        item = response.meta["item"]
        item["Gender"] = response.xpath("//td[. = 'Gender']/following-sibling::td[1]/text()").extract()
        yield item
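
The check in parse that decides whether to follow a link can be tried out on its own, without running a crawl. Below is a minimal sketch using only the standard library and assuming Python 3. The helper name details_url and the sample href values are hypothetical, made up for illustration; they are not part of the original answer or of the real page.

```python
from urllib.parse import urljoin

LISTING_URL = "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"


def details_url(page_url, href):
    """Return the absolute URL of an HTML details page, or None.

    href is the raw value of the link's href attribute. It may be None
    when the table cell contains no link at all.
    """
    if not href:
        return None
    # Resolve the (possibly relative) href against the page it was found on.
    url = urljoin(page_url, href)
    # Only HTML pages are treated as details pages; other links (for
    # example images) are skipped.
    return url if url.endswith("html") else None


# Hypothetical href values, for illustration only.
print(details_url(LISTING_URL, "dr_info/exampleoffender.html"))
print(details_url(LISTING_URL, "dr_info/exampleoffender.jpg"))
print(details_url(LISTING_URL, None))
```

In a spider, a None result would correspond to the branch that yields the item immediately, and any other result to the branch that yields a Request carrying the item in meta.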


Source: https://stackoverflow.com/questions/37257870/scrapy-getting-data-from-links-within-table
