Question
I am trying to scrape data from the HTML table on the Texas Death Row executed offenders page.
I am able to pull the existing data from the table using the spider script below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from texasdeath.items import DeathItem

class DeathSpider(BaseSpider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.select('td[5]/text()').extract()
            item['lastName'] = site.select('td[4]/text()').extract()
            item['Age'] = site.select('td[7]/text()').extract()
            item['Date'] = site.select('td[8]/text()').extract()
            item['Race'] = site.select('td[9]/text()').extract()
            item['County'] = site.select('td[10]/text()').extract()
            yield item
The problem is that there are also links in the table that I need to follow, so that the data from within those linked pages can be appended to my items.
The Scrapy Tutorial seems to have a guide on how to pull data from within a directory, but I am having trouble figuring out how to get the data from the main page and also return the data from the links within the table.
Answer 1:
Instead of yielding an item, yield a Request and pass the item along inside meta. This is covered in the Scrapy documentation.
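In isolation, the pattern looks like the sketch below. The URL, selectors and field names are placeholders, not the real site, and response.urljoin assumes Scrapy 1.0 or newer:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Placeholder URL; the real listing page would go here
    start_urls = ["http://example.com/list.html"]

    def parse(self, response):
        # Collect whatever is available on the listing page (placeholder selectors)
        item = {"title": response.xpath("//h1/text()").extract_first()}
        detail_url = response.urljoin(response.xpath("//a/@href").extract_first())
        # Hand the partially-filled item to the next callback through meta
        yield scrapy.Request(detail_url, meta={"item": item}, callback=self.parse_details)

    def parse_details(self, response):
        # Pull the item back out of meta, add the detail-page field, then yield it
        item = response.meta["item"]
        item["extra"] = response.xpath("//p/text()").extract_first()
        yield item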
Here is a sample implementation of a spider that follows the "Offender Information" link when it leads to the offender "details" page (sometimes the link leads to an image; in that case the spider outputs what it has at that point):
from urlparse import urljoin

import scrapy

class DeathItem(scrapy.Item):
    firstName = scrapy.Field()
    lastName = scrapy.Field()
    Age = scrapy.Field()
    Date = scrapy.Field()
    Race = scrapy.Field()
    County = scrapy.Field()
    Gender = scrapy.Field()

class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            # Follow the "Offender Information" link only when it points to an HTML details page
            url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                # Pass the partially-filled item to the details callback via meta
                yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
            else:
                yield item

    def parse_details(self, response):
        # Retrieve the item from meta and add the field scraped from the details page
        item = response.meta["item"]
        item["Gender"] = response.xpath("//td[. = 'Gender']/following-sibling::td[1]/text()").extract()
        yield item
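For a quick test outside of a full Scrapy project, the spider can be run programmatically and the items written to a JSON feed. This is only a sketch assuming the spider class is importable from the same script; the output file name is illustrative, and recent Scrapy versions prefer the FEEDS setting over FEED_URI/FEED_FORMAT:

from scrapy.crawler import CrawlerProcess

# Run the spider and dump the collected items to a JSON file
process = CrawlerProcess(settings={
    "FEED_URI": "executions.json",   # illustrative output file name
    "FEED_FORMAT": "json",
})
process.crawl(DeathSpider)
process.start()  # blocks until the crawl is finished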
Source: https://stackoverflow.com/questions/37257870/scrapy-getting-data-from-links-within-table