Update (7/29, 9:29pm): After reading this post, I updated my code.
UPDATE (7/28/15, 7:35pm): Following Martin's suggestion, the message changed, but there is still no listing of items or writing to the database.
ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the base page, using a Request with a callback, but it does not work. The spider is here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
        "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')
        for site in sites:
            il = CAPjobsItemLoader(CAPjobsItem(), selector=site)
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)
Now the scraping is partially working, but there is still no loc_pj in the scraped item (UPDATE, 7/29, 7:35pm):
2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}
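A guess at why loc_pj is still missing: the loader passed along in meta still holds the selector built from the listing page, so add_xpath inside parse_subpage queries the old page rather than the job page that was just fetched. Below is a minimal, untested sketch of a parse_subpage that reads the value from the subpage response instead (a drop-in variant of the method above, same XPath):

    def parse_subpage(self, response):
        # The loader arrives via meta, but its selector was built from the
        # listing page; pull loc_pj out of the subpage response instead.
        il = response.meta['il']
        il.add_value(
            'loc_pj',
            response.xpath('//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()').extract())
        yield il.load_item()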
You initialize the ItemLoader like so:
il = CAPjobsItemLoader(CAPjobsItem, sites)
In the documentation it is done like so:
l = ItemLoader(item=Product(), response=response)
So I think you're missing the parentheses after CAPjobsItem (the loader needs an item instance to populate, not the class itself), and your line should read:
il = CAPjobsItemLoader(CAPjobsItem(), sites)
Source: https://stackoverflow.com/questions/31667885/scrapy-can-not-scrape-a-second-page-using-itemloader