Question
I'm very new to Python, Scrapy and Selenium. Thus, any help you could provide would be most appreciated.
I'd like to be able to take HTML I've obtained from Selenium as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs in the Selenium Webdriver page source to the list of URLs Scrapy will parse.
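From what I've read, I think the rendered HTML could be wrapped into a response object roughly like this, though I'm not sure this is the right approach:
from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.abcdef.com/page/12345")

#wrap the rendered HTML in a Scrapy response so the usual selectors work on it
selenium_response = HtmlResponse(url=driver.current_url,
                                 body=driver.page_source,
                                 encoding='utf-8')
driver.close()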
Again, any help would be appreciated.
As a quick second question, does anyone know how to view the URLs that are currently in, or have already passed through, Scrapy's list of URLs to be scraped?
Thanks!
*******EDIT******* Here is an example of what I am trying to do. I can't figure out Part 5.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        URL = response.url
        #default to the Scrapy response; replaced below if Selenium is needed
        hxs = sel

        #Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        #Part 2: if not, get the page via the Selenium webdriver
        if a_len == 0:
            #open the webdriver and get the page
            driver = webdriver.Firefox()
            driver.get(response.url)

            #Part 3: interact with the page until the element appears
            while a_len == 0:
                print "ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE"
                #scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")
                #get the page source and check if the last page is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)

            driver.close()

        #Part 4: extract the URLs from the Selenium webdriver page source
        #('//a' selects links anywhere in the document)
        all_URLS = hxs.xpath('//a/@href').extract()

        #Part 5: add all_URLS to the Scrapy URLs to be scraped
Answer 1:
Just yield Request instances from the method and provide a callback:
from scrapy.http import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...
        all_URLS = hxs.xpath('//a/@href').extract()
        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
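Putting this together with the code from the question, a minimal sketch might look like the following (assuming the same marker element, and using urljoin since the extracted hrefs may be relative):
from urlparse import urljoin  # urllib.parse.urljoin on Python 3

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import Selector
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        hxs = Selector(response)

        #if the marker element is missing, render the page with Selenium
        #and rebuild the selector from the rendered HTML
        if not hxs.xpath('//ul/li[@last_page="true"]'):
            driver = webdriver.Firefox()
            driver.get(response.url)
            while not hxs.xpath('//ul/li[@last_page="true"]'):
                driver.execute_script("window.scrollTo(0, 1000000000);")
                hxs = Selector(text=driver.page_source)
            driver.close()

        #Part 5: schedule every extracted link for scraping
        for href in hxs.xpath('//a/@href').extract():
            yield Request(urljoin(response.url, href), callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
        pass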
Source: https://stackoverflow.com/questions/23632198/pass-selenium-html-string-to-scrapy-to-add-urls-to-scrapy-list-of-urls-to-scrape