Submit form that renders dynamically with Scrapy?

血红的双手 · Submitted on 2020-01-13 06:43:08

Question


I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML on the page that corresponds to a successful login.

I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element in the DOM, but I was wondering whether it is possible to "give control back" to Scrapy after getting the fully rendered HTML, so that it can carry out the form submission and save the necessary cookies, session data, etc. needed to scrape the page.
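For illustration, here is roughly what I had in mind for the "hand the rendered HTML back to Scrapy" part. This is only a minimal sketch; the URL and selector are placeholders, not the real Coursera ones:

from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # placeholder URL

# Wrap the fully rendered page in a Scrapy response object so that
# Scrapy selectors can be used on the Selenium-rendered DOM
response = HtmlResponse(url=driver.current_url,
                        body=driver.page_source,
                        encoding='utf-8')
form_action = response.css('form::attr(action)').get()
driver.quit()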

Basically, the only reason I thought Selenium was necessary is that the page needs to render its JavaScript before Scrapy can find the <form> element. Are there any alternatives to this, however?

Thank you!

Edit: This question is similar to this one, but unfortunately the accepted answer deals with the Requests library instead of Selenium or Scrapy. Though that approach may be possible in some cases (watch this to learn more), as alecxe points out, Selenium may be required if "parts of the page [such as forms] are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser".
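For reference, the approach from that accepted answer would translate to Scrapy roughly as below, assuming the login POST endpoint and field names can be read from the browser's network tab. Everything here (URL, field names, spider name) is hypothetical, not taken from Coursera:

import scrapy


class LoginSketchSpider(scrapy.Spider):
    name = 'login_sketch'

    def start_requests(self):
        # Hypothetical endpoint and form fields -- use the real ones observed in devtools
        yield scrapy.FormRequest(
            'https://example.com/api/login',
            formdata={'email': 'user@example.com', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookies automatically for subsequent requests
        yield scrapy.Request('https://example.com/account', callback=self.parse_account)

    def parse_account(self, response):
        self.logger.info('Logged-in page title: %s', response.css('title::text').get())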


Answer 1:


Scrapy alone is not actually a great fit for the Coursera site, since the site is heavily asynchronous: parts of the page are loaded via API calls and inserted into the page with the help of JavaScript code executed in the browser. Scrapy is not a browser and cannot handle that.

Which raises the question: why not use the publicly available Coursera API?

Aside from what is documented, there are other endpoints that you can see being called in the browser developer tools; you need to be authenticated to be able to use them. For example, if you are logged in and open the list of courses you've taken, you can see a call to the memberships.v1 endpoint.

For the sake of an example, let's start Selenium, log in and grab the cookies with get_cookies(). Then, let's yield a Request to the memberships.v1 endpoint to get the list of archived courses, passing along the cookies we got from Selenium:

import json

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Placeholder credentials -- replace with your own Coursera login
LOGIN = 'email'
PASSWORD = 'password'

class CourseraSpider(scrapy.Spider):
    name = "courseraSpider"
    allowed_domains = ["coursera.org"]

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.maximize_window()
        self.driver.get('https://www.coursera.org/login')

        form = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@data-js='login-body']//div[@data-js='facebook-button-divider']/following-sibling::form")))
        email = WebDriverWait(form, 10).until(EC.visibility_of_element_located((By.ID, 'user-modal-email')))
        email.send_keys(LOGIN)

        # find_element_by_* helpers were removed in Selenium 4; use find_element(By..., ...)
        password = form.find_element(By.NAME, 'password')
        password.send_keys(PASSWORD)

        login = form.find_element(By.XPATH, '//button[. = "Log In"]')
        login.click()

        WebDriverWait(self.driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[. = 'My Courses']")))

        self.driver.get('https://www.coursera.org/')
        # get_cookies() returns a list of cookie dicts (name, value, domain, path, ...)
        # that can be passed directly to scrapy.Request via its cookies argument
        cookies = self.driver.get_cookies()

        # quit() shuts the browser down completely once Scrapy takes over
        self.driver.quit()

        courses_url = 'https://www.coursera.org/api/memberships.v1'
        params = {
            'fields': 'courseId,enrolledTimestamp,grade,id,lastAccessedTimestamp,role,v1SessionId,vc,vcMembershipId,courses.v1(display,partnerIds,photoUrl,specializations,startDate,v1Details),partners.v1(homeLink,name),v1Details.v1(sessionIds),v1Sessions.v1(active,dbEndDate,durationString,hasSigTrack,startDay,startMonth,startYear),specializations.v1(logo,name,partnerIds,shortName)&includes=courseId,vcMembershipId,courses.v1(partnerIds,specializations,v1Details),v1Details.v1(sessionIds),specializations.v1(partnerIds)',
            'q': 'me',
            'showHidden': 'false',
            'filter': 'archived'
        }

        # dict.iteritems() is Python 2 only; items() works on Python 3
        params = '&'.join(key + '=' + value for key, value in params.items())
        yield scrapy.Request(courses_url + '?' + params, cookies=cookies)

    def parse(self, response):
        data = json.loads(response.body)

        # Print the name of each archived course returned by the API
        for course in data['linked']['courses.v1']:
            print(course['name'])

For me, it prints:

Algorithms, Part I
Computing for Data Analysis
Pattern-Oriented Software Architectures for Concurrent and Networked Software
Computer Networks

Which proves that we can pass the cookies obtained by Selenium to Scrapy and successfully extract data from pages that are only available to logged-in users.
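Assuming the spider above is saved as, say, coursera_spider.py (the filename is arbitrary), it can be run without creating a full Scrapy project, provided a ChromeDriver matching your Chrome version is on the PATH:

scrapy runspider coursera_spider.py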


Additionally, make sure you don't violate the rules from the Terms of Use, specifically:

In addition, as a condition of accessing the Sites, you agree not to ... (c) use any high-volume, automated or electronic means to access the Sites (including without limitation, robots, spiders, scripts or web-scraping tools);



Source: https://stackoverflow.com/questions/29179519/submit-form-that-renders-dynamically-with-scrapy
