web-crawler

Selenium with PhantomJS: Form being validated but not submitted

Submitted by 家住魔仙堡 on 2020-01-15 18:48:34
Question: I'm having a strange problem submitting a form through Selenium WebDriver's PhantomJS API. Upon clicking the submit button, the form gets validated (are the username and password too short, or blank, etc.), but it never actually gets submitted. That is, if I submit an invalid form and check the screenshot, there are alert notifications; if I submit a valid form, nothing happens. The JS on the page is supposed to validate the form and then submit it when the submit button is clicked.
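
The question is truncated above, but a common workaround for this symptom is to submit the form element directly, or to wait explicitly after the click, since PhantomJS can race ahead of the page's JavaScript. A minimal sketch, assuming hypothetical element IDs and a placeholder URL:

```python
# Sketch only: 'login-form', 'username', 'password', and the URL are
# illustrative placeholders, not taken from the original question.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()  # PhantomJS support was removed in Selenium 4
driver.get('http://example.com/login')

driver.find_element_by_id('username').send_keys('user')
driver.find_element_by_id('password').send_keys('secret')

# Submitting the form element can work where click() appears to do nothing:
driver.find_element_by_id('login-form').submit()

# Wait for navigation instead of assuming the submit completed instantly:
WebDriverWait(driver, 10).until(EC.url_changes('http://example.com/login'))
driver.save_screenshot('after_submit.png')
```

Note the trade-off: form.submit() bypasses the button's click handler, so if the validation is bound to the click it will be skipped; sometimes the explicit wait alone is the actual fix.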

DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname

Submitted by 自闭症网瘾萝莉.ら on 2020-01-15 12:24:11
Question: This question is an extension of the resolved question here, i.e. Crawling LinkedIn while authenticated with Scrapy (@Gates). I keep the base of the script the same, only adding my own session_key and session_password, and changing the start URL for my use-case, as below. class LinkedPySpider(InitSpider): name = 'Linkedin' allowed_domains = ['linkedin.com'] login_page = 'https://www.linkedin.com/uas/login' start_urls=[
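
The error message shows Scrapy failing to resolve the literal placeholder hostname 'your.proxy.com', which strongly suggests a copied proxy setting was never replaced with a real one. A minimal sketch of the two usual fixes (the proxy address below is a placeholder):

```python
# Sketch: either clear a leftover proxy from the environment, or point
# requests at a real proxy. 'real.proxy.example:8080' is a placeholder.
import os

# 1) Scrapy's HttpProxyMiddleware picks up http_proxy/https_proxy from the
#    environment; clear them if a placeholder value snuck in there:
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

# 2) Or, inside the spider, route a request through an actual proxy:
#    yield Request(url, meta={'proxy': 'http://real.proxy.example:8080'})
```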

“TypeError: 'Rule' object is not iterable” webscraping an .aspx page in python

Submitted by 别等时光非礼了梦想. on 2020-01-15 11:17:09
Question: I am using the following code to scrape this website (http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1); however, I get the following TypeError: "File "C:\Users\Anaconda2\lib\site-packages\scrapy\contrib\spiders\crawl.py", line 83, in _compile_rules self._rules = [copy.copy(r) for r in self.rules] TypeError: 'Rule' object is not iterable". I don't have any code of my own on line 83 (the traceback points into Scrapy itself), so I'm wondering if anyone has ideas on how to resolve the issue. I'm using Python 2.7.
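
The traceback points into Scrapy's own _compile_rules, which iterates over self.rules; a bare Rule assigned directly is not iterable, which is the usual cause of this exact error. A sketch using the era-appropriate scrapy.contrib imports seen in the traceback (the link pattern and names are placeholders):

```python
from scrapy.contrib.spiders import CrawlSpider, Rule   # scrapy.spiders in modern Scrapy
from scrapy.contrib.linkextractors import LinkExtractor

class PhysicianSpider(CrawlSpider):
    name = 'physicians'
    allowed_domains = ['profiles.ehs.state.ma.us']
    start_urls = ['http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1']

    # Wrong -- a bare Rule is not iterable:
    #   rules = Rule(LinkExtractor(allow=r'Page=\d+'))
    # Right -- wrap it in a tuple (the trailing comma matters):
    rules = (
        Rule(LinkExtractor(allow=r'Page=\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        pass  # extraction logic elided
```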

Selecting an option from an optgroup with selenium and python

Submitted by 好久不见. on 2020-01-15 11:06:33
Question: I would like to choose a value from this optgroup, which should then give me a dropdown of links. <div class="searchbar"> <select id="q" multiple="" tabindex="-1" class="select2-hidden-accessible" aria-hidden="true"> <option></option> <option class="q-all-text" value="al:all">Search all text</option> <optgroup label="Business Type"> <option value="bt:Buyer">Buyer</option> <option value="bt:Farmer/Rancher">Farmer/Rancher</option> <option value="bt:Farmers Market">Farmers Market</option> <option
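
Though the markup is cut off above, options inside an <optgroup> are still ordinary <option> children of the <select>, so Selenium's Select helper can pick them by value. A sketch under that assumption; because this is a Select2 widget (class "select2-hidden-accessible"), the real element is hidden and may need to be revealed first:

```python
# Sketch: the URL is a placeholder; the element id 'q' and the option
# values come from the markup quoted in the question.
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://example.com/search')

select_el = driver.find_element_by_id('q')

# Select2 hides the underlying <select>; un-hide it so Selenium will
# interact with it (an assumption -- behavior depends on the page):
driver.execute_script("arguments[0].style.display = 'block';", select_el)

# Options inside an <optgroup> are addressed like any other option:
Select(select_el).select_by_value('bt:Farmers Market')
```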

crawl a list of sites one by one with scrapy

Submitted by ∥☆過路亽.° on 2020-01-15 10:33:51
Question: I am trying to crawl a list of sites with Scrapy. I tried putting the list of website URLs into start_urls, but then found I couldn't afford that much memory. Is there any way to set Scrapy to crawl one or two sites at a time?

Answer 1: You can try using CONCURRENT_REQUESTS = 1 so that you don't get overloaded with data: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests

Answer 2: You can define a start_requests method which iterates through requests to your URLs.
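
Combining both answers into one sketch: cap concurrency in the spider's settings and generate start requests lazily, so the full URL list never has to sit in memory at once ('sites.txt' is a placeholder filename):

```python
import scrapy

class OneByOneSpider(scrapy.Spider):
    name = 'one_by_one'
    # Answer 1: process at most one request at a time.
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    # Answer 2: a generator yields requests lazily instead of loading
    # every URL into start_urls up front.
    def start_requests(self):
        with open('sites.txt') as f:          # one URL per line
            for line in f:
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        pass  # per-site scraping logic elided
```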

How can I get MediaWiki to ignore page views from a Google Search Appliance?

Submitted by 两盒软妹~` on 2020-01-14 20:41:52
Question: The page view counter on each MediaWiki page seems like a great way to identify popular pages that are worth the effort of keeping up-to-date and useful, but I've hit a problem. We use a Google Search Appliance to index our MediaWiki installation. The problem is that the GSA increments the page view counter each time it crawls a page. This completely dominates the statistics, swamping the views made by real users. I know how to reset the page counters to start again.

Scrapy is following and scraping non-allowed links

Submitted by 痞子三分冷 on 2020-01-14 08:58:54
Question: I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme: http://example.com/YYYY/DDDD/index.htm where YYYY is the year and DDDD is the three- or four-digit issue number. I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider nevertheless seems determined to follow non-allowed links.
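
The asker's rules aren't shown above, so the following is only a sketch of one way to enforce the 928 cutoff: let a broad LinkExtractor match the URL scheme and filter the extracted links with process_links, since allow/deny patterns alone cannot easily express a numeric threshold (domain and names are placeholders):

```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class IssueSpider(CrawlSpider):
    name = 'issues'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/\d{4}/\d{3,4}/index\.htm'),
            process_links='keep_recent_issues',
            callback='parse_issue',
        ),
    )

    def keep_recent_issues(self, links):
        # Drop any issue whose DDDD number is below 928.
        for link in links:
            m = re.search(r'/(\d{3,4})/index\.htm', link.url)
            if m and int(m.group(1)) >= 928:
                yield link

    def parse_issue(self, response):
        pass  # item extraction elided
```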