web-crawler

Selenium with PhantomJS: Form being validated but not submitted

Submitted by 家住魔仙堡 on 2020-01-15 18:48:34
Question: I'm having a strange problem submitting a form through Selenium WebDriver's PhantomJS API. Upon clicking the submit button, the form gets validated (are the username and password too short, or blank, etc.), but it never actually gets submitted. That is, if I submit an invalid form and check the screenshot, there are alert notifications; if I submit a valid form, nothing happens. The JS on the page is supposed to validate the form and then submit it when the submit button is clicked.
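
The question is truncated above, but a common workaround for this symptom is to submit the form element directly, or to wait explicitly after the click, since PhantomJS can race ahead of the page's JavaScript. A minimal sketch, assuming hypothetical element IDs and a placeholder URL:

```python
# Sketch only: 'login-form', 'username', 'password', and the URL are
# illustrative placeholders, not taken from the original question.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()  # PhantomJS support was removed in Selenium 4
driver.get('http://example.com/login')

driver.find_element_by_id('username').send_keys('user')
driver.find_element_by_id('password').send_keys('secret')

# Submitting the form element can work where click() appears to do nothing:
driver.find_element_by_id('login-form').submit()

# Wait for navigation instead of assuming the submit completed instantly:
WebDriverWait(driver, 10).until(EC.url_changes('http://example.com/login'))
driver.save_screenshot('after_submit.png')
```

Note the trade-off: form.submit() bypasses the button's click handler, so if the validation is bound to the click it will be skipped; sometimes the explicit wait alone is the actual fix.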

DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname

Submitted by 自闭症网瘾萝莉.ら on 2020-01-15 12:24:11
Question: This question is an extension of the resolved question here, i.e. Crawling LinkedIn while authenticated with Scrapy (@Gates). I keep the base of the script the same, only adding my own session_key and session_password, and changing the start URL for my use-case, as below. class LinkedPySpider(InitSpider): name = 'Linkedin' allowed_domains = ['linkedin.com'] login_page = 'https://www.linkedin.com/uas/login' start_urls=[
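
The error message shows Scrapy failing to resolve the literal placeholder hostname 'your.proxy.com', which strongly suggests a copied proxy setting was never replaced with a real one. A minimal sketch of the two usual fixes (the proxy address below is a placeholder):

```python
# Sketch: either clear a leftover proxy from the environment, or point
# requests at a real proxy. 'real.proxy.example:8080' is a placeholder.
import os

# 1) Scrapy's HttpProxyMiddleware picks up http_proxy/https_proxy from the
#    environment; clear them if a placeholder value snuck in there:
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)

# 2) Or, inside the spider, route a request through an actual proxy:
#    yield Request(url, meta={'proxy': 'http://real.proxy.example:8080'})
```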

“TypeError: 'Rule' object is not iterable” webscraping an .aspx page in python

Submitted by 别等时光非礼了梦想. on 2020-01-15 11:17:09
Question: I am using the following code to scrape this website (http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1); however, I get the following TypeError: "File "C:\Users\Anaconda2\lib\site-packages\scrapy\contrib\spiders\crawl.py", line 83, in _compile_rules self._rules = [copy.copy(r) for r in self.rules] TypeError: 'Rule' object is not iterable". I don't have any code of my own on line 83 (the traceback points into Scrapy itself), so I'm wondering if anyone has ideas on how to resolve the issue. I'm using Python 2.7.
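
The traceback points into Scrapy's own _compile_rules, which iterates over self.rules; a bare Rule assigned directly is not iterable, which is the usual cause of this exact error. A sketch using the era-appropriate scrapy.contrib imports seen in the traceback (the link pattern and names are placeholders):

```python
from scrapy.contrib.spiders import CrawlSpider, Rule   # scrapy.spiders in modern Scrapy
from scrapy.contrib.linkextractors import LinkExtractor

class PhysicianSpider(CrawlSpider):
    name = 'physicians'
    allowed_domains = ['profiles.ehs.state.ma.us']
    start_urls = ['http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1']

    # Wrong -- a bare Rule is not iterable:
    #   rules = Rule(LinkExtractor(allow=r'Page=\d+'))
    # Right -- wrap it in a tuple (the trailing comma matters):
    rules = (
        Rule(LinkExtractor(allow=r'Page=\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        pass  # extraction logic elided
```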

Selecting an option from an optgroup with selenium and python

Submitted by 好久不见. on 2020-01-15 11:06:33
Question: I would like to choose a value from this optgroup, which should then give me a dropdown of links. <div class="searchbar"> <select id="q" multiple="" tabindex="-1" class="select2-hidden-accessible" aria-hidden="true"> <option></option> <option class="q-all-text" value="al:all">Search all text</option> <optgroup label="Business Type"> <option value="bt:Buyer">Buyer</option> <option value="bt:Farmer/Rancher">Farmer/Rancher</option> <option value="bt:Farmers Market">Farmers Market</option> <option
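
Though the markup is cut off above, options inside an <optgroup> are still ordinary <option> children of the <select>, so Selenium's Select helper can pick them by value. A sketch under that assumption; because this is a Select2 widget (class "select2-hidden-accessible"), the real element is hidden and may need to be revealed first:

```python
# Sketch: the URL is a placeholder; the element id 'q' and the option
# values come from the markup quoted in the question.
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://example.com/search')

select_el = driver.find_element_by_id('q')

# Select2 hides the underlying <select>; un-hide it so Selenium will
# interact with it (an assumption -- behavior depends on the page):
driver.execute_script("arguments[0].style.display = 'block';", select_el)

# Options inside an <optgroup> are addressed like any other option:
Select(select_el).select_by_value('bt:Farmers Market')
```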

crawl a list of sites one by one with scrapy

Submitted by ∥☆過路亽.° on 2020-01-15 10:33:51
Question: I am trying to crawl a list of sites with Scrapy. I tried putting the list of website URLs into start_urls, but then found I couldn't afford that much memory. Is there any way to set Scrapy to crawl one or two sites at a time?

Answer 1: You can try using CONCURRENT_REQUESTS = 1 so that you don't get overloaded with data: http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests

Answer 2: You can define a start_requests method which iterates through requests to your URLs.
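
Combining both answers into one sketch: cap concurrency in the spider's settings and generate start requests lazily, so the full URL list never has to sit in memory at once ('sites.txt' is a placeholder filename):

```python
import scrapy

class OneByOneSpider(scrapy.Spider):
    name = 'one_by_one'
    # Answer 1: process at most one request at a time.
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    # Answer 2: a generator yields requests lazily instead of loading
    # every URL into start_urls up front.
    def start_requests(self):
        with open('sites.txt') as f:          # one URL per line
            for line in f:
                yield scrapy.Request(line.strip(), callback=self.parse)

    def parse(self, response):
        pass  # per-site scraping logic elided
```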

How can I get MediaWiki to ignore page views from a Google Search Appliance?

Submitted by 两盒软妹~` on 2020-01-14 20:41:52
Question: The page view counter on each MediaWiki page seems like a great way to identify popular pages that are worth the effort of keeping up-to-date and useful, but I've hit a problem. We use a Google Search Appliance to index our MediaWiki installation. The problem is that the GSA increments the page view counter each time it crawls a page. This completely dominates the statistics, swamping the views made by real users. I know how to reset the page counters to start again.

Scrapy is following and scraping non-allowed links

Submitted by 痞子三分冷 on 2020-01-14 08:58:54
Question: I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme: http://example.com/YYYY/DDDD/index.htm where YYYY is the year and DDDD is the three- or four-digit issue number. I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider nevertheless seems determined to follow non-allowed links.
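
The asker's rules aren't shown above, so the following is only a sketch of one way to enforce the 928 cutoff: let a broad LinkExtractor match the URL scheme and filter the extracted links with process_links, since allow/deny patterns alone cannot easily express a numeric threshold (domain and names are placeholders):

```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class IssueSpider(CrawlSpider):
    name = 'issues'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/\d{4}/\d{3,4}/index\.htm'),
            process_links='keep_recent_issues',
            callback='parse_issue',
        ),
    )

    def keep_recent_issues(self, links):
        # Drop any issue whose DDDD number is below 928.
        for link in links:
            m = re.search(r'/(\d{3,4})/index\.htm', link.url)
            if m and int(m.group(1)) >= 928:
                yield link

    def parse_issue(self, response):
        pass  # item extraction elided
```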