web-crawler

Scrapy run multiple spiders from a script

Submitted by 雨燕双飞 on 2021-01-29 15:53:31
Question: Hey, the following question: I have a script from which I want to start my Scrapy spiders. For that I used a solution from another Stack Overflow post to integrate the project settings so I don't have to override them manually. So far I am able to start two crawlers from outside the Scrapy project:

from scrapy_bots.update_Database.update_Database.spiders.m import M
from scrapy_bots.update_Database.update_Database.spiders.p import P
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project
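The snippet is cut off here, but the usual pattern is short. A minimal sketch, assuming the spider classes M and P import cleanly and the project's settings module is discoverable (otherwise get_project_settings() only returns defaults):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Spider import paths copied from the question; adjust them to your own layout.
from scrapy_bots.update_Database.update_Database.spiders.m import M
from scrapy_bots.update_Database.update_Database.spiders.p import P

def run():
    # get_project_settings() needs scrapy.cfg / SCRAPY_SETTINGS_MODULE to be findable;
    # if the script lives outside the project, pass a plain settings dict instead.
    process = CrawlerProcess(get_project_settings())
    process.crawl(M)   # schedule the first spider
    process.crawl(P)   # schedule the second spider
    process.start()    # start the Twisted reactor; blocks until both spiders finish

if __name__ == "__main__":
    run()

Both crawl() calls are scheduled before start(), so the two spiders run concurrently inside one reactor.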

How to extract javascript links in an HTML document?

Submitted by 匆匆过客 on 2021-01-29 06:11:20
Question: I am writing a small web spider for a website that uses a lot of JavaScript for its links:

<htmlTag onclick="someFunction();">Click here</htmlTag>

where the function looks like:

function someFunction() {
    var _url;
    ... // _url constructed, maybe with reference to a value in the HTML doc
        // and/or a value passed as argument(s) to this function
    window.location.href = _url;
}

What is the best way of evaluating this function server-side so I can construct the value of _url?

Answer 1: You could also
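The answer is truncated above; as a rough sketch of one approach (not necessarily the answerer's), you can scan the onclick handlers for string literals that look like URLs, assuming _url is built from static pieces. The sample markup below is made up for illustration:

import re
from bs4 import BeautifulSoup

html = '<div onclick="window.location.href = \'/products/42\';">Click here</div>'
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(attrs={"onclick": True}):
    handler = tag["onclick"]
    # Pull quoted strings that look like absolute URLs or site-relative paths
    # out of the handler text (or out of the function body it calls).
    for candidate in re.findall(r"""['"](https?://[^'"]+|/[^'"]*)['"]""", handler):
        print(candidate)

If _url is genuinely computed at runtime (from arguments or other DOM values), pattern matching breaks down, and driving a real browser, e.g. with Selenium, and clicking the element is the more reliable route.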

Why does Selenium get child elements so slowly?

Submitted by 可紊 on 2021-01-29 02:13:21
Question: For example, this HTML: <input type="hidden" name="ie" value="utf-8"> has no child elements. When I use this code:

List<WebElement> childElements = ele.findElements(By.xpath("./*"));

the program takes a very long time (about 30 s) to return a result. The result size is correct, which is zero. How can I resolve this problem? Thanks.

Answer 1: As per the documentation, the findElements() method is affected by the implicit wait duration in force at the time of execution. When implicitly waiting,
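The answer is cut off mid-sentence, but the gist is that an empty findElements() lookup waits out the full implicit-wait timeout before giving up. The question's code is Java; the same workaround reads almost identically with Selenium's Python bindings, sketched here with a placeholder URL and timings:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(30)          # a global wait like this makes empty lookups take ~30 s
driver.get("https://example.com")   # placeholder page

ele = driver.find_element(By.NAME, "ie")

driver.implicitly_wait(0)           # temporarily disable the implicit wait
children = ele.find_elements(By.XPATH, "./*")   # returns [] immediately when there are no children
driver.implicitly_wait(30)          # restore the original wait for the rest of the script

print(len(children))
driver.quit()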

How to Specify different Process settings for two different spiders in CrawlerProcess Scrapy?

Submitted by 房东的猫 on 2021-01-28 16:42:05
Question: I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Though two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider completes crawling before the second one, I get the desired
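One arrangement that is commonly suggested for this (and may or may not be what the poster settled on) is to give each spider its own feed export via custom_settings, so a single CrawlerProcess writes to two separate files. Spider names, URLs, and filenames below are placeholders, and newer Scrapy versions spell the setting FEEDS rather than FEED_URI:

import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderOne(scrapy.Spider):
    name = "spider_one"
    start_urls = ["https://example.com/a"]
    custom_settings = {"FEEDS": {"output_one.json": {"format": "json"}}}

    def parse(self, response):
        yield {"url": response.url}

class SpiderTwo(scrapy.Spider):
    name = "spider_two"
    start_urls = ["https://example.com/b"]
    custom_settings = {"FEEDS": {"output_two.json": {"format": "json"}}}

    def parse(self, response):
        yield {"url": response.url}

process = CrawlerProcess()
process.crawl(SpiderOne)   # both spiders are scheduled before start()
process.crawl(SpiderTwo)
process.start()            # the process exits only after both spiders finish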

500 Internal Server Error in Scrapy

Submitted by 你。 on 2021-01-28 05:26:33
Question: I am using Scrapy to crawl a product website with over 4 million products. However, after crawling around 50k products it starts throwing HTTP 500 errors. I have set AutoThrottle to false because with it enabled the crawl is very slow and would take around 20-25 days to complete. I think the server starts blocking the crawler temporarily after some time. Any solutions for what can be done? I am using a sitemap crawler - I want to extract some information from the URL itself if the server is not
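The question is truncated, but the usual first aids for intermittent 500s on a long crawl are to slow down per-domain traffic and to retry server-side failures. A hedged sketch of settings that are commonly tuned here (the values are guesses, and re-enabling AutoThrottle remains an option if the blocking persists); these go in settings.py or a spider's custom_settings:

CONCURRENT_REQUESTS = 8               # keep overall parallelism modest
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # be gentler to the single target host
DOWNLOAD_DELAY = 0.5                  # fixed pause between requests to the same domain
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]   # retry temporary server-side failures
RETRY_TIMES = 5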

YouTube Data API to crawl all comments and replies

Submitted by 允我心安 on 2021-01-18 06:52:50
Question: I have been desperately seeking a solution to crawl all comments and corresponding replies for my research. I am having a very hard time creating a data frame that includes the comment data in the correct and corresponding order. I am going to share my code here so you professionals can take a look and give me some insights.

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()
    while results:
        for item in results['items']:
            comment = item[
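The snippet breaks off inside the loop; here is a sketch of how that pagination loop is usually completed against the YouTube Data API v3, assuming a google-api-python-client service object, and treating the field access as illustrative rather than the poster's exact code:

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()
    while results:
        for item in results["items"]:
            top = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            comments.append(top)
            # Replies included in the thread resource ride along in the same item;
            # the API caps how many replies a commentThread carries, so a complete
            # reply dump needs a separate comments().list(parentId=...) call.
            for reply in item.get("replies", {}).get("comments", []):
                comments.append(reply["snippet"]["textDisplay"])
        # Follow nextPageToken until the API stops returning one.
        if "nextPageToken" in results:
            kwargs["pageToken"] = results["nextPageToken"]
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return comments

# Example call (the video ID is a placeholder):
# get_video_comments(service, part="snippet,replies", videoId="VIDEO_ID", textFormat="plainText")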

Limited amount of scraped data?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-07 02:51:06
Question: I am scraping a website and everything seems to work fine from today's news back to news published in 2015/2016. Beyond those years, I am not able to scrape news. Could you please tell me if anything has changed? I should get 672 pages of titles and snippets from this page: https://catania.liveuniversity.it/attualita/ but I have got approx. 158. The code that I am using is:

import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11
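The code is truncated above; for reference, a rough version of the pagination loop, assuming the archive is reachable at .../attualita/page/N/ (a common WordPress layout) and that each article title sits in an <h2> inside an <article> element - both assumptions need checking against the live markup:

import bs4
import requests

headers = {"User-Agent": "Mozilla/5.0"}          # trimmed placeholder UA string
base = "https://catania.liveuniversity.it/attualita/page/{}/"

titles = []
page = 1
while True:
    resp = requests.get(base.format(page), headers=headers, timeout=30)
    if resp.status_code != 200:                  # most archives answer 404 past the last page
        break
    soup = bs4.BeautifulSoup(resp.text, "html.parser")
    found = [h.get_text(strip=True) for h in soup.select("article h2")]
    if not found:                                # also stop if a page renders but is empty
        break
    titles.extend(found)
    page += 1

print(len(titles), "titles collected")

If the count still stops early, compare what the browser shows on those later pages with what requests receives; the difference is usually in the markup or the server's response rather than in the loop itself.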

Send parallel requests but only one per host with HttpClient and Polly to gracefully handle 429 responses

Submitted by 空扰寡人 on 2020-12-31 04:31:08
Question: Intro: I am building a single-node web crawler, as a .NET Core console application, simply to validate that URLs return 200 OK. I have a collection of URLs at different hosts to which I am sending requests with HttpClient. I am fairly new to using Polly and TPL Dataflow. Requirements: I want to support sending multiple HTTP requests in parallel with a configurable MaxDegreeOfParallelism. I want to limit the number of parallel requests to any given host to 1 (or make it configurable). This is in order to
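The requirements list is cut off, and the original question is .NET (HttpClient with Polly and TPL Dataflow). Purely to illustrate the per-host limiting idea, here is a hedged sketch in Python with asyncio and aiohttp, not a translation of the poster's code: a global semaphore caps overall parallelism, and one semaphore per host keeps concurrent requests to any single host at 1.

import asyncio
from urllib.parse import urlsplit

import aiohttp

MAX_PARALLEL = 10   # stand-in for MaxDegreeOfParallelism
PER_HOST = 1        # at most one in-flight request per host

async def check(url, session, global_sem, host_sems):
    host = urlsplit(url).netloc
    host_sem = host_sems.setdefault(host, asyncio.Semaphore(PER_HOST))
    async with global_sem, host_sem:
        async with session.get(url) as resp:
            return url, resp.status   # 200 means the URL validated

async def main(urls):
    global_sem = asyncio.Semaphore(MAX_PARALLEL)
    host_sems = {}
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(check(u, session, global_sem, host_sems) for u in urls)
        )
    for url, status in results:
        print(status, url)

if __name__ == "__main__":
    asyncio.run(main(["https://example.com/", "https://example.org/"]))

The 429 handling from the title would sit inside check(), e.g. sleeping and retrying when resp.status == 429; in the .NET original that role is played by a Polly retry policy.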

How to get all links from the DOM?

Submitted by 时光怂恿深爱的人放手 on 2020-12-29 06:51:22
Question: According to https://github.com/GoogleChrome/puppeteer/issues/628, I should be able to get all links from <a href="xyz"> with this single line:

const hrefs = await page.$$eval('a', a => a.href);

But when I try a simple console.log(hrefs) I only get:

http://example.de/index.html

... as output, which means it could only find one link? But the page definitely has 12 links in the source code / DOM. Why does it fail to find them all? Minimal example:

'use strict';
const puppeteer = require(
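The minimal example is cut off; the issue as it is usually diagnosed is that $$eval hands the callback the whole array of matched elements, so the callback has to map over it (e.g. page.$$eval('a', as => as.map(a => a.href)) on the Puppeteer side). As a sketch of the same idea from Python with Playwright, which is my substitution rather than anything in the question:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://example.de/index.html")   # URL taken from the question's output
    # Evaluate one JS function over all matched <a> elements and collect their hrefs.
    hrefs = page.eval_on_selector_all("a", "els => els.map(e => e.href)")
    print(hrefs)   # expected: all the links, not just the first
    browser.close()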

Scrapy parse pagination without next link

Submitted by 删除回忆录丶 on 2020-12-13 03:36:41
Question: I'm trying to parse a pagination without a next link. The HTML is below:

<div id="pagination" class="pagination">
  <ul>
    <li>
      <a href="//www.demopage.com/category_product_seo_name" class="page-1 ">1</a>
    </li>
    <li>
      <a href="//www.demopage.com/category_product_seo_name?page=2" class="page-2 ">2</a>
    </li>
    <li>
      <a href="//www.demopage.com/category_product_seo_name?page=3" class="page-3 ">3</a>
    </li>
    <li>
      <a href="//www.demopage.com/category_product_seo_name?page=4" class="page-4 active">4</a>
    </li>
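The markup is cut off after page 4, but pagers like this are usually handled by following every numbered link and letting Scrapy's duplicate filter drop pages already seen. A minimal sketch, with the spider name, item extraction, and start URL as placeholders taken from the snippet:

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://www.demopage.com/category_product_seo_name"]

    def parse(self, response):
        # ... extract the items on the current page here ...

        # Queue every page the pager exposes; the built-in dupe filter skips URLs
        # already crawled, so this also works when only a window of page numbers
        # is visible at a time (1..4, then 2..5, and so on).
        for href in response.css("#pagination a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)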