web-crawler

Close all goroutines when HTTP request is cancelled

大城市里の小女人 submitted on 2020-01-14 06:50:12
Question: I am making a web crawler. I pass a URL to a crawler function and parse the page to get all the links in the anchor tags, then I invoke the same crawler function for all those URLs, using a separate goroutine for every URL. But if I send a request and cancel it before I get the response, all the goroutines for that particular request keep running. What I want is that when I cancel the request, all the goroutines that were spawned for that request stop. Please guide. Following …
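
The Go-idiomatic answer is to derive a context.Context from the incoming request and have every spawned goroutine return as soon as that context is cancelled. As a language-agnostic sketch of that pattern (shown here in Python, like the other sketches in this digest), each worker checks a shared cancellation flag before fetching or spawning more work; fetch_links() is a hypothetical stand-in for the real HTTP GET plus anchor extraction.

```python
# Sketch only: the pattern of propagating cancellation to every worker.
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_links(url):
    # hypothetical placeholder for "GET the page and extract anchor hrefs"
    return []

def crawl(url, cancelled, pool, depth=2):
    if cancelled.is_set() or depth == 0:
        return                                  # request cancelled: stop right away
    for link in fetch_links(url):
        if cancelled.is_set():
            return                              # stop spawning further workers
        pool.submit(crawl, link, cancelled, pool, depth - 1)

cancelled = threading.Event()
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.submit(crawl, "http://example.com", cancelled, pool)
    # when the original HTTP request is cancelled:
    cancelled.set()                             # every crawl() call exits at its next check
```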

How to prevent Scrapy from URL encoding request URLs

让人想犯罪 __ submitted on 2020-01-13 08:44:11
Question: I would like Scrapy not to URL-encode my requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES, but I am not sure how to do that from within my spider class. The relevant line in scrapy.http.Request: fp.update(canonicalize_url(request.url)). canonicalize_url is from scrapy.utils.url; the relevant line in scrapy.utils.url: path = safe_url_string(_unquotepath …
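
A hedged sketch of one way to attack this, given that the question already identified w3lib.url._ALWAYS_SAFE_BYTES as the knob: monkey-patch that private constant at import time, before any Request is built. The constant's name and type vary across w3lib versions (older releases use _safe_chars), and if another module has already imported it by value the rebinding may have no effect, so treat this as a fragile workaround rather than a supported API. The extra characters below are just an illustration.

```python
# Fragile workaround: widen w3lib's set of "safe" characters before Scrapy
# builds any requests. Private, version-dependent attribute; adjust the name
# and the bytes-vs-str handling to match your installed w3lib.
import w3lib.url
import scrapy

w3lib.url._ALWAYS_SAFE_BYTES = w3lib.url._ALWAYS_SAFE_BYTES + b"[]"  # e.g. keep [ and ] unescaped


class NoEscapeSpider(scrapy.Spider):
    name = "noescape"
    start_urls = ["http://example.com/path[with]brackets"]  # placeholder URL

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```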

Can I store the HTML content of a web page in StormCrawler?

眉间皱痕 submitted on 2020-01-13 06:57:09
Question: I am using storm-crawler-elastic. I can see the fetched URLs and their status. Changing the configuration in the ES_IndexInit.sh file gives only url, title, host, and text. But can I store the entire HTML content, with the HTML tags? Answer 1: The ES IndexerBolt gets the content of pages from the ParseFilter but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it. Alternatively, you could implement a custom …

Logic for Implementing a Dynamic Web Scraper in C#

扶醉桌前 submitted on 2020-01-13 06:21:31
Question: I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows: get the URL from the user; load the web page in the IE UI control (embedded browser) in WinForms; allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page. When the user wishes to persist the location (the HTML DOM location), it has to be persisted into the DB, so that the user may use that location to fetch the data at that location during his …
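
The question is about C#/WinForms and the IE control, but the core idea (turn the user's selection into a stable DOM location, store that string in the DB, and re-apply it to a freshly loaded page later) can be sketched in a language-agnostic way. Below is a Python/lxml illustration of that idea, assuming an absolute XPath is an acceptable way to persist the location; the HTML snippets are invented.

```python
# Sketch: persist a DOM location as an absolute XPath, then reuse it later.
from lxml import html

def xpath_for(element):
    """Absolute XPath of an element, used as the persisted 'DOM location'."""
    return element.getroottree().getpath(element)

doc = html.fromstring("<html><body><div><p>price: <b>42</b></p></div></body></html>")
selected = doc.xpath("//b")[0]          # the element the user 'selected'
saved_location = xpath_for(selected)    # e.g. '/html/body/div/p/b' -> store in the DB

# Later: reload the page and re-apply the stored location to fetch fresh data.
fresh = html.fromstring("<html><body><div><p>price: <b>43</b></p></div></body></html>")
print(fresh.xpath(saved_location)[0].text_content())   # -> '43'
```

Absolute XPaths break easily when the page layout changes, so a selector anchored on an id or class is usually the more robust thing to persist.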

Scrapy and XPath to extract data from JavaScript code

丶灬走出姿态 submitted on 2020-01-13 05:57:08
Question: I am in the process of learning and simultaneously building a web spider using Scrapy. I need help with extracting some information from the following JavaScript code: <script language="JavaScript" type="text/javascript+gk-onload"> SKART = (SKART) ? SKART : {}; SKART.analytics = SKART.analytics || {}; SKART.analytics["category"] = "television"; SKART.analytics["vertical"] = "television"; SKART.analytics["supercategory"] = "homeentertainmentlarge"; SKART.analytics["subcategory"] = "television" …
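
A minimal sketch of one way to pull those values out: the SKART.analytics assignments are plain string literals inside the <script> text, so you can select the script with XPath and then apply a regular expression. The spider name and start URL below are placeholders.

```python
import re
import scrapy

# matches lines like: SKART.analytics["category"] = "television";
ASSIGNMENT_RE = re.compile(r'SKART\.analytics\["([^"]+)"\]\s*=\s*"([^"]*)"')

class SkartSpider(scrapy.Spider):
    name = "skart"
    start_urls = ["https://example.com/some-product-page"]  # placeholder

    def parse(self, response):
        # every script block that mentions SKART.analytics
        for script in response.xpath(
            '//script[contains(., "SKART.analytics")]/text()'
        ).getall():
            yield dict(ASSIGNMENT_RE.findall(script))  # {'category': 'television', ...}
```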

Visit Half a Million Pages with Perl

ぃ、小莉子 submitted on 2020-01-13 05:20:07
Question: Currently I'm using Mechanize and the get() method to fetch each site, and checking each main page for something with the content() method. I have a very fast computer and a 10 Mbit connection, and it still took 9 hours to check 11K sites, which is not acceptable. The problem is the speed of the get() function, which obviously needs to fetch the page. Is there any way to make it faster, maybe by disabling something? I only need the main-page HTML to be checked. Thanks. Answer 1: Make queries in parallel instead …
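
The advice ("make queries in parallel") is language-agnostic even though the question is about Perl's Mechanize. As an illustration in Python (the language used for the other sketches in this digest), a thread pool that fetches front pages concurrently and checks each body looks roughly like this; the URL list, worker count, and search string are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception as exc:            # keep going even if single sites fail
        return url, exc

urls = ["http://example.com", "http://example.org"]    # ~11K entries in practice

with ThreadPoolExecutor(max_workers=50) as pool:       # tune to your bandwidth
    for url, body in pool.map(fetch, urls):
        if isinstance(body, bytes) and b"something" in body:
            print("match:", url)
```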

If I do everything on my page with Ajax, how can I do Search Engine Optimization?

杀马特。学长 韩版系。学妹 submitted on 2020-01-12 08:58:42
Question: What is the relationship between crawlers and Ajax applications? Do web crawlers or browsers read dynamically created meta tags? I thought about: adding anchors to the page, creating permalinks to the content, dynamically adding meta tags. http://code.google.com/web/ajaxcrawling/docs/learn-more.html Answer 1: Update on how Google handles SEO with JavaScript: https://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157 which seems pretty good at this point, so I would ignore …

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

青春壹個敷衍的年華 submitted on 2020-01-12 07:42:04
Question: I have the following code that is partially working: class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.example.com/diy'] rules = ( Rule(LinkExtractor( allow=(), restrict_xpaths=("//a[contains(text(), 'Next Page')]") ), callback='parse_item', process_request='start_requests', follow=True), ) def start_requests(self): for url in self.start_urls: yield SplashRequest(url, self.parse_item, args={'wait': 0.5}) def parse_item(self, …
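
One visible problem in the snippet is that process_request='start_requests' points the Rule at the spider's own start_requests method, which is not a request processor. A common workaround (a sketch only, assuming scrapy-splash is installed and configured; CrawlSpider and scrapy-splash have other known interaction issues, e.g. around response types and duplicate filtering) is to give the Rule a small method that re-issues each extracted link as a SplashRequest while preserving the original callback and meta:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[contains(text(), 'Next Page')]"),
             callback='parse_item',
             process_request='use_splash',   # wrap extracted links in SplashRequest
             follow=True),
    )

    def use_splash(self, request, response=None):
        # Scrapy >= 2.0 passes (request, response); older versions pass only request.
        return SplashRequest(request.url,
                             callback=request.callback,   # keep CrawlSpider's callback chain
                             meta=request.meta,            # keep the rule bookkeeping
                             args={'wait': 0.5})

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        self.logger.info('visited %s', response.url)
```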

How to protect/monitor your site against crawling by a malicious user

三世轮回 submitted on 2020-01-12 04:06:12
Question: Situation: a site with content protected by username/password (not all accounts are controlled, since they can be trial/test users); a normal search engine can't get at it because of the username/password restrictions; a malicious user can still log in and pass the session cookie to a "wget -r" or something else. The question is what the best way is to monitor such activity and respond to it (given the site policy that no crawling/scraping is allowed). I can think of some options: set up some traffic …
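
As a rough illustration of the "traffic monitoring" option the excerpt starts to list: count requests per session over a sliding window and flag sessions whose rate looks automated (a logged-in human rarely sustains hundreds of page views per minute, which is exactly what "wget -r" with a stolen session cookie does). The thresholds below are invented and would need tuning against real traffic.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120          # hypothetical threshold

_hits = defaultdict(deque)             # session_id -> timestamps of recent requests

def record_request(session_id, now=None):
    """Record one request; return True if the session should be flagged."""
    now = now or time.time()
    window = _hits[session_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()               # drop hits outside the sliding window
    return len(window) > MAX_REQUESTS_PER_WINDOW
```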

Does Facebook know I'm scraping it with PhantomJS and can it change its website to counter me?

為{幸葍}努か submitted on 2020-01-11 12:58:26
Question: So, maybe I'm being paranoid. I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of my ads by querying the page for the text "Sponsored" with XPath inside of Phantom's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements. Things were working great for a few days and it was finding tons of ads. Then it stopped returning any results. When I logged into Facebook manually to inspect the elements again, I …