web-crawler

Close all goroutines when HTTP request is cancelled

大城市里の小女人 submitted on 2020-01-14 06:50:12
Question: I am making a web crawler. I pass a URL to a crawler function and parse the page to get all the links in the anchor tags, then I invoke the same crawler function for all those URLs, using a separate goroutine for every URL. But if I send a request and cancel it before I get the response, all the goroutines for that particular request keep running. What I want is that when I cancel the request, all the goroutines that were spawned for that request stop. Please guide. Following …
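
The Go-idiomatic answer is to derive a context.Context from the incoming request and have every spawned goroutine return as soon as that context is cancelled. As a language-agnostic sketch of that pattern (shown here in Python, like the other sketches in this digest), each worker checks a shared cancellation flag before fetching or spawning more work; fetch_links() is a hypothetical stand-in for the real HTTP GET plus anchor extraction.

```python
# Sketch only: the pattern of propagating cancellation to every worker.
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_links(url):
    # hypothetical placeholder for "GET the page and extract anchor hrefs"
    return []

def crawl(url, cancelled, pool, depth=2):
    if cancelled.is_set() or depth == 0:
        return                                  # request cancelled: stop right away
    for link in fetch_links(url):
        if cancelled.is_set():
            return                              # stop spawning further workers
        pool.submit(crawl, link, cancelled, pool, depth - 1)

cancelled = threading.Event()
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.submit(crawl, "http://example.com", cancelled, pool)
    # when the original HTTP request is cancelled:
    cancelled.set()                             # every crawl() call exits at its next check
```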

How to prevent Scrapy from URL encoding request URLs

让人想犯罪 __ submitted on 2020-01-13 08:44:11
Question: I would like Scrapy not to URL-encode my requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES, but I am not sure how to do that from within my spider class. The relevant line in scrapy.http.Request: fp.update(canonicalize_url(request.url)). canonicalize_url is from scrapy.utils.url; the relevant line in scrapy.utils.url: path = safe_url_string(_unquotepath …
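
A hedged sketch of one way to attack this, given that the question already identified w3lib.url._ALWAYS_SAFE_BYTES as the knob: monkey-patch that private constant at import time, before any Request is built. The constant's name and type vary across w3lib versions (older releases use _safe_chars), and if another module has already imported it by value the rebinding may have no effect, so treat this as a fragile workaround rather than a supported API. The extra characters below are just an illustration.

```python
# Fragile workaround: widen w3lib's set of "safe" characters before Scrapy
# builds any requests. Private, version-dependent attribute; adjust the name
# and the bytes-vs-str handling to match your installed w3lib.
import w3lib.url
import scrapy

w3lib.url._ALWAYS_SAFE_BYTES = w3lib.url._ALWAYS_SAFE_BYTES + b"[]"  # e.g. keep [ and ] unescaped


class NoEscapeSpider(scrapy.Spider):
    name = "noescape"
    start_urls = ["http://example.com/path[with]brackets"]  # placeholder URL

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```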

Can I store the HTML content of a web page in StormCrawler?

眉间皱痕 submitted on 2020-01-13 06:57:09
Question: I am using storm-crawler-elastic. I can see the fetched URLs and their status. Changing the configuration in the ES_IndexInit.sh file gives only url, title, host, and text. But can I store the entire HTML content, with the HTML tags? Answer 1: The ES IndexerBolt gets the content of pages from the ParseFilter but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it. Alternatively, you could implement a custom …

Logic for Implementing a Dynamic Web Scraper in C#

扶醉桌前 submitted on 2020-01-13 06:21:31
Question: I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows: get the URL from the user; load the web page in the IE UI control (embedded browser) in WinForms; allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page. When the user wishes to persist the location (the HTML DOM location), it has to be persisted into the DB, so that the user may use that location to fetch the data at that location during his …
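
The question is about C#/WinForms and the IE control, but the core idea (turn the user's selection into a stable DOM location, store that string in the DB, and re-apply it to a freshly loaded page later) can be sketched in a language-agnostic way. Below is a Python/lxml illustration of that idea, assuming an absolute XPath is an acceptable way to persist the location; the HTML snippets are invented.

```python
# Sketch: persist a DOM location as an absolute XPath, then reuse it later.
from lxml import html

def xpath_for(element):
    """Absolute XPath of an element, used as the persisted 'DOM location'."""
    return element.getroottree().getpath(element)

doc = html.fromstring("<html><body><div><p>price: <b>42</b></p></div></body></html>")
selected = doc.xpath("//b")[0]          # the element the user 'selected'
saved_location = xpath_for(selected)    # e.g. '/html/body/div/p/b' -> store in the DB

# Later: reload the page and re-apply the stored location to fetch fresh data.
fresh = html.fromstring("<html><body><div><p>price: <b>43</b></p></div></body></html>")
print(fresh.xpath(saved_location)[0].text_content())   # -> '43'
```

Absolute XPaths break easily when the page layout changes, so a selector anchored on an id or class is usually the more robust thing to persist.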

Scrapy and XPath to extract data from JavaScript code

丶灬走出姿态 submitted on 2020-01-13 05:57:08
Question: I am in the process of learning and simultaneously building a web spider using Scrapy. I need help with extracting some information from the following JavaScript code: <script language="JavaScript" type="text/javascript+gk-onload"> SKART = (SKART) ? SKART : {}; SKART.analytics = SKART.analytics || {}; SKART.analytics["category"] = "television"; SKART.analytics["vertical"] = "television"; SKART.analytics["supercategory"] = "homeentertainmentlarge"; SKART.analytics["subcategory"] = "television" …
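
A minimal sketch of one way to pull those values out: the SKART.analytics assignments are plain string literals inside the <script> text, so you can select the script with XPath and then apply a regular expression. The spider name and start URL below are placeholders.

```python
import re
import scrapy

# matches lines like: SKART.analytics["category"] = "television";
ASSIGNMENT_RE = re.compile(r'SKART\.analytics\["([^"]+)"\]\s*=\s*"([^"]*)"')

class SkartSpider(scrapy.Spider):
    name = "skart"
    start_urls = ["https://example.com/some-product-page"]  # placeholder

    def parse(self, response):
        # every script block that mentions SKART.analytics
        for script in response.xpath(
            '//script[contains(., "SKART.analytics")]/text()'
        ).getall():
            yield dict(ASSIGNMENT_RE.findall(script))  # {'category': 'television', ...}
```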

Visit Half a Million Pages with Perl

ぃ、小莉子 submitted on 2020-01-13 05:20:07
Question: Currently I'm using Mechanize and the get() method to fetch each site, and checking each main page for something with the content() method. I have a very fast computer and a 10 Mbit connection, and it still took 9 hours to check 11K sites, which is not acceptable. The problem is the speed of the get() function, which obviously needs to fetch the page. Is there any way to make it faster, maybe by disabling something? I only need the main-page HTML to be checked. Thanks. Answer 1: Make queries in parallel instead …
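
The advice ("make queries in parallel") is language-agnostic even though the question is about Perl's Mechanize. As an illustration in Python (the language used for the other sketches in this digest), a thread pool that fetches front pages concurrently and checks each body looks roughly like this; the URL list, worker count, and search string are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception as exc:            # keep going even if single sites fail
        return url, exc

urls = ["http://example.com", "http://example.org"]    # ~11K entries in practice

with ThreadPoolExecutor(max_workers=50) as pool:       # tune to your bandwidth
    for url, body in pool.map(fetch, urls):
        if isinstance(body, bytes) and b"something" in body:
            print("match:", url)
```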

If I do everything on my page with Ajax, how can I do Search Engine Optimization?

杀马特。学长 韩版系。学妹 submitted on 2020-01-12 08:58:42
Question: What is the relationship between crawlers and Ajax applications? Do web crawlers or browsers read dynamically created meta tags? I thought about: adding anchors to the page, creating permalinks to the content, dynamically adding meta tags. http://code.google.com/web/ajaxcrawling/docs/learn-more.html Answer 1: Update on how Google handles SEO with JavaScript: https://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157 which seems pretty good at this point, so I would ignore …

Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

青春壹個敷衍的年華 submitted on 2020-01-12 07:42:04
Question: I have the following code that is partially working: class ThreadSpider(CrawlSpider): name = 'thread' allowed_domains = ['bbs.example.com'] start_urls = ['http://bbs.example.com/diy'] rules = ( Rule(LinkExtractor( allow=(), restrict_xpaths=("//a[contains(text(), 'Next Page')]") ), callback='parse_item', process_request='start_requests', follow=True), ) def start_requests(self): for url in self.start_urls: yield SplashRequest(url, self.parse_item, args={'wait': 0.5}) def parse_item(self, …
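
One visible problem in the snippet is that process_request='start_requests' points the Rule at the spider's own start_requests method, which is not a request processor. A common workaround (a sketch only, assuming scrapy-splash is installed and configured; CrawlSpider and scrapy-splash have other known interaction issues, e.g. around response types and duplicate filtering) is to give the Rule a small method that re-issues each extracted link as a SplashRequest while preserving the original callback and meta:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[contains(text(), 'Next Page')]"),
             callback='parse_item',
             process_request='use_splash',   # wrap extracted links in SplashRequest
             follow=True),
    )

    def use_splash(self, request, response=None):
        # Scrapy >= 2.0 passes (request, response); older versions pass only request.
        return SplashRequest(request.url,
                             callback=request.callback,   # keep CrawlSpider's callback chain
                             meta=request.meta,            # keep the rule bookkeeping
                             args={'wait': 0.5})

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        self.logger.info('visited %s', response.url)
```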

How to protect/monitor your site against crawling by a malicious user

三世轮回 submitted on 2020-01-12 04:06:12
Question: Situation: a site with content protected by username/password (not all accounts are controlled, since they can be trial/test users); a normal search engine can't get at it because of the username/password restrictions; a malicious user can still log in and pass the session cookie to a "wget -r" or something else. The question is what the best way is to monitor such activity and respond to it (given the site policy that no crawling/scraping is allowed). I can think of some options: set up some traffic …
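
As a rough illustration of the "traffic monitoring" option the excerpt starts to list: count requests per session over a sliding window and flag sessions whose rate looks automated (a logged-in human rarely sustains hundreds of page views per minute, which is exactly what "wget -r" with a stolen session cookie does). The thresholds below are invented and would need tuning against real traffic.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120          # hypothetical threshold

_hits = defaultdict(deque)             # session_id -> timestamps of recent requests

def record_request(session_id, now=None):
    """Record one request; return True if the session should be flagged."""
    now = now or time.time()
    window = _hits[session_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()               # drop hits outside the sliding window
    return len(window) > MAX_REQUESTS_PER_WINDOW
```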

Does Facebook know I'm scraping it with PhantomJS and can it change its website to counter me?

為{幸葍}努か submitted on 2020-01-11 12:58:26
Question: So, maybe I'm being paranoid. I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of my ads by querying the page for the text "Sponsored" with XPath inside of Phantom's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements. Things were working great for a few days and it was finding tons of ads. Then it stopped returning any results. When I logged into Facebook manually to inspect the elements again, I …