web-crawler

Python threading or multiprocessing for web-crawler?

一曲冷凌霜 submitted on 2020-01-06 07:16:47
Question: I've made a simple web crawler with Python. So far, all it does is maintain a set of URLs that should be visited and a set of URLs that have already been visited. While parsing a page it adds all the links on that page to the should-be-visited set and the page's URL to the already-visited set, and it keeps going while the length of should_be_visited is > 0. So far it does everything in one thread. Now I want to add parallelism to this application, so I need the same kind of sets of links shared between a few threads / processes,
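A minimal sketch of the shared-frontier idea the question is reaching for: one queue of URLs still to be visited and one set of URLs already visited, both safe to use from several workers at once. The question is about Python, but the structure is language-agnostic; the sketch below uses Java's concurrent collections, and fetchLinks is a hypothetical stand-in for the existing page-parsing code.

```java
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlFrontier {
    private final Queue<String> toVisit = new ConcurrentLinkedQueue<>();
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    public CrawlFrontier(String seed) {
        toVisit.add(seed);
    }

    public void crawl(int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String url;
                while ((url = toVisit.poll()) != null) {
                    // visited.add() returns false if another worker already claimed the URL,
                    // so each page is processed exactly once even when threads race.
                    if (!visited.add(url)) {
                        continue;
                    }
                    for (String link : fetchLinks(url)) {
                        if (!visited.contains(link)) {
                            toVisit.add(link);
                        }
                    }
                }
                // Note: termination is simplified here; a worker quits as soon as the queue
                // looks empty, even if another worker is still fetching and may add links.
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the real fetch-and-parse step of the crawler.
    private List<String> fetchLinks(String url) {
        return Collections.emptyList();
    }
}
```

In Python the equivalent building blocks would be queue.Queue plus a set guarded by a threading.Lock, or the corresponding multiprocessing structures if separate processes are preferred.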

StormCrawler cannot connect to ElasticSearch

一世执手 submitted on 2020-01-06 06:49:27
Question: While running the command: storm jar target/crawlIndexer-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 86400000 I get an error saying: 8710 [Thread-26-status-executor[4 4]] ERROR c.d.s.e.p.StatusUpdaterBolt - Can't connect to ElasticSearch When I open http://localhost:9200/ in a browser, Elasticsearch loads up successfully, and Kibana also connects to it, so it must just be the connection from StormCrawler to Elasticsearch. What could be the issue? Snippet of the full error: 8710
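The browser and Kibana reaching http://localhost:9200/ only prove Elasticsearch is up for those clients; the topology may be pointing at a different address (the endpoints are normally set in the topology's Elasticsearch configuration file) or be using an incompatible client version. As a hedged first step, a tiny standard-library probe can rule out plain network reachability from a JVM on the same machine:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Prints the HTTP status returned by Elasticsearch for a simple GET,
// using only the JDK, so it exercises the same network path a JVM topology would use.
public class EsPing {
    public static void main(String[] args) throws IOException {
        URL url = new URL(args.length > 0 ? args[0] : "http://localhost:9200/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        System.out.println(url + " -> HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

If this prints HTTP 200 but the bolt still fails, the mismatch is more likely in the configured addresses or in the Elasticsearch client/server versions than in the network.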

StormCrawler with SQL external module gets ParseFilters exception at crawl stage

浪尽此生 submitted on 2020-01-06 05:42:39
Question: I use StormCrawler with the SQL external module. I have updated my pom.xml with: <dependency> <groupId>com.digitalpebble.stormcrawler</groupId> <artifactId>storm-crawler-sql</artifactId> <version>1.8</version> </dependency> I use a similar injector/crawl procedure as in the ES setup: storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000 I have created a MySQL database crawl and a table urls, and successfully injected my URLs into it. For

AttributeError: 'NoneType' object has no attribute 'strip' with Python WebCrawler

你离开我真会死。 submitted on 2020-01-06 05:26:15
Question: I'm writing a Python program to crawl Twitter using a combination of urllib2, the Python Twitter wrapper for the API, and BeautifulSoup. However, when I run my program, I get an error of the following type: ray_krueger RafaelNadal Traceback (most recent call last): File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module> crawl(start_follower, output, depth) File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler

Apache Nutch 2.3.1 Website home page handling

情到浓时终转凉″ submitted on 2020-01-06 04:45:48
Question: I have configured Nutch 2.3.1 to crawl some news websites. Since the websites' homepages change from one day to the next, I want to handle the homepage differently, so that for the homepage only the main categories are crawled instead of the text, since the text will change after some time (I have observed something similar in Google). For the rest of the pages it is working fine (crawling text etc.). Answer 1: At the moment Nutch doesn't offer any special treatment for homepages; it is just one more URL to crawl. If

Handling special entities like &nbsp;, &pound; in HtmlCleaner

筅森魡賤 submitted on 2020-01-06 04:33:29
Question: I am using the HtmlCleaner library for HTML content extraction. It works fairly well but with a few limitations: it is not able to handle special characters like &pound; or quotes etc. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, when I give the XPath to the price, it gives me "&pound;" in place of £. Is there any property we can set in HtmlCleaner to handle this, or any other solution? Thanks, Jitendra. Answer 1: No, I don't believe
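For reference, HtmlCleaner's CleanerProperties exposes entity-handling switches that control how named entities such as &pound; are decoded; whether they resolve this particular page is untested here, but they are the obvious knobs to try. A minimal sketch, assuming the price is then read with the same XPath as before:

```java
import java.io.IOException;
import java.net.URL;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class EntityDemo {
    public static void main(String[] args) throws IOException {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setTranslateSpecialEntities(true); // decode named entities like &pound;
        props.setRecognizeUnicodeChars(true);    // keep the decoded characters as Unicode

        TagNode root = cleaner.clean(new URL(args[0]));
        // ... evaluate the price XPath against `root` as before
        System.out.println(root.getText());
    }
}
```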

How to avoid redirection of the webcrawler to the mobile edition?

时间秒杀一切 submitted on 2020-01-05 12:17:23
Question: I subclassed a CrawlSpider and want to extract data from a website. However, I always get redirected to the site's mobile version. I tried to change the USER_AGENT variable in Scrapy's settings to Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1, but I still get redirected. Is there another way to present a different client and avoid the redirection? Answer 1: There are two types of redirection supported in Scrapy: RedirectMiddleware - handles redirection of requests

Java how to find out if a URL is http or https?

ぃ、小莉子 submitted on 2020-01-05 09:33:33
Question: I am writing a web crawler tool in Java. When I type the website name, how can I make it connect to that site over http or https without my defining the protocol? try { Jsoup.connect("google.com").get(); } catch (IOException ex) { Logger.getLogger(LinkGUI.class.getName()).log(Level.SEVERE, null, ex); } But I get the error: java.lang.IllegalArgumentException: Malformed URL: google.com What can I do? Are there any classes or libraries that do this? What I'm trying to do is I have a
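One common workaround, sketched under the assumption that trying https first and falling back to plain http is acceptable: if the string already has a scheme, pass it through; otherwise prepend one before handing it to Jsoup. The fetch helper below is hypothetical, not part of Jsoup.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SchemeGuessingFetcher {
    // Jsoup needs an absolute URL, so add a scheme when the user omits one.
    static Document fetch(String site) throws IOException {
        if (site.matches("(?i)^https?://.*")) {
            return Jsoup.connect(site).get();               // scheme already given, use it as-is
        }
        try {
            return Jsoup.connect("https://" + site).get();  // prefer https
        } catch (IOException e) {
            return Jsoup.connect("http://" + site).get();   // fall back to plain http
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("google.com").title());
    }
}
```

Jsoup follows redirects by default, so even the plain-http attempt will usually end up on the https version of a site that forces TLS.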
