stormcrawler | 易学教程

Crawl URLs based on their priorities in StormCrawler

Question: I am working on a crawler based on the StormCrawler project. I have a requirement to crawl URLs based on their priorities. For example, I have two priority levels: HIGH and LOW. I want HIGH-priority URLs to be crawled as soon as possible, before LOW-priority URLs. How can I handle this requirement in Apache Storm and StormCrawler?
Answer 1: With Elasticsearch as a backend, you can configure the spouts to sort the URLs within a bucket by whichever
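As a rough illustration of what "sorting the URLs within a bucket" by priority would achieve, here is a minimal plain-Java sketch. It does not use any StormCrawler or Elasticsearch API; the `UrlEntry` class, the `Priority` enum, and the `nextFetchEpochMs` field are hypothetical names introduced only for this example, and the tie-break on fetch time is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PriorityOrder {

    // Hypothetical priority levels; declaration order defines sort order (HIGH first).
    public enum Priority { HIGH, LOW }

    // Hypothetical stand-in for a URL record in the status index.
    public static final class UrlEntry {
        public final String url;
        public final Priority priority;
        public final long nextFetchEpochMs;

        public UrlEntry(String url, Priority priority, long nextFetchEpochMs) {
            this.url = url;
            this.priority = priority;
            this.nextFetchEpochMs = nextFetchEpochMs;
        }
    }

    // Returns the URLs of one bucket, HIGH before LOW,
    // and earlier fetch time first within the same priority.
    public static List<String> order(List<UrlEntry> bucket) {
        List<UrlEntry> sorted = new ArrayList<>(bucket);
        sorted.sort(Comparator
                .comparing((UrlEntry e) -> e.priority)
                .thenComparingLong(e -> e.nextFetchEpochMs));
        List<String> urls = new ArrayList<>();
        for (UrlEntry e : sorted) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<UrlEntry> bucket = new ArrayList<>();
        bucket.add(new UrlEntry("http://example.com/low", Priority.LOW, 100L));
        bucket.add(new UrlEntry("http://example.com/high-late", Priority.HIGH, 300L));
        bucket.add(new UrlEntry("http://example.com/high-early", Priority.HIGH, 200L));
        System.out.println(order(bucket));
    }
}
```

In a real crawl the ordering would be done by the backend query rather than in memory; the sketch only shows the intended ordering.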

Storm Crawler with Java 11

Question: I am trying to update from Java 8 to Java 11 to compile and run StormCrawler. My question: is StormCrawler supported on Java 11? After I updated the Java version in my POM, the project built successfully, but when I tried to run it I got the following error while running the InjectorTopology: 560 [main] INFO c.a.h.c.InjectorTopology - ####### The Injector Topology Started ####### 563 [main] INFO c.a.h.c.u.PropertyFileReader -

Can I store the HTML content of a webpage in StormCrawler?

Question: I am using storm-crawler-elastic. I can see the fetched URLs and their statuses. A configuration change in the ES_IndexInit.sh file gives only url, title, host, and text. But can I store the entire HTML content, including the HTML tags?
Answer 1: The ES IndexerBolt gets the content of pages from the ParseFilter but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it. Alternatively, you could implement a custom

Explicit special characters from crawling

Question: I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I prevent the crawler from crawling/indexing special characters such as � � � � � �� •?
Answer 1: An easy way to do this is to write a ParseFilter along these lines: ParseData pd = parse.get(URL); String text = pd.getText(); /* remove chars */ pd.setText(text); This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples.
Source: https://stackoverflow.com/questions/54096045/explicit-special
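The "remove chars" step in the snippet above could look something like the following sketch. It is plain Java with no StormCrawler dependency; `TextCleaner` and `stripSpecialChars` are hypothetical names, and the choice of characters to remove (the U+FFFD replacement character and the U+2022 bullet) is an assumption based on the characters shown in the question. Inside a ParseFilter, the result would be passed to `pd.setText(...)` as in the answer.

```java
public class TextCleaner {

    // Hypothetical helper, not part of StormCrawler.
    // Replaces the Unicode replacement character (U+FFFD) and the bullet (U+2022)
    // with a space, then collapses runs of whitespace into a single space.
    public static String stripSpecialChars(String text) {
        if (text == null) {
            return null;
        }
        String cleaned = text.replaceAll("[\\uFFFD\\u2022]", " ");
        return cleaned.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "Price \uFFFD\uFFFD \u2022 item";
        System.out.println(stripSpecialChars(raw));
    }
}
```

Replacing with a space rather than deleting outright avoids gluing neighbouring words together when the unwanted character was the only separator between them.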

StormCrawler cannot connect to ElasticSearch

Question: While running the command: storm jar target/crawlIndexer-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 86400000 I get an error saying: 8710 [Thread-26-status-executor[4 4]] ERROR c.d.s.e.p.StatusUpdaterBolt - Can't connect to ElasticSearch When I open http://localhost:9200/ in a browser, ES loads successfully. Kibana also connects to ES. So the problem must be the connection from StormCrawler to Elasticsearch. What could be the issue? Snippet of full error: 8710

StormCrawler with SQL external module gets ParseFilters exception at crawl stage

Question: I use StormCrawler with the SQL external module. I have updated my pom.xml with: <dependency> <groupId>com.digitalpebble.stormcrawler</groupId> <artifactId>storm-crawler-sql</artifactId> <version>1.8</version> </dependency> I use an injector/crawl procedure similar to the one for the ES setup: storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000 I have created a MySQL database crawl with a table urls and successfully injected my URLs into it. For

StormCrawler DISCOVER and FETCH a website but nothing gets saved in docs

Question: There is a website that I'm trying to crawl. The crawler DISCOVERs and FETCHes the URLs, but there is nothing in docs. The website is https://cactussara.ir . Where is the problem? This is the website's robots.txt: User-agent: * Disallow: / And this is my urlfilters.json: { "com.digitalpebble.stormcrawler.filtering.URLFilters": [ { "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter", "name": "BasicURLFilter", "params": { "maxPathRepetition": 8, "maxLength":

StormCrawler: Timeout waiting for connection from pool

Question: We consistently get the following error when we increase either the number of threads or the number of executors for the Fetcher bolt. org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[stormjar.jar:?] at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~

StormCrawler not indexing content with Elasticsearch

Question: StormCrawler is indexing to Elasticsearch, but not the content. StormCrawler is up to date with 'origin/master' of https://github.com/DigitalPebble/storm-crawler.git and I am using elasticsearch-5.6.4. crawler-conf.yaml has indexer.url.fieldname: "url" indexer.text.fieldname: "content" indexer.canonical.name: "canonical" The url and title fields are indexed, but not content. I have been trying to get this working by following Julien's tutorial at: https://www.youtube.com/watch?v=xMCuWpPh-4A

Run StormCrawler in local mode or install Apache Storm?

Question: I'm trying to figure out how to install and set up Storm/StormCrawler with ES and Kibana as described here. I have never installed Storm on my local machine, because I've worked with Nutch before and never had to install Hadoop locally, so I thought it might be the same with Storm (maybe not?). I'd like to start crawling with StormCrawler instead of Nutch now. It seems that if I just download a release and add the /bin to my PATH, I can only talk to a remote cluster. It seems like I need to set up a