Stormcrawl with SQL external module gets ParseFilters exception at crawl sage

浪尽此生 提交于 2020-01-06 05:42:39

问题


I use Stromcrawler with SQL external module. I have updated my pop.xml with:

<dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-sql</artifactId>
        <version>1.8</version>
</dependency>

I use similar injector/crawl procedure as in the case with ES setup:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000

I have created mysql database crawl, table urls and successfully injected my urls in it. For example, If I do select * from crawl.urls limit 5;, I can see urls, status, and other fields. From this, I conclude that at this stage, the crawler connects to the database.

Sql-injector looks like this:

name: "injector"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

components:
 - id: "scheme"
className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
constructorArgs:
  - DISCOVERED

spouts:
 - id: "spout"
  className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
  - "seeds.txt"
  - ref: "scheme"

bolts:
- id: "status"
className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
parallelism: 1

streams:
 - from: "spout"
to: "status"
grouping:
  type: CUSTOM
  customClass:
    className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
    constructorArgs:
      - "byHost"

When I run:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --remote sql-crawler.flux

I got the following exception at the Parse bolt:

java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67) at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116) at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803) at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Unable to build JSON object from file at com.digitalpebble.stormcrawler.parse.ParseFilters.(ParseFilters.java:92) at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62) ... 5 more Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...

Screenshot of StormUI

sql-crawler.flux:

name: "crawler"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
parallelism: 100

bolts:
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 1
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 1
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 1
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 1
- id: "status"
className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
parallelism: 1


streams:
- from: "spout"
to: "partitioner"
grouping:
  type: SHUFFLE

- from: "partitioner"
to: "fetcher"
grouping:
  type: FIELDS
  args: ["key"]

- from: "fetcher"
to: "sitemap"
grouping:
  type: LOCAL_OR_SHUFFLE

- from: "sitemap"
to: "parse"
grouping:
  type: LOCAL_OR_SHUFFLE

- from: "fetcher"
to: "status"
grouping:
  type: FIELDS
  args: ["url"]
  streamId: "status"

- from: "sitemap"
to: "status"
grouping:
  type: FIELDS
  args: ["url"]
  streamId: "status"

- from: "parse"
to: "status"
grouping:
  type: FIELDS
  args: ["url"]
  streamId: "status"

It looks like object StringUtils at ParseFilters.java:60 is blank.


回答1:


Check the content of src/main/resources.parsefilters.json (or whichever value you might have set for parsefilters.config.file), judging by the error message, the JSON it contains is not valid. Don't forget to rebuild the uber jar with mvn clean package



来源:https://stackoverflow.com/questions/50483996/stormcrawl-with-sql-external-module-gets-parsefilters-exception-at-crawl-sage

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!