Nutch regex-urlfilter syntax

一生所求 2020-12-21 05:08

I am running Nutch 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax right for the file NUTCH_ROOT/conf/regex-urlfilter.txt.

1 answer
  • 2020-12-21 05:49

    According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F, you can't have multiple URLs (they will be ignored). How about using only this rule:

    +^http://www.example.com/foo.cfm/(.+)*$
    

    which is intended to cover your first line, +^http://www.example.com/foo.cfm$, as well. Note that, as written, it only matches URLs that have a / after foo.cfm. If there are problems with /, try:

    +^http://www.example.com/foo.cfm//?(.+)*$
    

    Here //? matches a single / or a double //.
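
To see which URLs these rules would let through, here is a minimal sketch that tries the two patterns against a few sample URLs. It rests on some assumptions: Nutch evaluates the rules with Java regular expressions, while this sketch uses Python's `re` module, which should behave the same way for these simple constructs; the rule is applied as an unanchored search; and the leading `+` (accept) marker is stripped. The `accepts` helper, the sample URLs, and the third pattern `RULE_OPTIONAL_PATH` are hypothetical and not part of the original answer.

```python
import re

# Patterns copied from the answer, without the leading '+' (accept) marker.
RULE_SLASH = r"^http://www.example.com/foo.cfm/(.+)*$"
RULE_DOUBLE_SLASH = r"^http://www.example.com/foo.cfm//?(.+)*$"

# Hypothetical variant (an assumption, not from the answer): the "/..." suffix
# is optional and the dots are escaped so they only match a literal '.'.
RULE_OPTIONAL_PATH = r"^http://www\.example\.com/foo\.cfm(/.*)?$"


def accepts(rule: str, url: str) -> bool:
    """Return True if the pattern is found somewhere in the URL."""
    return re.search(rule, url) is not None


if __name__ == "__main__":
    sample_urls = [
        "http://www.example.com/foo.cfm",
        "http://www.example.com/foo.cfm/",
        "http://www.example.com/foo.cfm/bar",
        "http://www.example.com/foo.cfm//bar",
    ]
    rules = {
        "slash": RULE_SLASH,
        "double-slash": RULE_DOUBLE_SLASH,
        "optional-path": RULE_OPTIONAL_PATH,
    }
    for url in sample_urls:
        for name, rule in rules.items():
            print(f"{name:14} {url:40} {accepts(rule, url)}")
```

In this sketch, both patterns from the answer reject the bare URL ending in foo.cfm, because each requires at least one / after it. Making the whole suffix optional, as in the hypothetical third pattern, is one way to accept both forms with a single rule.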
