I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax correct for the file NUTCH_ROOT/conf/regex-urlfilter.txt.
According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (they will be ignored). How about using only this rule:
+^http://www.example.com/foo.cfm/(.+)*$
which should cover your first line, +^http://www.example.com/foo.cfm$, as well. Or, if there are problems with the /, try:
+^http://www.example.com/foo.cfm//?(.+)*$
Where //? should stand for the character /
or
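Before relying on either rule, it may be worth checking what the patterns actually match. The Java sketch below assumes that Nutch drops the leading + and applies the rest of the line as a java.util.regex pattern with find() semantics. The class name UrlFilterCheck, the helper accepts, and the third rule ending in (/.*)? are only illustrations and are not part of Nutch or of the rules suggested here. Under that assumption, both suggested rules need at least one / after foo.cfm, so the bare http://www.example.com/foo.cfm would not be matched by them. The hypothetical third rule makes the slash and everything after it optional, and also escapes the dots so they match only a literal dot.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {

    // Hypothetical helper: strips the leading '+' and tests the rest as a Java regex.
    // Assumes find() semantics; this is a sketch, not Nutch's own code.
    static boolean accepts(String rule, String url) {
        return Pattern.compile(rule.substring(1)).matcher(url).find();
    }

    public static void main(String[] args) {
        String[] rules = {
            "+^http://www.example.com/foo.cfm/(.+)*$",       // first suggested rule
            "+^http://www.example.com/foo.cfm//?(.+)*$",     // second suggested rule
            "+^http://www\\.example\\.com/foo\\.cfm(/.*)?$"  // hypothetical variant
        };
        String[] urls = {
            "http://www.example.com/foo.cfm",
            "http://www.example.com/foo.cfm/",
            "http://www.example.com/foo.cfm/bar"
        };

        // Print, for every rule, whether each sample URL would be accepted.
        for (String rule : rules) {
            System.out.println(rule);
            for (String url : urls) {
                String verdict = accepts(rule, url) ? "match   " : "no match";
                System.out.println("  " + verdict + "  " + url);
            }
        }
    }
}
```

If the bare foo.cfm URL has to be crawled, a rule along the lines of the third one, or a separate +^http://www.example.com/foo.cfm$ line, is probably needed. It would be best to confirm this against your own Nutch installation.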