Question
I'm using crawler4j to build a simple web crawler, and I want to invoke the crawl control every 10 minutes. I created a servlet that starts when my Tomcat server starts, and in it I use a ScheduledExecutorService for the scheduling. However, the crawl control only fetches data once, not every 10 minutes as I wanted. Is there a better way to schedule my crawl to run every 10 minutes? Below is the code in my servlet.
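import static java.util.concurrent.TimeUnit.MINUTES;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

public class ScheduleControl extends HttpServlet {
    private static final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    @Override
    public void init() throws ServletException {
        // The task that should run every 10 minutes.
        final Runnable crawler = new Runnable() {
            @Override
            public void run() {
                String[] args = {"/Users/kevin/Desktop", "7"};
                try {
                    SaleCrawlControl.main(args);
                } catch (Exception e) {
                    System.out.println("Exception " + e);
                }
            }
        };
        // Start immediately, then repeat at a fixed 10-minute rate.
        final ScheduledFuture<?> crawlerHandle = scheduler.scheduleAtFixedRate(crawler, 0, 10, MINUTES);
        // After 60 minutes, cancel the periodic crawl and shut the scheduler down.
        scheduler.schedule(new Runnable() {
            @Override
            public void run() {
                crawlerHandle.cancel(true);
                scheduler.shutdown();
            }
        }, 60, MINUTES);
    }
}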
Answer 1:
crawler4j version 3.6 and later include fixes that resolve this issue. I was using version 3.5, which is why I ran into it. After I upgraded to version 4.1, it worked.
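Separately from the crawler4j version, one documented property of ScheduledExecutorService can also produce a "runs only once" symptom: if any execution of a periodic task throws, all subsequent executions are suppressed. The servlet's catch (Exception e) does not cover an Error thrown inside the crawl. The following is a minimal sketch of a guard against that, not the code from the question: PeriodicCrawlSketch, guarded, and runDemo are hypothetical names, the crawl is simulated by a counter whose first run fails on purpose, and a millisecond period stands in for the 10 minutes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PeriodicCrawlSketch {

    // Wraps a task so that nothing it throws reaches the scheduler;
    // an uncaught throwable would suppress all later runs.
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                System.err.println("Crawl run failed: " + t);
            }
        };
    }

    // Schedules a simulated crawl, waits until it has started `runs` times
    // (or 5 seconds pass), then stops the scheduler and returns the run count.
    static int runDemo(int runs, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger count = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);

        // Hypothetical stand-in for SaleCrawlControl.main(args);
        // the first run fails on purpose.
        Runnable crawl = () -> {
            int n = count.incrementAndGet();
            done.countDown();
            if (n == 1) {
                throw new IllegalStateException("simulated crawl failure");
            }
        };

        scheduler.scheduleWithFixedDelay(guarded(crawl), 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("Simulated crawl runs: " + runDemo(3, 50));
    }
}
```

The sketch uses scheduleWithFixedDelay, which starts counting the period only after the previous run has finished. With scheduleAtFixedRate, as in the question, a crawl that overruns its period is followed by the next one immediately. Either method works with the same guard.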
Source: https://stackoverflow.com/questions/28636610/how-to-schedule-crawler4j-crawl-control-to-run-periodically