Question
I've been using crawler4j for a few months. I recently noticed that it hangs on some sites and never returns. The recommended solution is to set resumable to true, but that is not an option for me because I am limited on space. I ran multiple tests and found that the hang is random: the crawler processes between 90 and 140 URLs and then stops. I thought it might be the site, but there is nothing suspicious in the site's robots.txt and all pages respond with 200 OK. I know the crawler hasn't crawled the entire site, otherwise it would shut down. What could be causing this, and where should I start?
What's interesting is that I start the crawlers with startNonBlocking and follow it with a while loop that checks the status:
controller.startNonBlocking(CrawlProcess.class, numberOfCrawlers);
while(true){
    System.out.println("While looping");
}
When the crawler hangs, the while loop also stops responding, but the thread is still alive. That means the entire thread is unresponsive, so I am unable to send a shutdown command.
UPDATE: I figured out what is causing it to hang. I run a store-to-MySQL step in the visit method. The step looks like this:
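As an aside on the status loop: a `while(true)` that only prints will spin a core and flood stdout. A minimal sketch of a gentler alternative is a polling loop with a sleep and a deadline. The class and method names here (`CrawlWatcher`, `waitUntilFinished`) are hypothetical, and the `BooleanSupplier` is a stand-in for the real status check; with crawler4j one would presumably pass `controller::isFinished` and call `controller.shutdown()` on timeout, assuming the installed version exposes both. This does not by itself explain or cure the hang.

```java
import java.util.function.BooleanSupplier;

class CrawlWatcher {
    /**
     * Polls isFinished until it returns true or timeoutMillis elapses.
     * Returns true if the work finished in time, false on timeout or interrupt.
     */
    static boolean waitUntilFinished(BooleanSupplier isFinished, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!isFinished.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed: the caller can now request a shutdown
            }
            try {
                Thread.sleep(pollMillis); // yield the CPU between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for controller::isFinished: reports "finished" after about 200 ms.
        boolean done = waitUntilFinished(() -> System.currentTimeMillis() - start > 200, 2_000, 50);
        System.out.println(done ? "finished" : "timed out");
    }
}
```

The timeout gives the calling thread a point at which it can decide to shut the crawl down instead of waiting forever.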
public void insertToTable(String dbTable, String url2, String cleanFileName, String dmn, String AID,
        String TID, String LID, String att, String ttl, String type, String lbl, String QL,
        String referrer, String DID, String fp_type, String ipAddress, String aT, String sNmbr) throws SQLException, InstantiationException, IllegalAccessException, ClassNotFoundException{
    try{
        String strdmn = "";
        if(dmn.contains("www")){
            strdmn = dmn.replace("http://www.","");
        }else{
            strdmn = dmn.replace("http://","");
        }
        String query = "INSERT INTO "+dbTable
            +" (url,filename, dmn, AID, TID, LID, att, ttl, type, lbl, tracklist, referrer, DID, searchtype, description, fp_type, ipaddress," +
            " aT, sNmbr, URL_Hash, iteration)VALUES('"
            +url2+"','"+cleanFileName+"','"+strdmn+"','"+AID+"','"+TID+"','"+LID+"','"+att+"','"+ttl+"','"+type+"'" +
            ",'"+lbl+"','"+QL+"','"+dmn+"','"+DID+"','spider','"+cleanFileName+"','"+fp_type+"'," +
            "'"+ipAddress+"','"+aT+"','"+sNmbr+"',MD5('"+url2+"'), 1) ON DUPLICATE KEY UPDATE iteration = iteration + 1";
        Statement st2 = null;
        con = DbConfig.openCons();
        st2 = con.createStatement();
        st2.executeUpdate(query);
        //st2.execute("SELECT NOW()");
        st2.close();
        con.close();
        if(con.isClosed()){
            System.out.println("CON is CLOSED");
        }else{
            System.out.println("CON is OPEN");
        }
        if(st.isClosed()){
            System.out.println("ST is CLOSED");
        }else{
            System.out.println("ST is OPEN");
        }
    }catch(NullPointerException npe){
        System.out.println("NPE: " + npe);
    }
}
What's very interesting is that when I run st2.execute("SELECT NOW()"); instead of the current st2.executeUpdate(query);, it works fine and crawls the site without hanging. But for some reason st2.executeUpdate(query) causes it to hang after a few queries. I don't think it's MySQL, because no exceptions are output. I thought I might be getting a "too many connections" error from MySQL, but that isn't the case. Does my process make sense to anyone?
Answer 1:
The importance of a finally block.
My crawler uses c3p0 connection pooling to insert into MySQL. After a few queries the crawler would stop responding. Thanks to @djechlin's advice, it turned out to be a connection leak in the c3p0 pool. I added a finally block like the one below and it works great now!
try{
    //the insert method is here
}catch(SQLException e){
    e.printStackTrace();
}finally{
    if(st != null){
        st.close();
    }
    if(rs != null){
        rs.close();
    }
}
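The same cleanup can also be written with try-with-resources (Java 7+), which closes each resource even when the statement throws. With a pool such as c3p0, closing the Connection is what returns it to the pool, so the sketch below closes the connection as well as the statement. This is a minimal sketch and not the original code: `LeakFreeInsert`, `executeAndClose`, and `fakeConnection` are hypothetical names, and the proxy-based fake connection exists only so the example runs without a database. In real code the connection would come from the pool (for example the question's `DbConfig.openCons()`).

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class LeakFreeInsert {
    /** Runs one update; the statement and the connection are closed even if it throws. */
    static int executeAndClose(Connection pooled, String sql) throws SQLException {
        // Resources are closed in reverse order: statement first, then connection.
        try (Connection con = pooled; Statement st = con.createStatement()) {
            return st.executeUpdate(sql);
        }
    }

    /** Hypothetical test double: records close() calls and can simulate a failing update. */
    static Connection fakeConnection(List<String> log, boolean failOnUpdate) {
        ClassLoader loader = LeakFreeInsert.class.getClassLoader();
        Statement st = (Statement) Proxy.newProxyInstance(loader, new Class<?>[] {Statement.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "executeUpdate":
                        if (failOnUpdate) throw new SQLException("simulated failure");
                        return 1; // pretend one row was affected
                    case "close":
                        log.add("statement closed");
                        return null;
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
        return (Connection) Proxy.newProxyInstance(loader, new Class<?>[] {Connection.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "createStatement":
                        return st;
                    case "close":
                        log.add("connection closed");
                        return null;
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        try {
            executeAndClose(fakeConnection(log, true), "INSERT ...");
        } catch (SQLException e) {
            System.out.println("update failed: " + e.getMessage());
        }
        System.out.println(log); // both resources were closed despite the failure
    }
}
```

In the question's original method, `con.close()` sits inside the try body, so any exception thrown by `executeUpdate` skips it and the connection never goes back to the pool.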
Source: https://stackoverflow.com/questions/24807637/why-is-crawler4j-hanging-randomly