Spiderman, an open-source Java vertical crawler: a scraping example [moderately complex requirements]

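First, note that this article only covers an example of Spiderman parsing XML. Parsing HTML with Spiderman works on much the same principle, although it puts more demands on the crawler's capabilities.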

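I will cover HTML parsing in detail in a separate article [that article has since been published] :) The spiderman-sample project on GitHub contains several examples that you can run yourself.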

Spiderman project page: http://www.oschina.net/p/spiderman

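1. Spiderman is a crawler for vertical domains. It fetches the content of specific target pages and parses it into the business data you need. The whole process aims to work without writing any code, which keeps deployment simple and lets you adapt flexibly when the page content changes.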

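2. The target URL scraped in this article is http://www.alldealsasia.com/feeds/xml, an XML file that lists all of the site's active deals.
3. How to set up Spiderman with Git and Maven is not covered in detail here.
4. Let's go straight to the result.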

This is the target page [an XML page].
To reach the goal above, you configure an XML file that tells Spiderman what to do with the target.
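
To make concrete what such a configuration asks the crawler to do (select the repeated nodes with XPath, then read a few fields from each), here is a minimal stand-alone sketch in plain Java. It does not use Spiderman or its configuration format; it relies only on the JDK's javax.xml.xpath API. The sample XML and the element names deal, title and price are assumptions made for illustration, and the real feed may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FeedXPathSketch {
    // Hypothetical sample data: the element names are assumed, not taken from the real feed.
    static final String SAMPLE_XML =
        "<deals>"
      + "<deal><title>Spa voucher</title><price>25</price></deal>"
      + "<deal><title>Dinner for two</title><price>40</price></deal>"
      + "</deals>";

    // Turns every <deal> element into one field-name -> value map.
    static List<Map<String, Object>> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select the repeated nodes first, then evaluate field expressions relative to each one.
        NodeList nodes = (NodeList) xpath.evaluate("//deal", doc, XPathConstants.NODESET);
        List<Map<String, Object>> models = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node deal = nodes.item(i);
            Map<String, Object> model = new LinkedHashMap<>();
            model.put("title", xpath.evaluate("title/text()", deal));
            model.put("price", xpath.evaluate("price/text()", deal));
            models.add(model);
        }
        return models;
    }

    public static void main(String[] args) throws Exception {
        for (Map<String, Object> model : parse(SAMPLE_XML)) {
            System.out.println(model);
        }
    }
}
```

The list of maps returned here has the same shape as the models argument that the onParse callback receives, which is why it was chosen as the return type for this sketch.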

Finally, here is the data produced by the crawl. I write it to a file inside the callback method:

// Initialize the spider
Spiderman.init(new SpiderListener() {
    public void onNewUrls(Thread thread, Task task, Collection<String> newUrls) {}
    public void onDupRemoval(Thread currentThread, Task task, Collection<Task> validTasks) {}
    public void onNewTasks(Thread thread, Task task, Collection<Task> newTasks) {}
    public void onTargetPage(Thread thread, Task task, Page page) {}
    public void onInfo(Thread thread, String info) {}
    public void onError(Thread thread, String err, Exception e) {
        e.printStackTrace();
    }
    public void onParse(Thread thread, Task task, List<Map<String, Object>> models, int count) {
        // Serialize the parsed models to JSON and write them to a file named after count
        String content = CommonUtil.toJson(models);
        try {
            FileUtil.writeFile(new File("d:/jsons/spiderman-result-"+count+".json"), content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public void onPojo(Thread thread, List<Object> pojo, int count){}
});

// Start the spider
Spiderman.start();

// Let it run for 30 seconds
Thread.sleep(CommonUtil.toSeconds("30s").longValue()*1000);

// Shut the spider down
Spiderman.stop();

Open the file and pretty-print its JSON content:
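
If no JSON tool is at hand, the formatting step can be approximated with a few lines of plain Java. The following is a minimal sketch, not a full JSON parser: it assumes the input is already valid JSON and only inserts line breaks and indentation outside of string literals. Empty arrays and objects end up spread over several lines. The class name and the sample string are hypothetical.

```java
public class JsonPrettyPrintSketch {
    // Re-indents valid JSON text; the content of string literals is copied unchanged.
    static String format(String json) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < json.length()) {
                    out.append(json.charAt(++i)); // copy the escaped character verbatim
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            switch (c) {
                case '"':
                    inString = true;
                    out.append(c);
                    break;
                case '{':
                case '[':
                    out.append(c);
                    newline(out, ++depth);
                    break;
                case '}':
                case ']':
                    newline(out, --depth);
                    out.append(c);
                    break;
                case ',':
                    out.append(c);
                    newline(out, depth);
                    break;
                case ':':
                    out.append(": ");
                    break;
                default:
                    // Drop whitespace that was outside strings; keep numbers, true, false, null.
                    if (!Character.isWhitespace(c)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Starts a new line indented by two spaces per nesting level.
    private static void newline(StringBuilder out, int depth) {
        out.append('\n');
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample in the shape of a list of parsed models.
        System.out.println(format("[{\"title\":\"Spa voucher\",\"price\":\"25\"}]"));
    }
}
```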

To be continued...

Source: oschina

Link: https://my.oschina.net/u/146149/blog/99937
