Java Web Crawler (Part 1)
Required JARs:
- htmlparser.jar
- httpclient-4.1.2.jar
HttpClient:
- Simulates client requests.
    HttpClient httpClient = new DefaultHttpClient();
- HttpGet: the request method. HttpPost is also available.
    HttpGet httpGet = new HttpGet(url);
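For the HttpPost variant mentioned above, a form POST can be sketched by attaching a UrlEncodedFormEntity to the request. The class name PostRequestSketch, the URL, and the form fields q and page are made up for illustration; only the HttpClient 4.1 classes come from the library.

```java
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

// Hypothetical helper class; not part of HttpClient.
class PostRequestSketch {

    // Builds a form POST. The field names "q" and "page" are illustrative assumptions.
    static HttpPost buildSearchPost(String url, String keyword)
            throws UnsupportedEncodingException {
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("q", keyword));
        form.add(new BasicNameValuePair("page", "1"));

        HttpPost httpPost = new HttpPost(url);
        // The form is sent as an application/x-www-form-urlencoded body.
        httpPost.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));
        return httpPost;
    }

    public static void main(String[] args) throws Exception {
        HttpPost httpPost = buildSearchPost("http://example.com/search", "java");
        System.out.println(httpPost.getMethod());
        System.out.println(EntityUtils.toString(httpPost.getEntity()));
        // To send it, pass the request to the client: httpClient.execute(httpPost)
    }
}
```

The request is executed the same way as an HttpGet; only the request object differs.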
- HttpResponse: the server's response.
    HttpResponse response = httpClient.execute(httpGet);
    // Get the response status code
    int status = response.getStatusLine().getStatusCode();
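- HttpEntity: the response entity (the body).
    // A 200 status code generally means the entity is available
    HttpEntity entity = response.getEntity();
    // The entity can be converted to a byte array or a String. Its content
    // stream can normally be read only once, so pick ONE of the following:
    byte[] bytes = EntityUtils.toByteArray(entity);
    String msg1 = EntityUtils.toString(entity);
    String msg2 = EntityUtils.toString(entity, "UTF-8");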
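On the two toString overloads: to the best of my knowledge, in HttpClient 4.1 the one-argument form falls back to ISO-8859-1 when the response declares no charset, which garbles a Chinese page served as UTF-8. The effect can be reproduced with plain JDK decoding. CharsetDemo and its sample text are made up for this sketch.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; uses only the JDK.
class CharsetDemo {

    // Decodes raw response bytes with the given charset name.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u97f3\u4e50" is the word for "music", written with escapes so the
        // source file encoding does not matter.
        String original = "\u97f3\u4e50";
        byte[] body = original.getBytes(Charset.forName("UTF-8"));

        String right = decode(body, "UTF-8");
        String wrong = decode(body, "ISO-8859-1");

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 6, one char per UTF-8 byte
    }
}
```

Passing "UTF-8" explicitly, or decoding the byte array yourself, avoids depending on that fallback.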
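The HttpClient steps above can be combined into one small method. This is a minimal sketch, assuming httpclient-4.1.2.jar and its dependencies are on the classpath. PageFetcher and fetch are hypothetical names, and returning null for a non-200 response is just one possible policy.

```java
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Hypothetical helper class; not part of HttpClient.
class PageFetcher {

    // Returns the page body, or null when the status is not 200 or there is no entity.
    static String fetch(String url) throws IOException {
        HttpClient httpClient = new DefaultHttpClient();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpClient.execute(httpGet);

            int status = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if (status != HttpStatus.SC_OK || entity == null) {
                return null;
            }
            // "UTF-8" is used only if the response does not declare a charset.
            return EntityUtils.toString(entity, "UTF-8");
        } finally {
            // Release the connections held by this client.
            httpClient.getConnectionManager().shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Creating one client per call keeps the sketch short; a real crawler would more likely share a single client across requests.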
HtmlParser:
- Parser: parses the page at a given URL.
    Parser parser = new Parser(url);
    parser.setEncoding("UTF-8");
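- NodeFilter: the interface for filtering tags; implement its accept method.
    // Example: match every div tag whose class attribute is music_block.
    // For a tag node, getText() returns the tag without the angle brackets,
    // so this prefix check only matches when class is the first attribute.
    NodeFilter musicBlock = new NodeFilter() {
        @Override
        public boolean accept(Node node) {
            return node.getText().startsWith("div class=\"music_block\"");
        }
    };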
- NodeList: a list of nodes (tags). Given a NodeFilter, the parser returns the matching nodes as a NodeList.
    NodeList musicBlocks = parser.extractAllNodesThatMatch(musicBlock);
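As an alternative to the hand-written prefix check, HtmlParser ships ready-made filters that can be combined. The sketch below, which assumes htmlparser.jar is on the classpath, parses an in-memory HTML string instead of a URL so it runs offline. MusicBlockFinder and the sample markup are made up for illustration.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class; not part of HtmlParser.
class MusicBlockFinder {

    // Returns every <div> whose class attribute equals "music_block".
    static NodeList findMusicBlocks(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeFilter filter = new AndFilter(
                new TagNameFilter("div"),
                new HasAttributeFilter("class", "music_block"));
        return parser.extractAllNodesThatMatch(filter);
    }

    public static void main(String[] args) throws ParserException {
        // Sample markup invented for this example.
        String html = "<html><body>"
                + "<div class=\"music_block\">Song A</div>"
                + "<div class=\"other\">skip</div>"
                + "<div class=\"music_block\">Song B</div>"
                + "</body></html>";

        NodeList blocks = findMusicBlocks(html);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println(blocks.elementAt(i).toPlainTextString());
        }
    }
}
```

Unlike the startsWith check, this match does not depend on class being the first attribute of the tag.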
- Node: a tag.
  - Get a Node from a NodeList (indexes are zero-based, so this is the third match):
      Node aBlock = musicBlocks.elementAt(2);
  - Reach related nodes from a Node:
      // First child
      Node node1 = aBlock.getFirstChild();
      // Last child
      Node node2 = aBlock.getLastChild();
      // Next sibling
      Node node3 = aBlock.getNextSibling();
      // There are many more such methods; they are not covered here.
  - Convert a Node to a String:
      String info1 = aBlock.getText();
      String info2 = aBlock.toString();
      String info3 = aBlock.toHtml();
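To see how the string conversions differ, here is a small sketch that parses a one-tag fragment and prints each form, assuming htmlparser.jar is on the classpath. NodeWalk, firstNode, and the sample markup are hypothetical. toPlainTextString is an additional Node method that returns only the text content.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical demo class; not part of HtmlParser.
class NodeWalk {

    // Parses an HTML fragment and returns its first top-level node.
    static Node firstNode(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList nodes = parser.parse(null); // null filter: keep all top-level nodes
        return nodes.elementAt(0);
    }

    public static void main(String[] args) throws ParserException {
        // Sample markup invented for this example.
        Node block = firstNode("<div class=\"music_block\"><span>Song A</span></div>");

        System.out.println(block.getText());           // the tag itself, without angle brackets
        System.out.println(block.toHtml());            // the tag with its children as HTML
        System.out.println(block.toPlainTextString()); // text content only

        // getChildren() may return null for a node that has no children.
        NodeList children = block.getChildren();
        if (children != null) {
            for (int i = 0; i < children.size(); i++) {
                System.out.println("child: " + children.elementAt(i).getText());
            }
        }
    }
}
```

On real pages, whitespace between tags is parsed into text nodes too, so getFirstChild or getNextSibling may return a blank text node rather than the next tag.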
Source: oschina
Link: https://my.oschina.net/u/4189208/blog/3157921