Web Crawlers: HttpClient

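Crawler technology can fetch publicly available web pages and other documents from the internet. In Java, HttpClient is a convenient component for simulating HTTP requests and building crawlers.

Below is a simple example that crawls job postings: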

1 Downloading HttpClient

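At the time of writing, the latest HttpClient version is 4.x, which can be downloaded from the official website. This chapter uses commons-httpclient-3.0.1.jar. The 3.x and 4.x lines have different APIs (3.x lives in the org.apache.commons.httpclient package, 4.x in org.apache.http), so the code below applies to 3.x only.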

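This is the core jar, but a complete crawler application needs further dependencies. commons-httpclient 3.x itself requires commons-logging and commons-codec, and the example at the end of this chapter also uses jsoup to parse HTML.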

2 Simulating requests with HttpClient

2.1 Create an HttpClient object:

HttpClient httpClient=new HttpClient();

2.2 Request a page with GET or POST:

GetMethod getMethod=new GetMethod("http://www.51job.com");

For a POST request, use:

PostMethod postMethod=new PostMethod("http://www.51job.com");

2.3 Execute the request:

httpClient.executeMethod(getMethod);

2.4 Get the returned page:

String html=getMethod.getResponseBodyAsString();

If the page is very large, use:

getMethod.getResponseBodyAsStream();

It returns an InputStream, which we have to read ourselves.
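Reading that stream can be sketched with the standard library alone. The class and method names below (StreamBodyReader, readAll) are hypothetical, and a ByteArrayInputStream stands in for the stream returned by getResponseBodyAsStream() so that the sketch runs without a network connection. It assumes the caller knows the page's charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
final class StreamBodyReader {

    // Copies the stream in fixed-size chunks and decodes once at the end,
    // so a multi-byte character is never split across two chunks.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getMethod.getResponseBodyAsStream(); no request is made here.
        byte[] fakeBody = "<html>\u804c\u4f4d</html>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(fakeBody)) {
            System.out.println(readAll(in, StandardCharsets.UTF_8));
        }
    }
}
```

This version still holds the whole body in memory. For pages that are too large for that, the same loop can write each chunk to a file or feed a parser instead of a buffer.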

2.5 Release the connection:

getMethod.releaseConnection();

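3 Parameters, headers, and cookies

3.1 For POST requests (for example, simulating a login), you sometimes need to pass request parameters. First build an array of parameters: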

NameValuePair[] nvps=new NameValuePair[]{

new NameValuePair("username","admin"),

new NameValuePair("password","admin")

};

postMethod.setRequestBody(nvps);
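A parameter array passed this way is sent as a form-encoded request body. To show what that body looks like on the wire, here is a minimal standard-library sketch. The names FormBodySketch and encode are hypothetical, and the sketch assumes UTF-8 for percent-encoding, which may differ from the request charset HttpClient is configured with.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of application/x-www-form-urlencoded; not HttpClient code.
final class FormBodySketch {

    // Joins name=value pairs with '&', percent-encoding both sides.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "admin");
        params.put("password", "admin");
        System.out.println(encode(params)); // prints username=admin&password=admin
    }
}
```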

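3.2 Headers

Some servers run business logic based on the request headers. If a simulated request does not send those headers, it may not produce the result we want.

These header values are not made up; they are taken from an HTTP analysis tool (for example HttpWatch).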

getMethod.setRequestHeader("Accept-Language","zh-CN,en-US;q=0.7,ja;q=0.3");

getMethod.setRequestHeader("Accept","application/javascript, */*;q=0.8");

getMethod.setRequestHeader("Referer","http://www.baidu.com/");

...

3.3 Cookie

Sometimes a simulated request also has to carry cookie information (especially when a session is involved). A cookie is itself just another header:

method.setRequestHeader("Cookie", " H_PS_PSSID=2994_2776_1428_2975_2977_2250_2542_2701");

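This cookie value is not made up either; it was captured with HttpWatch.

When crawling pages behind a login, we often need to keep the cookies produced by earlier requests. First get the cookies produced by the previous request: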

Cookie[] cookies = httpClient.getState().getCookies();

String tmpcookies="";

for (Cookie c : cookies) {

tmpcookies += c.toString() + ";";

}

Then pass these cookies in:

method.setRequestHeader("Cookie", tmpcookies);
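A Cookie request header is a list of name=value pairs separated by "; ". The loop above leaves a trailing ";" at the end, which servers usually tolerate. A small standard-library sketch that builds the header value without it follows. CookieHeaderSketch, toCookieHeader, and the cookie names and values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of HttpClient.
final class CookieHeaderSketch {

    // Builds "name1=value1; name2=value2" with no trailing separator.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> c : cookies.entrySet()) {
            joiner.add(c.getKey() + "=" + c.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up cookies standing in for those returned by httpClient.getState().getCookies().
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("lang", "c");
        System.out.println(toCookieHeader(cookies)); // prints JSESSIONID=abc123; lang=c
    }
}
```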

The core code is as follows:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;


public class JobCrawl {

    public static void main(String[] args) throws Exception {
        // Create the HTTP client
        HttpClient httpClient = new HttpClient();
        // Create a GET request for the search result list
        GetMethod getMethod = new GetMethod("http://search.51job.com/list/%2B,%2B,%2B,%2B,%2B,%2B,java,2,%2B.html?lang=c&stype=1&image_x=30&image_y=18");
        String html;
        try {
            // Execute the request
            httpClient.executeMethod(getMethod);
            // Read the returned page
            html = getMethod.getResponseBodyAsString();
        } finally {
            // Always release the connection (step 2.5)
            getMethod.releaseConnection();
        }
        // Transcode: getResponseBodyAsString() falls back to ISO-8859-1 when the
        // response declares no charset, so re-decode the same bytes as GB2312
        html = new String(html.getBytes("ISO-8859-1"), "GB2312");
        Document doc = Jsoup.parse(html);
        Elements elements = doc.select("a.jobname");
        for (int i = 0; i < elements.size(); i++) {
            Element ele = elements.get(i);
            String url = ele.attr("href");
            // Fetch the detail page of each job posting
            GetMethod gm = new GetMethod(url);
            String detailJob;
            try {
                httpClient.executeMethod(gm);
                detailJob = gm.getResponseBodyAsString();
            } finally {
                gm.releaseConnection();
            }
            detailJob = new String(detailJob.getBytes("ISO-8859-1"), "GB2312");
            Utils.createFile("D:\\Workspaces\\HttpClientTest\\doc", i + ".html", detailJob);
            Document job_doc = Jsoup.parse(detailJob);
            // Job title
            String jobname = job_doc.select("td.sr_bt").get(0).text();

            // Company name
            Element company_a = job_doc.select("table.jobs_1 a").get(0);
            String companyname = company_a.text();

            // Job function (the selector text must match the Chinese label on the page)
            Element target = job_doc.select("strong:contains(职位职能)").get(0);
            System.out.println(target.nextSibling());
            TextNode targetNode = (TextNode) target.nextSibling();
            String targetName = targetNode.text();
            System.out.println("Job title: " + jobname);
            System.out.println("Company name: " + companyname);
            System.out.println("Job function: " + targetName);
            System.out.println("=====================================");
            //System.out.println(ele.text()+" "+ele.attr("href"));
        }
    }

}
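The transcoding step in JobCrawl works because ISO-8859-1 maps every byte value to exactly one character, so decoding with it loses nothing and the original bytes can be recovered and decoded again with the correct charset. The sketch below isolates that round trip using only the standard library. CharsetRoundTrip and redecode are hypothetical names, and the sketch assumes the JDK in use provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the ISO-8859-1 round trip; not HttpClient code.
final class CharsetRoundTrip {

    // Recovers the raw bytes from a string that was wrongly decoded as
    // ISO-8859-1, then decodes them with the charset the page really uses.
    static String redecode(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String original = "\u804c\u4f4d"; // the two characters of the word for "position"
        // Bytes as they would arrive from a GB2312 page
        byte[] wire = original.getBytes(gb2312);
        // What a client sees if it decodes those bytes as ISO-8859-1
        String wrong = new String(wire, StandardCharsets.ISO_8859_1);
        String fixed = redecode(wrong, gb2312);
        System.out.println(fixed.equals(original));
    }
}
```

The round trip only helps when the first decode really was ISO-8859-1. If the server declares a charset and the body is decoded with it, applying the conversion again would corrupt the text.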

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utils {

	// Writes html to directorpath/fileName, creating the directory if needed.
	public static void createFile(String directorpath, String fileName,
			String html) throws Exception {
		File director = new File(directorpath);
		if (!director.exists()) {
			director.mkdirs();
		}
		File f = new File(director, fileName);
		System.out.println(fileName);
		// Use a single writer with an explicit charset instead of opening the file
		// twice; GB2312 matches the charset the page was decoded from in JobCrawl.
		// try-with-resources closes the writer even if write() fails.
		try (Writer w = new OutputStreamWriter(new FileOutputStream(f), "GB2312")) {
			w.write(html);
		}
	}
}

Source: oschina

Link: https://my.oschina.net/u/1413992/blog/341371
