How to add proxy support to Jsoup?

独厮守ぢ 2020-11-28 03:47

I am a beginner in Java, and my first task is to parse some 10,000 URLs and extract some information from them. I am using Jsoup for this, and it's working fine.

7 answers
  • 2020-11-28 04:24

    Jsoup 1.9.1 and above: (recommended approach)

    // Fetch url with proxy
    Document doc = Jsoup //
                   .connect("http://www.example.com/") //
                   .proxy("127.0.0.1", 8080) // sets an HTTP proxy
                   .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
                   .header("Content-Language", "en-US") //
                   .get();
    

    You may also use the overload Connection#proxy(Proxy), which takes a Proxy instance (see below).

    Alternative with an explicit Proxy object: (verbose approach, also requires Jsoup 1.9.1 or later)

    // Setup proxy
    Proxy proxy = new Proxy(                                      //
            Proxy.Type.HTTP,                                      //
            InetSocketAddress.createUnresolved("127.0.0.1", 8080) //
    );
    
    // Fetch url with proxy
    Document doc = Jsoup //
                   .connect("http://www.example.com/") //
                   .proxy(proxy) //
                   .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
                   .header("Content-Language", "en-US") //
                   .get();
    

    References:

    • Connection#proxy(String,int)
    • Connection#proxy(Proxy)
    • Proxy class
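The verbose variant above builds its address with InetSocketAddress.createUnresolved, which skips the DNS lookup when the Proxy object is created. A minimal sketch of that detail, using only java.net; the class name UnresolvedProxyDemo, the helper httpProxy and the host name are made up for illustration:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

final class UnresolvedProxyDemo {

    /** Builds an HTTP proxy descriptor without performing a DNS lookup. */
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Hypothetical host name; it is never resolved here.
        Proxy proxy = httpProxy("proxy.example.invalid", 8080);
        InetSocketAddress address = (InetSocketAddress) proxy.address();
        System.out.println(proxy.type() + " " + address.getHostString() + ":" + address.getPort()
                + " unresolved=" + address.isUnresolved());
    }
}
```

The resulting Proxy can be passed to Connection#proxy(Proxy); name resolution is then left to connection time.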
  • 2020-11-28 04:24
    System.setProperty("http.proxyHost", "192.168.5.1");
    System.setProperty("http.proxyPort", "1080");
    Document doc = Jsoup.connect("http://www.google.com").get(); // the URL needs a scheme
    

    This is the wrong solution: scraping is usually multithreaded, and we usually need to switch between proxies, but these system properties set a single proxy for every thread in the JVM. So for that use case it is better not to fetch through Jsoup.Connection with JVM-wide proxy settings.
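One way to avoid a single JVM-wide proxy is to pick a proxy per request. The following is only a sketch under assumptions: ProxyRotator is a hypothetical helper (not part of Jsoup), and the addresses are placeholders. It hands out java.net.Proxy objects round-robin and can be shared between threads:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: returns proxies in round-robin order; safe to share between threads. */
final class ProxyRotator {
    private final List<Proxy> proxies;
    private final AtomicInteger cursor = new AtomicInteger();

    ProxyRotator(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = List.copyOf(proxies);
    }

    /** Returns the next proxy; floorMod keeps the index valid even after the counter overflows. */
    Proxy next() {
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Placeholder addresses, for illustration only.
        ProxyRotator rotator = new ProxyRotator(List.of(http("10.0.0.1", 8080), http("10.0.0.2", 8080)));
        for (int i = 0; i < 4; i++) {
            System.out.println(rotator.next());
        }
    }
}
```

With Jsoup 1.9.1 or later, each worker thread could then call Jsoup.connect(url).proxy(rotator.next()).get(), so no global state is involved.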

  • 2020-11-28 04:25

    You can easily set a proxy for the whole JVM through system properties:

    System.setProperty("http.proxyHost", "192.168.5.1");
    System.setProperty("http.proxyPort", "1080");
    Document doc = Jsoup.connect("http://www.google.com").get(); // the URL needs a scheme
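To see what these properties actually do, you can ask the JVM's default ProxySelector which proxy it would choose for a given URL. This is a small sketch using only java.net; the class name GlobalProxyDemo and the helper effectiveProxy are made up for illustration, and it assumes the default ProxySelector has not been replaced:

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

final class GlobalProxyDemo {

    /** Returns the first proxy the default ProxySelector proposes for the given URL. */
    static Proxy effectiveProxy(String url) {
        List<Proxy> candidates = ProxySelector.getDefault().select(URI.create(url));
        return candidates.isEmpty() ? Proxy.NO_PROXY : candidates.get(0);
    }

    public static void main(String[] args) {
        // Same values as in the answer above; they apply to every thread in the JVM.
        System.setProperty("http.proxyHost", "192.168.5.1");
        System.setProperty("http.proxyPort", "1080");

        System.out.println(effectiveProxy("http://www.example.com/"));
        System.out.println(effectiveProxy("https://www.example.com/"));
    }
}
```

Note that http.proxyHost and http.proxyPort only cover http:// URLs; https:// URLs are controlled by https.proxyHost and https.proxyPort.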
    
  • 2020-11-28 04:33

    If your proxy requires authentication, you might like to add this before the program makes any requests:

    final String authUser = "USERNAME";
    final String authPassword = "PASSWORD";
    
    
    
    Authenticator.setDefault(
                   new Authenticator() {
                      public PasswordAuthentication getPasswordAuthentication() {
                         return new PasswordAuthentication(
                               authUser, authPassword.toCharArray());
                      }
                   }
                );
    
    ..
    
    System.setProperty("http.proxyHost", "192.168.5.1");
    System.setProperty("http.proxyPort", "1080");
    ..
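The default Authenticator above returns the same credentials for every challenge, including challenges from the target web server. A possible refinement, sketched here under assumptions (ProxyOnlyAuthenticator is a hypothetical class name, and USERNAME/PASSWORD are the placeholders from the answer), is to answer only when the requestor is a proxy:

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical Authenticator that supplies credentials to proxies only, never to origin servers. */
final class ProxyOnlyAuthenticator extends Authenticator {
    private final String user;
    private final char[] password;

    ProxyOnlyAuthenticator(String user, String password) {
        this.user = user;
        this.password = password.toCharArray();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Returning null tells the JVM that no credentials are available for this challenge.
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        return new PasswordAuthentication(user, password);
    }

    public static void main(String[] args) {
        Authenticator.setDefault(new ProxyOnlyAuthenticator("USERNAME", "PASSWORD"));
        System.setProperty("http.proxyHost", "192.168.5.1");
        System.setProperty("http.proxyPort", "1080");
        // ... fetch pages as usual after this point
    }
}
```

Like the original snippet, this is still a JVM-wide setting, so it applies to all threads.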
    
  • 2020-11-28 04:37

    Try this code instead:

    URL url = new URL("http://www.example.com/");
    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080)); // or whatever your proxy is
    
    HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);
    uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    uc.setRequestProperty("Content-Language", "en-US");
    uc.setRequestMethod("GET");
    uc.connect();
    
    // Jsoup.parse(InputStream, ...) also needs a charset (null = auto-detect) and a base URI
    Document doc = Jsoup.parse(uc.getInputStream(), null, url.toString());
    
  • 2020-11-28 04:38

    Jsoup has supported proxies since v1.9.1. The Connection interface has the following methods:

    • proxy(Proxy p)
    • proxy(String host, int port)

    You can use them like this:

    Jsoup.connect("...url...").proxy("127.0.0.1", 8080);
    

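    If you need authentication, you can use the Authenticator approach mentioned by @Navneet Swaminathan, or try setting system properties (these are not among the JDK's documented networking properties, so check that your setup actually honors them):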

    System.setProperty("http.proxyUser", "username");
    System.setProperty("http.proxyPassword", "password");
    

    or

    System.setProperty("https.proxyUser", "username");
    System.setProperty("https.proxyPassword", "password");
    