How do you Programmatically Download a Webpage in Java

无人共我 2020-11-22 11:20

I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how could I handle various types of compr

11 Answers
  • 2020-11-22 11:55

    I used the accepted answer to this post (url) and wrote the output to a file.

    package test;
    
    import java.net.*;
    import java.io.*;
    
    public class PDFTest {
        public static void main(String[] args) {
            String fileName = "D:\\a_01\\output.txt";
    
            // try-with-resources closes both streams even on error;
            // closing the writer also flushes the buffered output to the file
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new URL("http://www.fetagracollege.org").openStream()));
                 PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
                String inputLine;
    
                while ((inputLine = in.readLine()) != null) {
                    System.out.println(inputLine);
                    writer.println(inputLine);
                }
            } catch (IOException e) {
                // report the failure instead of silently swallowing it
                e.printStackTrace();
            }
        }
    }
    
  • 2020-11-22 11:56

    On a Unix/Linux box you could just run 'wget', but that is not really an option if you're writing a cross-platform client. It also assumes that you don't want to do much with the data between downloading it and writing it to disk.
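
    If the goal really is just "URL in, file out", the same thing can be done portably from Java. The following is a minimal sketch, not a wget replacement; the class name `SaveToDisk` and its method names are made up for illustration, and it does no retries, redirect handling, or decompression.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: streams a page straight to disk without buffering it all in memory.
public class SaveToDisk {

    // Copies everything from the stream into the target file and returns the byte count.
    public static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Opens the URL and saves the raw response body to the target file.
    public static long download(String pageUrl, Path target) throws IOException {
        try (InputStream in = URI.create(pageUrl).toURL().openStream()) {
            return save(in, target);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SaveToDisk <url> <output-file>");
            return;
        }
        long bytes = download(args[0], Paths.get(args[1]));
        System.out.println(bytes + " bytes written to " + args[1]);
    }
}
```

    Because the bytes are copied unchanged, the file keeps whatever character encoding the server sent.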

  • 2020-11-22 12:00

    Bill's answer is very good, but you may want to do some things with the request, like handling compression or setting a user agent. The following code shows how you can add support for various types of compression to your requests.

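    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.Inflater;
    import java.util.zip.InflaterInputStream;
    
    URL url = new URL(urlStr);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail for http/https URLs
    // follow redirects on this connection (the static setFollowRedirects
    // only affects connections created after it is called)
    conn.setInstanceFollowRedirects(true);
    // allow both GZip and Deflate encodings
    conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
    String encoding = conn.getContentEncoding(); // this sends the request
    InputStream inStr = null;
    
    // create the appropriate stream wrapper based on
    // the encoding type
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        inStr = new GZIPInputStream(conn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        // Inflater(true) expects raw deflate data with no zlib header
        inStr = new InflaterInputStream(conn.getInputStream(),
          new Inflater(true));
    } else {
        inStr = conn.getInputStream();
    }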
    

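    To also set the user agent, add the following line before the call to getContentEncoding(), because request headers can no longer be changed once the connection has been opened:

    conn.setRequestProperty("User-Agent", "my agent name");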
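
    Putting the pieces together, here is a rough, self-contained sketch of the same idea that returns the page as a String. The class name `EncodedFetch` and its methods are hypothetical, decoding the body as UTF-8 is an assumption, and the zlib-header check for "deflate" is a heuristic I added because some servers send zlib-wrapped data and others send raw deflate. It needs Java 9+ for `readAllBytes` and `readNBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: fetches a page and undoes gzip/deflate content encoding.
public class EncodedFetch {

    // Wraps the raw body stream according to the Content-Encoding header value.
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null || contentEncoding.equalsIgnoreCase("identity")) {
            return raw;
        }
        if (contentEncoding.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        if (contentEncoding.equalsIgnoreCase("deflate")) {
            // Peek at the first two bytes to guess whether a zlib header is present.
            PushbackInputStream in = new PushbackInputStream(raw, 2);
            byte[] head = new byte[2];
            int n = in.readNBytes(head, 0, 2);
            in.unread(head, 0, n);
            boolean zlib = n == 2
                    && (head[0] & 0x0F) == 8
                    && (((head[0] & 0xFF) << 8) | (head[1] & 0xFF)) % 31 == 0;
            // nowrap=true means "raw deflate, no zlib header"
            return new InflaterInputStream(in, new Inflater(!zlib));
        }
        throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
    }

    // Requests the page with compression enabled and returns the decoded text.
    public static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        conn.setRequestProperty("User-Agent", "my agent name");
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            // Assumption: the page is UTF-8; a real client should check Content-Type.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EncodedFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

    Keeping `decode` separate from the network code means the decompression logic can be exercised with in-memory streams.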
    
  • 2020-11-22 12:02

    Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions, or passing them up the call stack, though.

    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;
    
        try {
            url = new URL("http://stackoverflow.com/");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));
    
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
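
    Since the question asks for the page as a String, here is a hedged variation on the code above that collects the body instead of printing it. `InputStreamReader` without a charset uses the platform default, so this sketch tries to take the charset from the Content-Type header. The class name `PageToString` and its helpers are invented for illustration, the UTF-8 fallback is an assumption, and `readAllBytes` needs Java 9+.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads a page into a String using the declared charset.
public class PageToString {

    // Extracts the charset parameter from a header such as "text/html; charset=ISO-8859-1".
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // unknown or malformed charset name: fall through to the fallback
                    }
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream and decodes it with the given charset.
    public static String read(InputStream in, Charset charset) throws IOException {
        return new String(in.readAllBytes(), charset);
    }

    // Opens the URL and returns the body as text.
    public static String readPage(String pageUrl) throws IOException {
        URLConnection conn = URI.create(pageUrl).toURL().openConnection();
        // Assumption: fall back to UTF-8 when the server declares no charset.
        Charset charset = charsetOf(conn.getContentType(), StandardCharsets.UTF_8);
        try (InputStream is = conn.getInputStream()) {
            return read(is, charset);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageToString <url>");
            return;
        }
        System.out.println(readPage(args[0]));
    }
}
```

    This ignores any charset declared inside the HTML itself (for example in a meta tag), which a more careful client would also check.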
    
  • 2020-11-22 12:02

    Well, you could go with the built-in classes such as URL and URLConnection, but they don't give you very much control.

    Personally I'd go with the Apache HttpClient library.
    Edit: HttpClient has been set to end of life by Apache. The replacement is Apache HttpComponents.
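
    To give an idea of the kind of control meant here (timeouts, redirect policy, request headers), below is a small sketch. Note that it does not use the Apache library: it uses `java.net.http.HttpClient`, which ships with the JDK from Java 11 onward, so it runs without extra dependencies. The class name `JdkHttpClientFetch`, the timeout values, and the user-agent string are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper built on the JDK's own HttpClient (Java 11+), not Apache HttpComponents.
public class JdkHttpClientFetch {

    // Fetches the page and returns the body as text, failing on non-200 responses.
    public static String fetch(String pageUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow redirects, except https -> http
                .connectTimeout(Duration.ofSeconds(10))      // illustrative value
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create(pageUrl))
                .timeout(Duration.ofSeconds(30))             // illustrative value
                .header("User-Agent", "my agent name")
                .GET()
                .build();

        // ofString() decodes using the charset from Content-Type, defaulting to UTF-8.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java JdkHttpClientFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

    Unlike a browser, this client does not request or undo gzip compression on its own, so the body arrives exactly as the server sends it for a request without an Accept-Encoding header.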
