Get html file Java

夕颜 2021-01-23 21:39

Duplicate:

How do you Programmatically Download a Webpage in Java?

How to fetch html in Java

I'm developing an application th

5 Answers
  • 2021-01-23 22:23

    You can use java.net.URL and then open an input stream to read the HTML from the server. See the example here.
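
    A minimal sketch of that approach, assuming the page is UTF-8 encoded (the real encoding may differ). The class name `FetchHtml`, the method name `fetch`, and the example.com address are placeholders chosen for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class FetchHtml {

  // Reads the whole resource behind the URL into a String, line by line.
  // UTF-8 is an assumption; a real page may declare another charset.
  static String fetch(URL url) throws IOException {
    StringBuilder html = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        html.append(line).append('\n');
      }
    }
    return html.toString();
  }

  public static void main(String[] args) throws IOException {
    // Placeholder address; substitute the page you want to fetch.
    System.out.println(fetch(new URL("https://example.com/")));
  }
}
```

    The try-with-resources block closes the stream even if reading fails partway through.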

  • 2021-01-23 22:29

    URLConnection is fine for simple cases. When redirects are involved, you are better off using Apache's HttpClient.
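
    For context: HttpURLConnection follows redirects by default, but not when the redirect switches protocol (for example from http to https). The sketch below does not use Apache's library; it swaps in the JDK's own `java.net.http.HttpClient` (Java 11 and later) to show the same idea of a client that follows redirects. The class name, the timeout value, and the example.com address are assumptions made for illustration:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class FetchWithRedirects {

  // Fetches the body at the given URI, following redirects along the way.
  static String fetch(URI uri) throws IOException, InterruptedException {
    HttpClient client = HttpClient.newBuilder()
        // NORMAL follows redirects, except from HTTPS down to HTTP.
        .followRedirects(HttpClient.Redirect.NORMAL)
        .connectTimeout(Duration.ofSeconds(10)) // arbitrary value
        .build();
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new IOException("Server returned " + response.statusCode());
    }
    return response.body();
  }

  public static void main(String[] args) throws Exception {
    // Placeholder address; substitute the page you want to fetch.
    System.out.println(fetch(URI.create("https://example.com/")));
  }
}
```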

  • 2021-01-23 22:35

    This code downloads data from a URL, treating it as binary content:

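    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;

    public class Download {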
    
      private static void download(URL input, File output)
          throws IOException {
        InputStream in = input.openStream();
        try {
          OutputStream out = new FileOutputStream(output);
          try {
            copy(in, out);
          } finally {
            out.close();
          }
        } finally {
          in.close();
        }
      }
    
      private static void copy(InputStream in, OutputStream out)
          throws IOException {
        byte[] buffer = new byte[1024];
        while (true) {
          int readCount = in.read(buffer);
          if (readCount == -1) {
            break;
          }
          out.write(buffer, 0, readCount);
        }
      }
    
      public static void main(String[] args) {
        try {
          URL url = new URL("http://stackoverflow.com");
          File file = new File("data");
          download(url, file);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    
    }
    

    The downside of this approach is that it ignores any metadata, such as the Content-Type header, which you would get by using HttpURLConnection (or a more sophisticated API, like the Apache one).
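
    As a rough illustration of using that metadata, the sketch below reads the Content-Type header from an HttpURLConnection and decodes the body with the charset it names. The class and method names, the UTF-8 fallback, and the example.com address are assumptions, not part of the code above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class FetchWithContentType {

  // Extracts the charset parameter from a Content-Type value such as
  // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
  // when the header is missing, has no charset, or names an unknown one.
  static Charset charsetOf(String contentType) {
    if (contentType != null) {
      for (String part : contentType.split(";")) {
        String p = part.trim();
        if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
          String name = p.substring("charset=".length()).replace("\"", "").trim();
          try {
            return Charset.forName(name);
          } catch (IllegalArgumentException e) {
            break; // malformed or unsupported name: use the fallback
          }
        }
      }
    }
    return StandardCharsets.UTF_8;
  }

  // Works for http/https URLs only, because of the HttpURLConnection cast.
  static String fetch(URL url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (InputStream in = conn.getInputStream()) {
      String contentType = conn.getContentType(); // may be null
      return new String(in.readAllBytes(), charsetOf(contentType));
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    // Placeholder address; substitute the page you want to fetch.
    System.out.println(fetch(new URL("https://example.com/")));
  }
}
```

    Note that `InputStream.readAllBytes()` requires Java 9 or later; on older versions a manual copy loop does the same job.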

    To parse the HTML data, you'll need either a specialized HTML parser that can handle poorly formed markup, or a tool that tidies the markup first so that it can be parsed with an XML parser.

  • 2021-01-23 22:36

    Funnily enough, I wrote a utility method that does just that the other week:

    /**
     * Retrieves the file specified by <code>fileURL</code> and writes it to
     * <code>out</code>.
     * <p>
     * Does not close <code>out</code>, but does flush.
     * @param fileURL The URL of the file.
     * @param out An output stream to capture the contents of the file
     * @param batchWriteSize The number of bytes to write to <code>out</code>
     *                       at once (larger files than this will be written
     *                       in several batches)
     * @throws IOException If call to web server fails
     * @throws FileNotFoundException If the call to the web server does not
     *                               return status code 200. 
     */
    public static void getFileStream(String fileURL, OutputStream out, int batchWriteSize)
                                throws IOException{
        GetMethod get = new GetMethod(fileURL);
        HttpClient client = new HttpClient();
        HttpClientParams params = client.getParams();
        params.setSoTimeout(2000);
        client.setParams(params);
        try {
            client.executeMethod(get);
        } catch(ConnectException e){
            // Add some context to the exception and rethrow
            throw new IOException("ConnectException trying to GET " +
                    fileURL,e);
        }
    
        if(get.getStatusCode()!=200){
            throw new FileNotFoundException(
                    "Server returned " + get.getStatusCode());
        }
    
        // Get the input stream
        BufferedInputStream bis = 
            new BufferedInputStream(get.getResponseBodyAsStream());
    
        // Read the file and stream it out
        byte[] b = new byte[batchWriteSize];
        int bytesRead = bis.read(b,0,batchWriteSize);
        long bytesTotal = 0;
        while(bytesRead!=-1) {
            bytesTotal += bytesRead;
            out.write(b, 0, bytesRead);
            bytesRead = bis.read(b,0,batchWriteSize);
        } 
        bis.close(); // Release the input stream.
        out.flush();        
    }
    

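    This uses the Apache Commons HttpClient library (3.x) and needs the following imports:

    import java.io.BufferedInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.ConnectException;

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.apache.commons.httpclient.params.HttpClientParams;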
    
  • 2021-01-23 22:40

    You could just use a URLConnection. See this Java Tutorial from Sun.
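
    A minimal sketch of the URLConnection approach, setting timeouts and a request header, which a bare `URL.openStream()` call does not let you configure. The class name, the timeout values, the Accept header, the UTF-8 decoding, and the example.com address are all assumptions made for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class FetchWithUrlConnection {

  static String fetch(URL url) throws IOException {
    URLConnection conn = url.openConnection();
    conn.setConnectTimeout(5000); // milliseconds; arbitrary value
    conn.setReadTimeout(5000);    // milliseconds; arbitrary value
    conn.setRequestProperty("Accept", "text/html");

    // Copy the response body into memory in fixed-size chunks.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream in = conn.getInputStream()) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
    }
    // UTF-8 is an assumption; the server may use another encoding.
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // Placeholder address; substitute the page you want to fetch.
    System.out.println(fetch(new URL("https://example.com/")));
  }
}
```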
