Question
I am working on a crawling project. I open a simple URLConnection to the website, as shown below:

    URLConnection conn = new URL(url).openConnection();
    BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
The method returns the HTML body correctly. However, the website makes inner requests for some fields. For example, it fetches the total number of users from a different web service. In a web browser the total number of users appears after a short delay, but URLConnection does not wait for it, so the returned HTML does not contain that field.
In Java, is there any way to wait for a while so that all the data is fetched from a website when using URLConnection?
Answer 1:
From your "inner requests" comment it sounds like the website is using JavaScript (via a framework or just using native browser APIs) to fetch data and render these results into the DOM. This is very common nowadays with SPAs etc.
If that's the case, no amount of waiting will change the outcome: a simple HTTP client like URLConnection only ever receives the initial HTML, not what JavaScript renders afterwards. You can check this by saving the HTML locally and viewing it in your browser: is the field missing from the raw markup, and is there JavaScript on that page?
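A minimal sketch of that check; the URL and the output file name are placeholders:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class SavePage {
        public static void main(String[] args) throws Exception {
            // Fetch the page exactly as URLConnection sees it.
            URLConnection conn = new URL("https://example.com").openConnection();
            try (InputStream in = conn.getInputStream()) {
                // Save the raw response, before any JavaScript has run.
                Files.copy(in, Paths.get("page.html"), StandardCopyOption.REPLACE_EXISTING);
            }
            // If the field is absent from page.html but visible in a live browser,
            // it is being rendered client-side by JavaScript.
        }
    }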
To do this properly in code, you'll need something capable of behaving more like a browser and executing the JavaScript referenced by the HTML in a DOM-like environment. Try Selenium driving headless Chrome or Firefox, or PhantomJS via GhostDriver.
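For example, here is a minimal sketch using Selenium 4 with headless Chrome; the URL and the element id "total-users" are assumptions for illustration, and chromedriver must be available on your PATH:

    import java.time.Duration;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import org.openqa.selenium.support.ui.ExpectedConditions;
    import org.openqa.selenium.support.ui.WebDriverWait;

    public class CrawlWithBrowser {
        public static void main(String[] args) {
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless=new"); // run without a visible window
            WebDriver driver = new ChromeDriver(options);
            try {
                driver.get("https://example.com");
                // Explicitly wait (up to 10s) for the JavaScript-rendered field.
                WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
                WebElement totalUsers = wait.until(
                        ExpectedConditions.visibilityOfElementLocated(By.id("total-users")));
                System.out.println("Total users: " + totalUsers.getText());
                // getPageSource() now returns the DOM after JS execution.
                String renderedHtml = driver.getPageSource();
            } finally {
                driver.quit();
            }
        }
    }

The explicit wait is the "wait for a while" the question asks about: it polls the DOM until the element populated by the inner request actually appears.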
Answer 2:
Normally, if you are getting the HTML body of the page, all calls made on the server side of this website must already have completed. Anything still missing from that body is being fetched afterwards by the client, not by the server.
Answer 3:
If the website does not rely on JavaScript, then use the Jsoup (https://jsoup.org) library for Java. It fetches the page and parses the HTML into a document you can query. Note that Jsoup does not execute JavaScript, so it only sees what the server returns in the initial response.
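A short sketch of that approach; "#total-users" is a hypothetical CSS selector standing in for whatever element holds the field:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class CrawlWithJsoup {
        public static void main(String[] args) throws Exception {
            // Jsoup fetches the URL and parses the returned HTML into a document.
            Document doc = Jsoup.connect("https://example.com").get();
            // select() takes a CSS selector; text() returns the element's text content.
            String totalUsers = doc.select("#total-users").text();
            System.out.println("Total users: " + totalUsers);
        }
    }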
Source: https://stackoverflow.com/questions/51621143/http-urlconnection-wait-for-inner-request