I want to implement a Java method that takes a URL as input and stores the entire web page, including CSS, images, and JS (all related resources), on my disk. I have used Jsoup html
This GitHub project does this, using Jsoup. No need to write it again if it already exists!
EDIT: I made an improved version of this class and added new features:
It can:
Extract URLs from linked or inline CSS, e.g. for background images, and download and save those too.
Download all the files (images, scripts, etc.) using multiple threads.
Give details about progress and errors.
Fetch HTML frames embedded in the HTML document, including nested frames.
Some caveats:
Uses Jsoup and OkHttp, so you need to have those libraries on your classpath.
GPL licenced, for now anyway.
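To give an idea of what the CSS step in the feature list involves, here is a minimal sketch using only the standard library. It is not the code from the project above; the class name CssUrlExtractor and the regular expression are my own assumptions, and the pattern only handles plain url(...) references, not @import with a bare string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or of the linked project.
class CssUrlExtractor {
    // Matches url(...) with optional single or double quotes around the target.
    private static final Pattern CSS_URL =
            Pattern.compile("\\burl\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    /** Returns the raw url(...) targets found in a stylesheet, in order of appearance. */
    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            // Skip empty targets and inline data: URIs, which need no download.
            if (!target.isEmpty() && !target.startsWith("data:")) {
                urls.add(target);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background: url('img/bg.png'); }\n"
                   + ".logo { background-image: url(\"/static/logo.svg\"); }\n"
                   + ".dot { background: url(data:image/png;base64,AAAA); }";
        System.out.println(extractUrls(css)); // prints [img/bg.png, /static/logo.svg]
    }
}
```

The returned strings are still relative to the stylesheet's own URL, so they need to be resolved against it before downloading.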
Basically, you can do it with Jsoup:
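import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page (get() throws IOException)
Document doc = Jsoup.connect("http://rabotalux.com.ua/vacancy/4f4f800c8bc1597dc6fc7aff").get();
// Only select elements that actually carry a URL, so inline scripts are skipped
Elements links = doc.select("link[href]");
Elements scripts = doc.select("script[src]");
for (Element element : links) {
    System.out.println(element.absUrl("href")); // absolute stylesheet/icon URL
}
for (Element element : scripts) {
    System.out.println(element.absUrl("src")); // absolute script URL
}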
And so on with images and all related resources.
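Once you have the absolute URLs, you still need to write each resource to disk. Here is a rough sketch using only the standard library. The class name ResourceSaver, the "host/path" directory layout, and the index.html fallback are my own assumptions, and it expects absolute http(s) URLs with a host.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the layout outputDir/host/path is just one possible choice.
class ResourceSaver {
    /** Maps a resource URL to a file under outputDir, mirroring host and path. */
    static Path localPathFor(URI resource, Path outputDir) {
        // The query string is ignored here, so "main.css?v=2" maps to "main.css".
        String path = resource.getPath() == null ? "" : resource.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path += "index.html"; // directory-style URLs get a default file name
        }
        path = path.replaceFirst("^/+", ""); // make it relative to the host folder
        Path root = outputDir.toAbsolutePath().normalize();
        Path target = root.resolve(resource.getHost()).resolve(path).normalize();
        // Refuse paths whose ".." segments would leave the output directory.
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Path escapes output directory: " + resource);
        }
        return target;
    }

    /** Downloads one resource and returns the file it was written to. */
    static Path download(URI resource, Path outputDir) throws IOException {
        Path target = localPathFor(resource, outputDir);
        Files.createDirectories(target.getParent());
        try (InputStream in = resource.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) {
        // Only shows the mapping; call download(...) to actually fetch the file.
        URI css = URI.create("http://example.com/css/main.css?v=2");
        System.out.println(localPathFor(css, Paths.get("out")));
    }
}
```

To make the saved page work offline you would also have to rewrite the href/src attributes in the HTML to point at these local paths, which this sketch does not do.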
BUT if your site creates some elements with JavaScript, Jsoup will miss them, as it can't execute JavaScript.
I ran into a similar problem a couple of years ago, and we used exactly the mechanism you are planning: parse the HTML content, convert relative paths to absolute paths, and use multiple threads to retrieve images, JavaScript, etc. concurrently for better performance. I don't know whether that is how it should be done, but in the end it worked for us. :-)
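For what it's worth, the two steps described above (relative to absolute, then parallel retrieval) can be sketched roughly like this with the standard library. This is not our original code; the class ParallelFetcher, its method names, and the pluggable fetch function are assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of "resolve, then fetch in parallel".
class ParallelFetcher {
    /** Resolves each (possibly relative) reference against the page URL. */
    static List<URI> toAbsolute(URI pageUrl, List<String> references) {
        List<URI> absolute = new ArrayList<>();
        for (String ref : references) {
            absolute.add(pageUrl.resolve(ref));
        }
        return absolute;
    }

    /** Runs fetch for every URL on a fixed-size pool; maps URL to result or error text. */
    static Map<URI, String> fetchAll(List<URI> urls, Function<URI, String> fetch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the downloads overlap.
            Map<URI, Future<String>> pending = new LinkedHashMap<>();
            for (URI url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                pending.put(url, pool.submit(task));
            }
            // Then collect results; one failed resource does not abort the rest.
            Map<URI, String> results = new LinkedHashMap<>();
            for (Map.Entry<URI, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException ex) {
                    results.put(entry.getKey(), "FAILED: " + ex.getCause());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    results.put(entry.getKey(), "FAILED: interrupted");
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/jobs/list.html");
        List<URI> urls = toAbsolute(page, Arrays.asList("../img/a.png", "/js/app.js", "css/main.css"));
        // Stub fetch function so the example runs without network access.
        Map<URI, String> results = fetchAll(urls, url -> "fetched " + url, 4);
        results.forEach((url, status) -> System.out.println(url + " -> " + status));
    }
}
```

In real use the fetch function would download the resource and return something like the saved file name. A small fixed pool (4 to 8 threads) is usually enough and avoids hammering the server.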