Question
I'm trying to mirror a website recursively, i.e. to download all pages of one site. All pages are in subfolders of a single folder, so I can easily mirror them with wget:
wget --mirror --recursive --page-requisites --adjust-extension --no-parent --convert-links https://www.example.com/
However, each page is mirrored before its JS scripts are executed, so the results of those scripts don't get mirrored. I need to capture them too, somehow, because the scripts change the page's DOM. Another option would be to wait for each page to finish loading and then mirror the loaded page (the task isn't time-critical).
I've already tried mirroring the site with PhantomJS, but I couldn't get recursion to work with it, or at least I couldn't find out how. I also took a closer look at the wget man page, but couldn't find any relevant options.
Is there any way to do this? Thanks in advance.
Answer 1:
wget doesn't execute any JavaScript. You might need to go through a rendering proxy like Splash. I've used Splash before with Scrapy spiders, but never with wget. Worth trying, though.
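As a rough illustration of the Splash idea, here is a minimal Python sketch that skips wget and crawls through Splash's `render.html` HTTP endpoint, which returns the page's HTML after JavaScript has run. It assumes a Splash instance is already listening on `localhost:8050`; the helper names (`splash_url`, `fetch_rendered`, `extract_links`, `crawl`), the 2-second wait, and the 100-page limit are illustrative choices, not part of Splash or wget. It only follows `<a href>` links below the start URL, mimicking `--no-parent`, and keeps the rendered HTML in memory.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlencode, urljoin
from urllib.request import urlopen

# Assumption: a local Splash instance on its default port.
SPLASH = "http://localhost:8050/render.html"


def splash_url(page_url, wait=2.0):
    """Build a render.html request; `wait` is the time given to JS, in seconds."""
    return SPLASH + "?" + urlencode({"url": page_url, "wait": wait})


def fetch_rendered(page_url):
    """Return the JS-rendered HTML of page_url as produced by Splash."""
    with urlopen(splash_url(page_url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url, prefix):
    """Return absolute, fragment-free links in html that start with prefix."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.links:
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute.startswith(prefix) and absolute not in found:
            found.append(absolute)
    return found


def crawl(start_url, fetch=fetch_rendered, limit=100):
    """Breadth-first crawl below start_url; returns {url: rendered_html}."""
    pages = {}
    queue = [start_url]
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in pages:
            continue
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url, start_url):
            if link not in pages and link not in queue:
                queue.append(link)
    return pages
```

Calling `crawl("https://www.example.com/")` would then return a dictionary of rendered pages. Writing them to disk, downloading page requisites, and rewriting links for offline use (what `--page-requisites` and `--convert-links` do in wget) are left out of this sketch.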
Source: https://stackoverflow.com/questions/50629447/mirroring-a-webpage-recursively-after-js-execution