Mirroring a webpage recursively after JS execution

Submitted by 末鹿安然 on 2020-01-14 06:12:17

Question


I'm trying to mirror a website recursively, i.e. download every page on the site. All pages live in subfolders of a single folder, so I can easily mirror them with wget:

wget --mirror --recursive --page-requisites --adjust-extension --no-parent --convert-links https://www.example.com/

However, each page is mirrored before its JS scripts are executed, so the effects of those scripts are not captured. I need to capture them too, somehow, because they change the page's DOM. Another option would be to wait for each page to finish loading and then mirror the loaded page (the task isn't time-critical).

I've already tried mirroring the site with PhantomJS, but I couldn't find a way to make PhantomJS recurse. I also took a closer look at the wget man page, but couldn't find any relevant options.

Is there a way to do this? Thanks in advance.


Answer 1:


wget doesn't execute any JavaScript. You might need to go through a rendering proxy like Splash. I've used Splash before with Scrapy spiders, but never with wget. Worth trying, though.
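
If wget can't be made to work through Splash, the recursion can also be done by hand. Below is a minimal Python sketch of that idea, not a tested replacement for wget: it asks Splash's render.html endpoint for each page's HTML after JavaScript has run, extracts the links, and follows only those under the start URL (roughly what --no-parent does). It assumes a Splash instance is already running on localhost:8050; the function names (splash_render_url, fetch_rendered, crawl, and so on) and the 2-second wait are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urldefrag, urlparse
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its default port.
SPLASH = "http://localhost:8050"


def splash_render_url(page_url, splash=SPLASH, wait=2.0):
    """Build a Splash render.html request; 'wait' gives scripts time to run."""
    return f"{splash}/render.html?" + urlencode({"url": page_url, "wait": wait})


def fetch_rendered(page_url):
    """Fetch the JavaScript-rendered HTML of page_url through Splash."""
    with urlopen(splash_render_url(page_url), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute link targets with any #fragment removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def in_scope(url, root):
    """Rough equivalent of --no-parent: same host, path below the root path."""
    u, r = urlparse(url), urlparse(root)
    return (
        u.scheme in ("http", "https")
        and u.netloc == r.netloc
        and (u.path or "/").startswith(r.path or "/")
    )


def crawl(root, fetch=fetch_rendered, max_pages=1000):
    """Breadth-first crawl from root; returns {url: rendered_html}."""
    pages, queue, seen = {}, deque([root]), {root}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen and in_scope(link, root):
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("https://www.example.com/") would return a dict mapping each visited URL to its rendered HTML. Writing those pages to disk, downloading page requisites (images, CSS) and rewriting links for offline viewing are left out of this sketch; wget's --page-requisites and --convert-links do that work in the original command.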



Source: https://stackoverflow.com/questions/50629447/mirroring-a-webpage-recursively-after-js-execution
