How do I completely mirror a web page?

Submitted by 怎甘沉沦 on 2019-12-03 04:04:36

If you're just looking to run a command and get a copy of a web site, use the tools that others have suggested, such as wget, curl, or some of the GUI tools. I use my own personal tool that I call webreaper (not the Windows WebReaper, though). There are a few Perl programs I know about, including webmirror and a few others you can find on CPAN.

If you're looking to do this inside a Perl program you are writing (since you have the "perl" tag on your question), there are many tools in CPAN that can help you at each step:

Good luck, :)

For an HTML-ized version of your sites you could use WinHTTrack, a free, open source program released under the GPL. It will pull down pre-rendered versions of your pages, graphics, documents, zip files, movies, etc. Of course, since this is a mirrored copy, any dynamic backend code such as database calls won't be dynamic anymore.

http://www.httrack.com/

Brian

Personally, the last time I had the urge to do this, I wrote a Python script which made a copy of my browser cache, then manually visited all the pages I wished to mirror. A very ugly solution, but it has the nice advantage of not triggering any "don't scrape my page" alarms. Thanks to Opera's links tab bar, "manually" downloading tens of thousands of pages wasn't nearly as hard as you'd think.

I'll echo the "it's not clear" comment. Are these web pages/sites that you've created, and you want to deploy them on multiple servers? If so, use relative references in your HTML, and you should be OK. Or, use a <base> element in your <head> and adjust it on each site. But relativity is really the way to go.
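To make the point about relative references concrete, here is a minimal Python sketch of how a browser-style resolver treats relative and absolute references when the same page is served from two hosts. The host names and the `resolve` helper are hypothetical and exist only for illustration; `urljoin` is from Python's standard library.

```python
from urllib.parse import urljoin

# Hypothetical mirror locations of the same page (illustration only).
MIRRORS = [
    "https://mirror-a.example/docs/index.html",
    "https://mirror-b.example/site/docs/index.html",
]

def resolve(page_url, href):
    """Resolve an href against the URL of the page that contains it."""
    return urljoin(page_url, href)

for page in MIRRORS:
    # A relative reference follows the page to whichever host serves it.
    print(resolve(page, "images/logo.png"))
    # An absolute reference keeps pointing at the original host.
    print(resolve(page, "https://original.example/images/logo.png"))
```

A relative reference resolves under whichever mirror serves the page, while the absolute one still points at the original host, which is why absolute links tend to break a deployment across several servers.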

Or, are you saying that you'd like to download websites (like the Stack Overflow homepage, perl.com, etc.) to have local copies on your computer? I'll agree with Daniel - use wget.

Jim

phi

I use WebReaper

WisdomFusion

You can use GNU wget to grab an entire site like this:

wget -r -p -np -k URL

Or, if you use Perl, try these modules:

  • LWP::Simple

  • WWW::Mechanize
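In the wget command above, -r recurses through links, -p fetches page requisites such as images, -np refuses to climb above the starting directory, and -k rewrites links for local viewing. As a rough illustration of the link-discovery part only, here is a minimal sketch in Python's standard library rather than the Perl modules listed above. It assumes the HTML has already been fetched; `LinkCollector`, `in_scope`, `links_to_follow`, and the sample URLs are all hypothetical names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(urljoin(self.page_url, value))

def in_scope(url, start_url):
    """True if url is on the same host and at or below the start directory."""
    u, s = urlparse(url), urlparse(start_url)
    start_dir = s.path.rsplit("/", 1)[0] + "/"
    return u.netloc == s.netloc and u.path.startswith(start_dir)

def links_to_follow(html, page_url, start_url):
    """Return the links on a page that a no-parent crawl would keep."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return [u for u in collector.urls if in_scope(u, start_url)]

# Made-up page content and start URL, for illustration only.
SAMPLE = """
<a href="page2.html">next</a>
<img src="img/logo.png">
<a href="../private/x.html">up</a>
<a href="https://other.example/">elsewhere</a>
"""
START = "https://site.example/docs/index.html"

for url in links_to_follow(SAMPLE, START, START):
    print(url)
```

A real mirroring tool would also fetch each URL, track which pages it has already visited, and rewrite links in the saved files. The sketch leaves all of that out.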

If wget is complicated or you don't have a Linux box, you could always use WebZip.

It sounds like you want the caching functionality provided by a good proxy server.

Maybe look into something like Squid? Pretty sure it can do it.

This is more of a sysadmin type question than programming though.

In most modern websites the front end only tells a small part of the story. Whatever tools you use to pull down the HTML, CSS, and JavaScript, you will still be missing the core functionality that lives on the server.

Or maybe you were meaning something else.
