I need to archive complete pages including any linked images etc. on my linux server. Looking for the best solution. Is there a way to save all assets and then relink them all t
If all the content on the site were static, you could get around this issue with something like wget:
$ wget -r -l 10 -p http://my.web.page.com/
or some variation thereof.
Since you also have dynamic pages, you cannot in general archive such a site using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.
EDIT:
As a workaround, you could modify your site so that a suitably privileged user can download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.
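The server-side copy described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `backup_docroot` and `backup_database`, the example paths, and the choice of a MySQL/MariaDB backend (hence `mysqldump`) are not from the original question, so adapt them to whatever your site actually runs.

```shell
#!/bin/sh
# Hypothetical helper: pack the HTTP document root into a dated tarball.
# Prints the path of the archive it created.
backup_docroot() {
    docroot=$1    # e.g. /var/www/html (assumed path)
    outdir=$2     # where the archive should be written
    stamp=$(date +%Y%m%d)
    mkdir -p "$outdir" || return 1
    # -C switches to the parent directory so the archive stores relative paths
    tar -czf "$outdir/docroot-$stamp.tar.gz" \
        -C "$(dirname "$docroot")" "$(basename "$docroot")" || return 1
    echo "$outdir/docroot-$stamp.tar.gz"
}

# Hypothetical helper: text-mode SQL dump, ASSUMING a MySQL/MariaDB backend.
# Credentials are expected in ~/.my.cnf rather than on the command line.
backup_database() {
    dbname=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    mysqldump --single-transaction "$dbname" > "$outdir/$dbname.sql"
}
```

A call might look like `backup_docroot /var/www/html /srv/backups` followed by `backup_database mysite /srv/backups`, where both the paths and the database name are placeholders. Keep the output directory outside the document root so the archives are not themselves served over HTTP.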
If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing up the whole site. If you run an actual server, there are many backup solutions that you could install, including a few Web-based ones for hosted sites.
wget can do that, for example:
wget -r http://example.com/
This will mirror the whole example.com site.
Some interesting options are:
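-Dexample.com : do not follow links to other domains
--html-extension : saves pages with a text/html content type under a .html name (newer wget releases call this option --adjust-extension; the short form -E works in both)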
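One detail worth knowing when combining these options: according to the wget manual, `-D` does not by itself let wget leave the starting host, so it is normally paired with `-H`. The sketch below shows one way to put them together; the function name `mirror_domain` and the `DRY_RUN` switch are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: mirror a domain, including hosts under that domain.
# With DRY_RUN=1 it prints the command instead of running it.
mirror_domain() {
    domain=$1

    # -r           recurse through links
    # -H           allow wget to leave the starting host ...
    # -D"$domain"  ... but only for hosts that belong to this domain
    # -E           save text/html pages with an .html suffix
    set -- wget -r -H -D"$domain" -E "http://$domain/"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 mirror_domain example.com` first lets you inspect the command before any download starts.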
Manual: http://www.gnu.org/software/wget/manual/
wget -r http://yoursite.com
This should be sufficient and will also grab images and media. There are plenty of other options you can feed it.
Note: I believe older versions of wget (before 1.12) do not download images that are referenced only from CSS, so with those versions you may need to fetch such images manually.
Some useful arguments are described here: http://www.linuxjournal.com/content/downloading-entire-web-site-wget
Use the following command:
wget -E -k -p http://yoursite.com
Use -E to adjust extensions. Use -k to convert links so that the page loads from your local copy. Use -p to download all objects inside the page.
Please note that this command does not download other pages hyperlinked from the specified page. It only downloads the objects required to display the specified page properly.
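If the goal is an offline copy of the whole site rather than a single page, the same flags can be combined with recursion. A minimal sketch, assuming GNU wget; the function name `archive_site`, the `DRY_RUN` switch, the depth of 10 and the `./archive` default directory are illustrative choices, not anything prescribed by wget.

```shell
#!/bin/sh
# Hypothetical helper: mirror a site into a local directory for offline viewing.
# With DRY_RUN=1 it only prints the command instead of contacting the server.
archive_site() {
    url=$1
    dest=${2:-./archive}

    # -r   follow links recursively      -l 10  limit the recursion depth
    # -p   fetch images, CSS, scripts    -k     rewrite links to the local copies
    # -E   add .html where needed        -np    never ascend above the start URL
    # -P   directory to save into
    set -- wget -r -l 10 -p -k -E -np -P "$dest" "$url"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `archive_site http://yoursite.com /srv/archive` would save the relinked copy under `/srv/archive`. This still captures only what the server sends to a plain HTTP client, so dynamic content is stored as it looked at download time.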