I'd like to mirror a simple password-protected web portal to some data that I'd like to keep mirrored and up-to-date. Essentially this website is just a directory listing.
The --reject 'pattern' parameter actually worked for me with wget 1.14. For example:
wget --reject rpm http://somerpmmirror.org/site/
None of the *.rpm files were downloaded; only the indexes were.
Warning: file patterns can be unintentionally expanded by the shell if they match a file in the working directory. Use quotes to avoid that:
touch blahblah.rpm
# working
wget -R '*.rpm' ....
# working
wget -R "*.rpm" ....
# not working
wget -R *.rpm ....
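To see why the quotes matter, here is a minimal demonstration in an empty temporary directory: it is the shell, not wget, that expands an unquoted pattern.

```shell
# Reproduce the warning above: a file in the working directory
# matches the pattern.
cd "$(mktemp -d)"
touch blahblah.rpm

# Unquoted: the shell expands the glob before wget ever sees it.
echo wget -R *.rpm      # → wget -R blahblah.rpm

# Quoted: wget receives the literal pattern.
echo wget -R '*.rpm'    # → wget -R *.rpm
```

(echo stands in for wget here, so the demonstration only shows what argument list the command would receive.)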
wget -X directory_to_exclude[,other_directory_to_exclude] -r ftp://URL_ftp_server
SERVER
|-logs
|-etc
|-cache
|-public_html
| |-images
| |-videos    (want to exclude)
| |-files
| |-audio     (want to exclude)
wget -X /public_html/videos,/public_html/audio ftp://SERVER/public_html/*
Not possible with wget: http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html
Well, I am not sure about newer versions, though.
Regarding the 401 code: no state is kept (cookies are not used for HTTP authentication), so the username and password must be sent with every request. wget first tries each request without the credentials, then resorts to them after the challenge.
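If that extra unauthenticated round trip per request is a concern, wget's --auth-no-challenge option sends the credentials on the first request instead of waiting for the 401 challenge. This assumes the server uses Basic authentication; the host and credentials below are placeholders.

```shell
# Send HTTP Basic credentials preemptively rather than waiting for a
# 401 challenge. Only do this for servers you trust, since the
# credentials go out before the server has identified itself.
wget --auth-no-challenge --http-user=x --http-password=x \
    --mirror --no-parent http://my.server.org/folder
```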
Pavuk (http://www.pavuk.org) looked like a promising alternative that lets you mirror websites, excluding files based on URL patterns and filename extensions... but pavuk 0.9.35 segfaults/dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).
FYI, here's how I was using it:
pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-`date`.log
In the end, wget --exclude-directories did the trick:
wget --mirror --continue --progress=dot:mega --no-parent \
  --no-host-directories --cut-dirs=1 \
  --http-user x --http-password x \
  --exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
  --directory-prefix /path/to/local/mirror \
  http://my.server.org/folder
Since the --exclude-directories wildcards don't span '/', you need to form the patterns quite specifically to avoid downloading entire folders.
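Concretely, because '*' stops at each '/', excluding the same directory name at more than one depth takes one pattern per level. A hypothetical sketch (the directory names are made up for illustration):

```shell
# '*' does not cross '/', so list a pattern for each depth at which
# the unwanted directory can appear (names here are placeholders).
wget --mirror --no-parent \
    --exclude-directories='folder/*/large_data,folder/*/*/large_data' \
    http://my.server.org/folder
```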
Mark