Mirror HTTP website, excluding certain files

無奈伤痛 2021-02-14 19:17

I'd like to mirror a simple password-protected web portal to some data that I'd like to keep mirrored and up to date. Essentially this website is just a directory listing wi

4 answers
  •  伪装坚强ぢ
    2021-02-14 19:51

    Pavuk (http://www.pavuk.org) looked like a promising alternative: it can mirror websites while excluding files based on URL patterns and filename extensions. However, pavuk 0.9.35 segfaults or dies randomly in the middle of long transfers and does not appear to be actively developed (this version was built in Nov 2008).

    FYI, here's how I was using it:
    pavuk -mode mirror -force_reget -preserve_time -progress -Robots -auth_scheme 3 -auth_name x -auth_passwd x -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old -cdir /path/to/root -subdir /path/to/root -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' -skip_url_pattern '*images*' -skip_url_pattern '*bam*' -skip_url_pattern '*solidstats*' http://web.server.org/folder 2>&1 | tee pavuk-date.log

    In the end, wget --exclude-directories did the trick:

    wget --mirror --continue --progress=dot:mega --no-parent \
    --no-host-directories --cut-dirs=1 \
    --http-user x --http-password x \
    --exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
    --directory-prefix /path/to/local/mirror \
    http://my.server.org/folder
    

    Since the --exclude-directories wildcards don't span '/', you need to write your patterns quite specifically (one wildcard per path component) to avoid downloading entire folders.
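
    If the large-data folders can sit at more than one depth, one way to cope with this is to generate one pattern per depth and pass them as a comma-separated list, which --exclude-directories accepts. The sketch below is only an illustration under that assumption: build_excludes is a hypothetical helper (not part of wget), and the maximum depth is something you would have to pick for your own site.

    ```shell
    # Hypothetical helper (not part of wget): print a comma-separated list of
    # patterns that place NAME directly under BASE and under up to MAX_DEPTH
    # intermediate directories, using one '*' per intermediate path component.
    build_excludes() {
        base=$1
        name=$2
        max_depth=$3
        list=""
        mid=""
        depth=0
        while [ "$depth" -le "$max_depth" ]; do
            # Append the pattern for the current depth, comma-separated.
            list="${list:+$list,}${base}${mid}/${name}"
            # Add one more single-component wildcard for the next depth.
            mid="${mid}/*"
            depth=$((depth + 1))
        done
        printf '%s\n' "$list"
    }

    # Example use (assumed folder names, same as in the command above):
    # wget --mirror --no-parent \
    #     --exclude-directories="$(build_excludes folder 'folder_containing_large_data*' 2)" \
    #     http://my.server.org/folder
    ```

    Keep the arguments quoted so the shell does not expand the wildcards before wget sees them.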

    Mark
