mirror http website, excluding certain files

無奈伤痛 2021-02-14 19:17

I'd like to mirror a simple password-protected web portal to some data that I'd like to keep mirrored and up to date. Essentially this website is just a directory listing wi

4 answers
  • 2021-02-14 19:36

    Parameter --reject 'pattern' actually worked for me with wget 1.14.

    For example:

    wget --reject rpm http://somerpmmirror.org/site/
    

    None of the *.rpm files were downloaded, only the index pages.

    Warning: bash can unintentionally expand a file pattern if it matches a file in the working directory. Quote the pattern to avoid that:

    touch blahblah.rpm
    # working
    wget -R '*.rpm' ....
    # working
    wget -R "*.rpm" ....
    # not working
    wget -R *.rpm ....
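
The expansion described above can be observed without running wget at all. The following is a minimal sketch, assuming a POSIX shell and a scratch directory created with mktemp; it only records the arguments that wget would have received in each case.

```shell
# Work in a throwaway directory that contains one matching file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch blahblah.rpm

# Unquoted: the shell replaces *.rpm with the matching file name
# before the command ever sees it.
set -- -R *.rpm
unquoted="$*"

# Quoted: the pattern is passed through literally.
set -- -R '*.rpm'
quoted="$*"

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

With the single file present, the unquoted form turns into `-R blahblah.rpm`, while the quoted form stays `-R *.rpm`.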
    
  • 2021-02-14 19:37

    wget -X directory_to_exclude[,other_directory_to_exclude] -r ftp://URL_ftp_server

    SERVER
        |-logs
        |-etc
        |-cache
        |-public_html
          |-images
          |-videos ( want to exclude )
          |-files
          |-audio  (want to exclude)
    

    wget -r -X /public_html/videos,/public_html/audio 'ftp://SERVER/public_html/*'
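
When the list of excluded directories gets longer, it may be easier to build the comma-separated -X argument from separate entries. This is a small sketch for a POSIX shell; the two paths are the ones from the tree above, SERVER is the same placeholder host, and the final command is only printed, not run.

```shell
# Directories to skip, one positional parameter each.
set -- /public_html/videos /public_html/audio

# "$*" joins the parameters with the first character of IFS,
# so setting IFS to a comma inside the subshell yields a comma list.
exclude_list=$(IFS=,; printf '%s' "$*")

# Print the command for inspection instead of executing it.
echo "wget -r -X $exclude_list ftp://SERVER/public_html/"
```

The IFS change is confined to the command substitution, so the rest of the script keeps its normal word splitting.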

  • 2021-02-14 19:46

    Not possible with wget: http://linuxgazette.net/160/misc/lg/how_to_make_wget_exclude_a_particular_link_when_mirroring.html

    Well, I am not sure about newer versions, though.

    About the 401 code: HTTP authentication keeps no state (no cookie is used), so the username and password must be sent with every request. wget tries the request without the username and password first, and only then falls back to sending them.
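
To make "sent with every request" concrete, here is a minimal sketch of what a Basic authentication header carries, assuming the server uses the Basic scheme and that the base64 utility is installed. The credentials x and x are the placeholders used elsewhere in this thread.

```shell
# Placeholder credentials, as in the other answers.
user=x
pass=x

# Basic auth is just base64("user:password") in a request header;
# the server holds no session, so the client repeats it each time.
token=$(printf '%s:%s' "$user" "$pass" | base64)
header="Authorization: Basic $token"

echo "$header"
```

For x and x this prints `Authorization: Basic eDp4`. GNU Wget documents an --auth-no-challenge option that sends Basic credentials without waiting for the 401 challenge; check the manual of your version before relying on it.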

  • 2021-02-14 19:51

    Pavuk (http://www.pavuk.org) looked like a promising alternative which allows you to mirror websites, excluding files based on url patterns, and filename extensions... but pavuk 0.9.35 seg-faults/dies randomly in the middle of long transfers & does not appear to be actively developed (this version was built Nov 2008).

    FYI, here's how I was using it:
    pavuk -mode mirror -force_reget -preserve_time -progress -Robots \
    -auth_scheme 3 -auth_name x -auth_passwd x \
    -dsfx 'html,bam,bai,tiff,jpg' -dont_leave_site -remove_old \
    -cdir /path/to/root -subdir /path/to/root \
    -skip_url_pattern '*icons*' -skip_url_pattern '*styles*' \
    -skip_url_pattern '*images*' -skip_url_pattern '*bam*' \
    -skip_url_pattern '*solidstats*' \
    http://web.server.org/folder 2>&1 | tee pavuk-date.log

    In the end, wget --exclude-directories did the trick:

    wget --mirror --continue --progress=dot:mega --no-parent \
    --no-host-directories --cut-dirs=1 \
    --http-user x --http-password x \
    --exclude-directories='folder/*/folder_containing_large_data*' --reject "index.html*" \
    --directory-prefix /path/to/local/mirror \
    http://my.server.org/folder
    

    Since the --exclude-directories wildcards don't span '/', you need to write your patterns quite specifically, one wildcard per directory level, to avoid downloading entire folders.
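
The "one wildcard per directory level" behaviour is the same rule the shell applies to file name patterns, so it can be illustrated locally. This sketch uses the shell's own globbing as an analogy, not wget's matcher; the directory names are hypothetical.

```shell
# Build a small tree with a target directory at two different depths.
root=$(mktemp -d)
mkdir -p "$root/folder/a/big_data" "$root/folder/a/b/big_data"
cd "$root" || exit 1

# A single '*' stands for exactly one path component,
# so this pattern only reaches the shallow directory.
set -- folder/*/big_data*
one_level="$*"

# The deeper directory needs one more '*' in the pattern.
set -- folder/*/*/big_data*
two_levels="$*"

echo "folder/*/big_data*   -> $one_level"
echo "folder/*/*/big_data* -> $two_levels"
```

The first pattern matches only `folder/a/big_data` and the second only `folder/a/b/big_data`, which is why an exclusion pattern written for one depth leaves directories at other depths untouched.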

    Mark
