How do I use the --accept-regex option for downloading a website with wget?

Submitted by 余生长醉 on 2020-07-09 07:25:09

Question


I'm trying to download an archive of my website (3dsforums.com) using wget, but there are millions of pages I don't want to download. I'm trying to tell wget to download only pages that match certain URL patterns, and I'm running into some roadblocks.

As an example, this is a URL I would like to download:

http://3dsforums.com/forumdisplay.php?f=46

...so I've tried using the --accept-regex option:

wget -mkEpnp --accept-regex "(forumdisplay\.php\?f=(\d+)$)" http://3dsforums.com

But it just downloads the home page of the website.

The only command that remotely works so far is the following:

wget -mkEpnp --accept-regex "(\w+\.php$)" http://3dsforums.com

This provides the following response:

Downloaded 9 files, 215K in 0.1s (1.72 MB/s)
Converting links in 3dsforums.com/faq.php.html... 16-19
Converting links in 3dsforums.com/index.html... 8-88
Converting links in 3dsforums.com/sendmessage.php.html... 14-15
Converting links in 3dsforums.com/register.php.html... 13-14
Converting links in 3dsforums.com/showgroups.php.html... 14-29
Converting links in 3dsforums.com/index.php.html... 16-80
Converting links in 3dsforums.com/calendar.php.html... 17-145
Converting links in 3dsforums.com/memberlist.php.html... 14-99
Converting links in 3dsforums.com/search.php.html... 15-16
Converted links in 9 files in 0.009 seconds.

Is there something wrong with my regular expressions? Or am I misunderstanding the use of the --accept-regex option? I've been trying all sorts of variations today but I'm not quite grasping what the actual problem is.


Answer 1:


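By default wget uses POSIX regular expressions, which have no \d or \w shorthand classes: \d has to be written as [[:digit:]] (or [0-9]) and \w as [[:alnum:]_]. That is why your first pattern never matches anything. Your second one probably only works because the GNU regex library accepts \w as an extension. Plus, why all the grouping? If your wget is compiled with PCRE support, make your life easier and do it as: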

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?f=\d+$" http://3dsforums.com
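
If PCRE is not available, the same pattern can be spelled in POSIX extended syntax. The following is a minimal sketch, assuming that grep -E is a reasonable stand-in for wget's POSIX matcher; the check_url helper and the posix_pattern variable are names made up for this illustration.

```shell
# POSIX ERE spelling of the pattern: [0-9] replaces the PCRE-only \d.
posix_pattern='forumdisplay\.php\?f=[0-9]+$'

# Hypothetical helper: print "match" or "no match" for one URL.
check_url() {
    if printf '%s\n' "$1" | grep -Eq "$posix_pattern"; then
        echo "match"
    else
        echo "no match"
    fi
}

check_url 'http://3dsforums.com/forumdisplay.php?f=46'   # prints "match"
check_url 'http://3dsforums.com/faq.php'                 # prints "no match"

# The same pattern could then be passed to wget without --regex-type:
# wget -mkEpnp --accept-regex "$posix_pattern" http://3dsforums.com
```

Testing the pattern against a handful of URLs like this is much quicker than re-running the mirror after every change.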

but... that will not work because your forum software creates automatic session IDs (s=<session_id>) and injects them in all the links, so you need to account for those as well:

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?(s=.*&)?f=\d+(&s=.*)?$" http://3dsforums.com

The only problem is that your files will now be saved with the session ID in their names, so you'll need an extra step once wget has finished: bulk-renaming all the files that have the session ID in their names. You could probably do it with a small shell loop and sed, but I'll leave that to you :)
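
For what it's worth, here is a rough sketch of what that rename step might look like. It assumes the session ID is a lowercase hexadecimal string and that the saved names look like forumdisplay.php?s=<session_id>&f=46.html; both are assumptions about the forum software and about how wget names the files, so check a few real file names first. The function names strip_session_id and rename_session_files are made up for this example.

```shell
# Hypothetical helper: remove an "s=<hex>" query parameter from a file name.
strip_session_id() {
    printf '%s\n' "$1" | sed -E \
        -e 's/([?&])s=[0-9a-f]+&/\1/' \
        -e 's/[?&]s=[0-9a-f]+//'
}

# Hypothetical helper: rename every file under a directory whose name
# carries a session ID, skipping any rename that would overwrite a file.
rename_session_files() {
    find "$1" -type f -name '*s=*' | while IFS= read -r path; do
        dir=$(dirname "$path")
        old=$(basename "$path")
        new=$(strip_session_id "$old")
        if [ "$new" != "$old" ] && [ ! -e "$dir/$new" ]; then
            mv -- "$path" "$dir/$new"
        fi
    done
}

# Usage, once wget has finished:
# rename_session_files 3dsforums.com
```

Note that links inside the downloaded pages (the ones rewritten by -k) would still point at the old names, so the same sed expressions would have to be applied to the page contents as well.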

And if your wget doesn't support PCRE, this pattern will end up being quite long, but let's hope it does...
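
In case it doesn't, here is one possible POSIX spelling of the session-aware pattern, as a sketch only. It assumes the session parameter appears either directly after the "?" or as a trailing "&s=..." parameter; the accepts helper and the session_pattern variable are hypothetical names, and grep -E is used as a stand-in for wget's POSIX matcher.

```shell
# POSIX ERE version: optional "s=...&" before f=, optional "&s=..." after it.
session_pattern='forumdisplay\.php\?(s=[^&]*&)?f=[0-9]+(&s=[^&]*)?$'

# Hypothetical helper: succeed (exit status 0) if the URL would be accepted.
accepts() {
    printf '%s\n' "$1" | grep -Eq "$session_pattern"
}

for url in \
    'http://3dsforums.com/forumdisplay.php?f=46' \
    'http://3dsforums.com/forumdisplay.php?s=abc123&f=46' \
    'http://3dsforums.com/forumdisplay.php?f=46&s=abc123' \
    'http://3dsforums.com/showthread.php?t=1'
do
    if accepts "$url"; then
        echo "accept: $url"
    else
        echo "reject: $url"
    fi
done

# Possible wget invocation with the default (POSIX) regex type:
# wget -mkEpnp --accept-regex "$session_pattern" http://3dsforums.com
```

Because of the trailing $, URLs with further parameters (for example a page number after f=46) are rejected; loosen the end of the pattern if those pages are wanted too.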



Source: https://stackoverflow.com/questions/44211968/how-do-i-use-the-accept-regex-option-for-downloading-a-website-with-wget
