I\'m looking for a way to pseudo-spider a website. The key is that I don\'t actually want the content, but rather a simple list of URIs. I can get reasonably close to this i
I've used a tool called xidel
xidel http://server -e '//a/@href' |
grep -v "http" |
sort -u |
xargs -L1 -I {} xidel http://server/{} -e '//a/@href' |
grep -v "http" | sort -u
A little hackish but gets you closer! This is only the first level. Imagine packing this up into a self recursive script!
Create a few regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)">.
Here is the solution I would use:
wget -q http://example.com -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative urls, only full urls.
Explanation regarding the options used in the series of piped commands:
wget -q makes it not have excessive output (quiet mode). wget -O - makes it so that the downloaded file is echoed to stdout, rather than saved to disk.
tr is the unix character translator, used in this example to translate newlines and tabs to spaces, as well as convert single quotes into double quotes so we can simplify our regular expressions.
grep -i makes the search case-insensitive grep -o makes it output only the matching portions.
sed is the Stream EDitor unix utility which allows for filtering and transformation operations.
sed -e just lets you feed it an expression.
Running this little script on "http://craigslist.org" yielded quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
> urls.m3u
This gives me a list of the content resource (resources that aren't images, CSS or JS source files) URIs that are spidered. From there, I can send the URIs off to a third party tool for processing to meed my needs.
The output still needs to be streamlined slightly (it produces duplicates as it's shown above), but it's almost there and I haven't had to do any parsing myself.