How to find all links / pages on a website

無奈伤痛 2020-11-30 17:36

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site.

I've looked

5 Answers
  • 2020-11-30 17:53
    function getalllinks($url) {
        $links = array();
        $content = '';
        // Read the whole page into $content.
        if ($fp = fopen($url, 'r')) {
            while ($line = fread($fp, 1024)) {
                $content .= $line;
            }
            fclose($fp);
        }
        if (strlen($content) > 10) {
            $startPos = 0;
            // Scan for each <a ...> tag; strpos returns false when none remain.
            while (($spos = strpos($content, '<a ', $startPos)) !== false) {
                $spos = strpos($content, 'href', $spos);
                if ($spos === false) break;
                $spos = strpos($content, '"', $spos) + 1;
                $epos = strpos($content, '"', $spos);
                if ($epos === false) break;
                $startPos = $epos;
                $link = substr($content, $spos, $epos - $spos);
                if (strpos($link, 'http://') !== false) $links[] = $link;
            }
        }
        return $links;
    }
    

    Try this code.

  • 2020-11-30 17:56

    Another alternative might be

    Array.from(document.querySelectorAll("a")).map(x => x.href)
    

    With the console's $$ alias it's even shorter:

    Array.from($$("a")).map(x => x.href)
    
  • 2020-11-30 18:05

    If this is a programming question, then I would suggest you write your own regular expression to parse the retrieved content. The target tags are IMG and A for standard HTML. For Java:

    final String openingTags = "(<a [^>]*href=['\"]?|<img[^> ]* src=['\"]?)";
    

    This, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.
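    A minimal sketch of that Pattern/Matcher approach (the class name, the slightly simplified regex, and the sample HTML below are illustrative, not from the original answer):

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkExtractor {
        // Matches the opening of <a ... href=...> and <img ... src=...> tags,
        // capturing the quoted URL. (?i) makes the match case-insensitive.
        private static final Pattern TAG_OPEN = Pattern.compile(
            "(?i)<(?:a\\s[^>]*?href|img[^>]*?src)\\s*=\\s*['\"]([^'\"]+)['\"]");

        public static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<>();
            Matcher m = TAG_OPEN.matcher(html);
            while (m.find()) {
                links.add(m.group(1)); // group 1 is the URL inside the quotes
            }
            return links;
        }

        public static void main(String[] args) {
            String html = "<p><a href=\"http://example.com\">x</a>"
                        + "<img src='/logo.png'></p>";
            System.out.println(extractLinks(html));
            // prints [http://example.com, /logo.png]
        }
    }
    ```

    As the next paragraph warns, a regex like this only handles reasonably well-formed markup; for real-world pages an HTML parser is more robust.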

    However, it is not as easy as you may have initially thought. Many web pages are not well-formed, and programmatically extracting all the links that a human being can "recognize" is really difficult if you need to account for all the irregular cases.

    Good luck!

  • 2020-11-30 18:09

    Check out linkchecker: it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.

  • 2020-11-30 18:14

    If you have the developer console (JavaScript) in your browser, you can type this code in:

    urls = document.querySelectorAll('a'); for (const url of urls) console.log(url.href);
    

    Shortened:

    $$('a').forEach(a => console.log(a.href))
    