Scrapy get all links from any website

暖寄归人 2021-02-09 17:02

I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):

    return_links = []

    # fetch the page and collect the href of every <a> tag
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    for a in soup.find_all("a", href=True):
        return_links.append(a["href"])

    return return_links

How can I get the same behaviour (recursively following every link it finds on a site) with Scrapy?
2 Answers
  • 2021-02-09 17:29

    There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune those settings to do this successfully.
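
    As a starting point, a minimal sketch of what those broad-crawl settings might look like in settings.py (the exact values are assumptions and depend on your hardware and the sites you crawl):

    CONCURRENT_REQUESTS = 100           # raise overall parallelism
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # stay polite towards each site
    COOKIES_ENABLED = False             # cookies are rarely useful in broad crawls
    RETRY_ENABLED = False               # keep the queue moving
    DOWNLOAD_TIMEOUT = 15               # give up on slow responses early
    LOG_LEVEL = 'INFO'                  # DEBUG output gets very noisy at scale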

    To recreate the behaviour you need in Scrapy, you must:

    • set your start URLs in your spider;
    • write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.

    An untested example (which can, of course, be refined):

    import scrapy

    class AllSpider(scrapy.Spider):
        name = 'all'

        start_urls = ['https://yourgithub.com']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.links = []

        def parse(self, response):
            # record the visited URL, then follow every link on the page
            self.links.append(response.url)
            for href in response.css('a::attr(href)'):
                yield response.follow(href, self.parse)
    
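    For completeness, a rough sketch of running this spider from a plain script and reading the collected URLs afterwards (assuming the class above is importable as AllSpider):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    crawler = process.create_crawler(AllSpider)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    print(len(crawler.spider.links), 'URLs visited')
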
  • 2021-02-09 17:45

    If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.

    A simple spider that follows all links:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FollowAllSpider(CrawlSpider):
        name = 'follow_all'

        start_urls = ['https://example.com']
        # an unrestricted LinkExtractor matches every link; follow=True keeps crawling
        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

        def parse_item(self, response):
            pass
    
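    Since parse_item above does nothing, one way to see which pages were followed (a sketch using Scrapy's built-in feed exports) is to yield each URL as an item and run something like scrapy runspider follow_all_spider.py -o links.json, where the file names are just examples:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FollowAllSpider(CrawlSpider):
        name = 'follow_all'
        start_urls = ['https://example.com']
        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

        def parse_item(self, response):
            # one item per visited page; written to the -o output file
            yield {'url': response.url}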