I am conducting research related to distributing the indexing of the internet.
While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, C
What you are looking for here is Scrapy's CrawlSpider.
CrawlSpider lets you define crawling rules that are followed on every page. It's smart enough to avoid crawling images, documents and other files that are not web resources, and it pretty much does the whole thing for you.
Here's a good example of how your spider might look with CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item
This spider will crawl every webpage it can find on the website and yield an item with the URL, title and full text body.
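If the spider lives in a Scrapy project, the built-in feed exports can save those items for you, e.g. scrapy crawl crawlspider -o pages.jl (pages.jl is just an example filename; the .jl extension writes one JSON object per line, which is convenient for later indexing).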
For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's a topic of its own.
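If you do want cleaner text, here is a minimal sketch (the helper name is just an example) that keeps only text nodes sitting outside script and style tags:

def extract_visible_text(response):
    # keep text nodes that are not inside <script> or <style>, drop pure whitespace
    texts = response.xpath(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    ).extract()
    return '\n'.join(t.strip() for t in texts if t.strip())

Then call item['body'] = extract_visible_text(response) inside parse_item.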
Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. They are nicely described on the Broad Crawls docs page.
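For illustration only, a settings.py along the lines of that page might look like this (the values are starting points to tune for your hardware, not recommendations):

# settings.py -- illustrative broad-crawl tuning
CONCURRENT_REQUESTS = 100           # raise global concurrency across many small sites
REACTOR_THREADPOOL_MAXSIZE = 20     # more threads for DNS resolution
LOG_LEVEL = 'INFO'                  # DEBUG logging is too noisy at this scale
COOKIES_ENABLED = False             # broad crawls rarely need session state
RETRY_ENABLED = False               # don't stall on flaky hosts
DOWNLOAD_TIMEOUT = 15               # give up on slow responses quickly
REDIRECT_ENABLED = False            # optionally drop redirects instead of following them
AJAXCRAWL_ENABLED = True            # handle legacy AJAX-crawlable pages

# crawl breadth-first to keep the request queue memory-friendly
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'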