Scrapy- How to extract all blog posts from a category?


Question


I am using Scrapy to extract all the posts from my blog. The problem is that I cannot figure out how to write a rule that reads all the posts in any given blog category.

Example: on my blog, the category "Environment setup" has 17 posts. In the Scrapy code I can hard-code the page URLs as shown below, but this is not a very practical approach:

start_urls=["https://edumine.wordpress.com/category/ide- configuration/environment-setup/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"] 

I have read similar posts related to this question here on SO, like 1, 2, 3, 4, 5, 6, 7, but I can't seem to find the answer in any of them. As you can see, the only difference between the URLs above is the page count. How can I write a rule in Scrapy that reads all the blog posts in a category? And another trivial question: how can I configure the spider to crawl my blog so that when I publish a new blog post, the crawler immediately detects it and writes it to a file?

This is what I have so far for the spider class

from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = "nextpage" # give your spider a unique name because it will be used for crawling the webpages

    #allowed domain restricts the spider crawling
    allowed_domains=["https://edumine.wordpress.com/"]
    # in start_urls you have to specify the urls to crawl from
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

    '''
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]


    rules = [
                Rule(SgmlLinkExtractor
                    (allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"),unique=False,follow=True))
            ]
'''
    rules= Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'),follow=True,callback='parse_page')

    def parse_page(self, response):

        hxs=Selector(response)
        titles = hxs.xpath("//h1[@class='entry-title']")
        items = []
        with open("itemLog.csv","w") as f:
             for title in titles:
                item = BlogscraperItem()
                item["post_title"] = title.xpath("//h1[@class='entry-title']//text()").extract()
                item["post_time"] = title.xpath("//time[@class='entry-date']//text()").extract()
                item["text"]=title.xpath("//p//text()").extract()
                item["link"] = title.select("a/@href").extract()

                items.append(item)

                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(item['post_title'], item['post_time'],item['text']))
                print "#### \tTotal number of posts= ",len(items), " in category####"


        f.close()

Any help or suggestions to solve it?


Answer 1:


There are a few things you can improve in your code, and there are two problems you want to solve: reading the posts and crawling automatically.

If you want to get the contents of a new blog post, you have to re-run your spider; otherwise you would end up with an endless loop. Naturally, in this case you have to make sure that you do not scrape entries you have already scraped (using a database, reading the already-written files at spider start, and so on). But you cannot have a spider that runs forever and waits for new entries; that is not what a spider is for.
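As a rough illustration, the "do not scrape already scraped entries" part can live in an item pipeline. The sketch below is not from the original question; it assumes each item carries a single link string and that previously scraped links are stored one per line in a hypothetical file called scraped_links.txt.

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    # Sketch of a duplicates filter: drop items whose link was already
    # scraped in an earlier run, and remember new links for future runs.

    def open_spider(self, spider):
        # Load links remembered from earlier runs, if the file exists.
        self.seen_links = set()
        try:
            with open('scraped_links.txt') as f:
                self.seen_links = set(line.strip() for line in f)
        except IOError:
            pass

    def process_item(self, item, spider):
        link = item.get('link')
        if link in self.seen_links:
            raise DropItem('Already scraped: %s' % link)
        self.seen_links.add(link)
        with open('scraped_links.txt', 'a') as f:
            f.write('%s\n' % link)
        return item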

Your approach of storing the posts in a file is wrong. Why do you scrape a list of items and then do nothing with it? And why do you save the items inside the parse_page function? This is what item pipelines are for: write one and do the exporting there. Also, the f.close() is unnecessary, because the with statement already closes the file for you.
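A very small export pipeline could look like the sketch below. This is not the original poster's code; the column names are taken from the item fields used in the spider, and the pipeline would be enabled through ITEM_PIPELINES in settings.py (for example ITEM_PIPELINES = {'BlogScraper.pipelines.CsvExportPipeline': 300}, where the module path is an assumption based on the project name).

import csv


class CsvExportPipeline(object):
    # Sketch of a CSV export pipeline: open the file once per crawl,
    # write one row per scraped item, close the file when the spider stops.

    def open_spider(self, spider):
        self.file = open('itemLog.csv', 'w')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['post_title', 'post_time', 'text', 'link'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([
            item.get('post_title'),
            item.get('post_time'),
            item.get('text'),
            item.get('link'),
        ])
        return item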

Your rules variable should throw an error because it is not iterable; I wonder if you even tested your code. The Rule is also more complex than it needs to be. You can simplify it to this:

rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page'),]

This follows every URL that has page/ in it.

If you start your scraper, you will see that the requests are filtered because of your allowed domains:

Filtered offsite request to 'edumine.wordpress.com': <GET https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/>

To solve this, change your domain to:

allowed_domains = ["edumine.wordpress.com"]

If you want to crawl other WordPress sites as well, simply change it to:

allowed_domains = ["wordpress.com"]


Source: https://stackoverflow.com/questions/32818729/scrapy-how-to-extract-all-blog-posts-from-a-category
