Bulk download of pdf with Scrapy and Python3


Question


I would like to bulk-download free-to-download PDFs (copies of an old newspaper called La Gaceta, published from 1843 to 1900) from this website of the Nicaraguan National Assembly with Python 3/Scrapy.

I am an absolute beginner in programming and Python, but I tried to start with an (unfinished) script:

#!/usr/bin/env python3

from urllib.parse import urlparse
import scrapy

from scrapy.http import Request

class gaceta(scrapy.Spider):
    name = "gaceta"

    allowed_domains = ["digesto.asamblea.gob.ni"]
    start_urls = ["http://digesto.asamblea.gob.ni/consultas/coleccion/"]

    def parse(self, response):
        for href in response.css('div#gridTableDocCollection::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # stub so the callback exists; the actual saving logic is still missing
        with open(response.url.split('rdd=')[-1] + '.pdf', 'wb') as f:
            f.write(response.body)

The link to each issue contains an opaque token, so the URLs cannot be predicted and each link has to be extracted from the page source. See, for example, the links to the first four available issues of the said newspaper (an issue was not published every day):

#06/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D

#13/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=3sAxsKCA6Bo%3D

#28/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=137YSPeIXg8%3D

#08/08/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=aTvB%2BZpqoMw%3D
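
(The %2B and %3D sequences are just the URL-encoded + and = characters, so the rdd parameter appears to be a base64-like token; a quick check in Python:)

from urllib.parse import unquote

print(unquote("aTvB%2BZpqoMw%3D"))  # prints: aTvB+ZpqoMw=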

My problem is that I cannot get a working script together.

I would like my script to:

a) find each PDF link within the table that appears after a search (id "tableDocCollection" in the page source). The actual link sits behind the "Acciones" button (XPath of the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[1]/a);

b) display the name of the issue being downloaded, which is also found behind the "Acciones" button (XPath of the name for the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[2]/a); a sketch of both selectors follows below.
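
For illustration, a minimal sketch of those two selectors inside a Scrapy callback (assuming the table were present in the downloaded HTML; as the answer below shows, it is actually loaded via an XHR request):

link = response.xpath(
    '//*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[1]/a/@href'
).extract_first()
title = response.xpath(
    '//*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[2]/a/text()'
).extract_first()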

The major problems I run into when writing the script are:

1) the URL of the website does not change when I submit the search, so it seems I have to tell Scrapy to send the appropriate search terms itself (tick "Búsqueda avanzada", set "Colección: Diario Oficial", "Medio de Publicación: La Gaceta", time interval "06/07/1843 to 31/12/1900");

2) I do not know how each PDF link can be found.

How can I update the above script so that I can download all PDFs in the range 06/07/1843 to 31/12/1900?

Edit:

#!/usr/bin/env python3
from scrapy import FormRequest

# Attempt to replicate the search request (run in the scrapy shell).
# Note: formdata expects flat string key/value pairs, so this nested
# structure cannot be sent as-is.
frmdata = {"rdds": [{"rddid": "+1RiQw3IehE=", "anio": "", "fecPublica": "",
                     "numPublica": "", "titulo": "", "paginicia": None,
                     "norma": None, "totalRegistros": "10"}]}
url = "http://digesto.asamblea.gob.ni/consultas/coleccion/"
r = FormRequest(url, formdata=frmdata)
fetch(r)  # fetch() only exists inside the scrapy shell

# inside a spider callback this would become:
#     yield FormRequest(url, callback=self.parse, formdata=frmdata)

Answer 1:


# -*- coding: utf-8 -*-
import errno
import json
import os

import scrapy
from scrapy import FormRequest, Request


class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    # Collection name -> the 'cole' id the search backend expects;
    # only "Diario Oficial" is enabled here.
    papers = {
    #    "Diario de Circulación Nacional" : "176",
        "Diario Oficial": "28",
    #    "Obra Bibliográfica": "31",
    #    "Otro": "177",
    #    "Texto de Instrumentos Internacionales": "103"
    }

    def parse(self, response):
        # The search results table is filled in via an XHR POST to proxy.php;
        # 'cole' selects the collection and 'initgetRdds' requests the list of
        # documents as JSON.
        for key, value in self.papers.items():
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={
                    'X-Requested-With': 'XMLHttpRequest'
                },
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value
                },
                meta={'paper': key},
                callback=self.parse_rdds
            )

    def parse_rdds(self, response):
        # The proxy replies with JSON; every entry in "rdds" describes one
        # issue and carries the opaque id that pdf.php expects.
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            r['paper'] = response.meta['paper']
            rddid = r['rddid']
            yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                          callback=self.download_pdf, meta=r)
    def download_pdf(self, response):
        # Build a path like "<paper>/<year>/<title>-<date>.pdf"; slashes inside
        # the title or date are replaced so they do not create extra directories.
        filename = "{paper}/{anio}/".format(**response.meta) + "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_")
        if not os.path.exists(os.path.dirname(filename)):
            try:
                os.makedirs(os.path.dirname(filename))
            except OSError as exc:  # guard against race condition
                if exc.errno != errno.EEXIST:
                    raise

        with open(filename, 'wb') as f:
            f.write(response.body)

My laptop is out for repair, and on the spare Windows laptop I am not able to install Scrapy with Python 3. But I am pretty sure this should do the job.
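
For reference, a minimal sketch for running the spider outside a full Scrapy project, assuming the code above is saved as asamblea.py (the module name is just an example):

from scrapy.crawler import CrawlerProcess

from asamblea import AsambleaSpider  # example module name; adjust to your file

process = CrawlerProcess()
process.crawl(AsambleaSpider)
process.start()  # blocks until the crawl is finished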



Source: https://stackoverflow.com/questions/50153798/bulk-download-of-pdf-with-scrapy-and-python3
