Converting a PDF to text/HTML in Python so I can parse it

front-end · unresolved · 2 answers · 1120 views

梦毁少年i · 2021-02-06 14:28

I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:

EDIT: I ended up just getting the link and

2 Answers
  • 2021-02-06 14:38

    It's not exactly magic. I suggest

    • downloading the PDF file to a temp directory,
    • calling out to an external program to extract the text into a (temp) text file,
    • reading the text file.

    For command-line text-extraction utilities you have a number of possibilities, and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text), and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
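    Wired together, the three steps above might look like the sketch below. This is not code from the question: it assumes the pdftotext utility (from poppler-utils) is on the PATH, and the function names (download, build_extract_cmd, pdf_to_text) are made up for illustration.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path

def download(url, dest_dir):
    # Step 1: save the PDF into a temp directory; the filename
    # is derived from the last segment of the URL.
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, dest)
    return dest

def build_extract_cmd(pdf_path, txt_path):
    # Step 2: command line for the external extractor; swap in a
    # different utility here if pdftotext does not fit your needs.
    return ["pdftotext", str(pdf_path), str(txt_path)]

def pdf_to_text(url):
    # Steps 1-3 combined: download, call out, read the text file back.
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = download(url, tmp)
        txt_path = pdf_path.with_suffix(".txt")
        # subprocess.call returns the exit code; non-zero means failure.
        if subprocess.call(build_extract_cmd(pdf_path, txt_path)) != 0:
            raise RuntimeError("text extraction failed")
        return txt_path.read_text(encoding="utf-8")
```

    Keeping the command construction in its own small function makes it easy to test each step separately, as the answer suggests, before chaining them.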

  • 2021-02-06 14:41

    Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.

    To use it, once you had the file saved to disk you would run pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().

    If you don't mind using jQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:

    balance = pdf.pq(':contains("Your balance is")').text()
    strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
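    Once the tree is serialized to XML, ordinary XML tooling applies. A minimal sketch using the standard library's ElementTree XPath support on a hand-written fragment — the element names (LTPage, LTTextLineHorizontal) mirror pdfminer's layout classes, but this fragment is illustrative, not actual pdfquery output:

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment shaped like the XML a PDF-to-tree
# conversion might emit; the structure here is an assumption.
xml = """<pdfxml>
  <LTPage page_label="23">
    <LTTextLineHorizontal>Your balance is 42.00</LTTextLineHorizontal>
  </LTPage>
</pdfxml>"""

root = ET.fromstring(xml)
# ElementTree supports a limited XPath subset, including
# attribute predicates like [@page_label='23'].
lines = [
    el.text
    for el in root.findall(".//LTPage[@page_label='23']/LTTextLineHorizontal")
]
```

    The same query could of course be run with full XPath via lxml, or the serialized XML could be handed to BeautifulSoup as the answer mentions.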
    