pdfminer

Handling PDF documents shown in an online preview with a Python crawler

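会有一股神秘感。 submitted on 2020-04-09 11:24:12
Introduction: I was recently crawling a website and, on reaching the detail pages, found that the target content was served as an online PDF preview, like this one: https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf
From my analysis, this kind of online preview is rendered with pdfjs, so a crawler cannot read the content inside the PDF directly. Note the word "directly": the only way around it is to download the PDF locally first and then convert it with a tool. After going through a lot of material, I found plenty of options:
  1. Use a third-party PDF conversion website.
  2. Use a Python package such as pyPdf, PyPDF2, PyPDF4, or pdfrw.
A search on PyPI turns up many such packages, but there is no telling how well each one works without trying them one by one. Eventually I found a library that fits my needs very well: camelot. camelot reads the tabular data in a PDF file and converts it automatically into a pandas DataFrame, which can then be exported as CSV, JSON, or HTML. My target format is HTML, so let's get started.
Parsing:
  1. Install camelot: pip install camelot-py, then pip install opencv-python (camelot needs the cv2 module, which is installed under the package name opencv-python).
  2. Download the PDF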
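The excerpt stops at the download step, so the following is only a minimal sketch of how the download-then-convert flow described above might look, not the author's code. The function names `looks_like_pdf`, `download_pdf`, and `first_table_as_html` are hypothetical, and the output file name `foo.pdf` is an assumption. `camelot.read_pdf` and `DataFrame.to_html` are the real library calls; camelot is imported lazily so the download helpers work without it.

```python
from urllib.request import urlopen

# Sample file mentioned in the post.
PDF_URL = "https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic header."""
    return data.startswith(b"%PDF-")


def download_pdf(url: str, dest: str) -> str:
    """Download url to dest and refuse anything that is not a PDF."""
    with urlopen(url, timeout=30) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF")
    with open(dest, "wb") as fh:
        fh.write(data)
    return dest


def first_table_as_html(pdf_path: str) -> str:
    """Convert the first table camelot finds on page 1 to an HTML string."""
    import camelot  # third-party: pip install camelot-py opencv-python

    tables = camelot.read_pdf(pdf_path, pages="1")
    if len(tables) == 0:
        raise ValueError("no tables detected on page 1")
    return tables[0].df.to_html()


if __name__ == "__main__":
    path = download_pdf(PDF_URL, "foo.pdf")  # hypothetical local file name
    print(first_table_as_html(path))
```

Checking the magic header before saving guards against the common case where the server returns an HTML viewer page instead of the PDF itself.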

Pdf Miner returns weird letters/characters

拜拜、爱过 submitted on 2020-03-23 09:50:58
Question: I am using pdfminer with Python 3 and I get weird letters in the text that is recovered from the PDF. For instance, I get signiﬁcant instead of significant (notice that the letters f and i are merged into one). I have no idea why this is happening. This is the code I am using.
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    from
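A merged "fi" like this is typically the Unicode ligature character U+FB01 coming straight from the PDF's font. One way to deal with it after extraction, sketched here with only the standard library, is compatibility normalization. The helper name `normalize_ligatures` is hypothetical, and the approach assumes the odd characters really are ligature code points.

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Expand compatibility characters such as the fi/fl ligatures.

    NFKC also folds other compatibility forms (full-width letters,
    superscript digits, and so on), so apply it only if that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    extracted = "signi\ufb01cant"  # text containing the U+FB01 ligature
    print(normalize_ligatures(extracted))
```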

Extracting text written in hindi from pdf in python [duplicate]

不羁岁月 submitted on 2020-02-28 22:16:08
Question: This question already has answers here: Unable to copy exact hindi content from pdf (1 answer); Read PDF using itextsharp where PDF language is non-English (2 answers); Parsing a pdf(Devanagari script) using PDFminer gives incorrect output [duplicate] (1 answer). Closed 2 years ago. I want to extract text typed in Hindi from a PDF document. I've attached the image of the sample page I am dealing with. I've tried using pdfminer to get text from it but the text is garbled (may be due to hindi

Python module for converting PDF to text [closed]

随声附和 submitted on 2020-02-28 07:54:13
Which Python modules are best for converting PDF files to text?
Answer #1: The PDFMiner package has changed since codeape posted. EDIT (again): PDFMiner has been updated again in version 20100213. You can check the version you have installed with the following:
    >>> import pdfminer
    >>> pdfminer.__version__
    '20100213'
Here is the updated version (with comments on what I changed/added):
    def pdf_to_csv(filename):
        from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
        from pdfminer.converter import LTTextItem, TextConverter
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)

Can't install pdfminer.six on Windows 10

徘徊边缘 submitted on 2020-01-11 10:21:47
Question: On my cmd window, I typed pip install pdfminer.six and it gives me these errors.
    Microsoft Windows [Version 10.0.15063]
    (c) 2017 Microsoft Corporation. All rights reserved.
    C:\Users\Eric Kim>pip install pdfminer.six
    Collecting pdfminer.six
    Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProxyError('Cannot connect to proxy.', NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x04435730>: Failed

extraction of text from pdf with pdfminer gives multiple copies

杀马特。学长 韩版系。学妹 submitted on 2019-12-30 11:01:07
Question: I am trying to extract text from a PDF file using PDFMiner (the code found at "Extracting text from a PDF file using PDFMiner in python?"). I didn't change the code except path/to/pdf. Surprisingly, the code returns several copies of the same document. I got the same result with other PDF files. Do I need to pass other arguments, or am I missing something? Any help is highly appreciated. Just in case, I provide the code:
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from
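The code is cut off, so the actual cause is not visible here. One common way to get repeated copies with this kind of loop is to read the whole shared output buffer inside the per-page loop and append it to the result each time. That is an assumption about this case, not a diagnosis. The sketch below reproduces the effect with the standard library only: `buf.write(page)` stands in for `interpreter.process_page(page)`, and both function names are hypothetical.

```python
from io import StringIO


def collect_with_duplicates(pages):
    """Buggy pattern: the full buffer is re-read and appended on every page."""
    buf = StringIO()
    text = ""
    for page in pages:
        buf.write(page)          # stand-in for interpreter.process_page(page)
        text += buf.getvalue()   # appends everything written so far, again
    return text


def collect_once(pages):
    """Safer pattern: process every page, then read the buffer a single time."""
    buf = StringIO()
    for page in pages:
        buf.write(page)          # stand-in for interpreter.process_page(page)
    return buf.getvalue()


if __name__ == "__main__":
    pages = ["page1 ", "page2 ", "page3 "]
    print(repr(collect_with_duplicates(pages)))
    print(repr(collect_once(pages)))
```

If the real code follows the first pattern, moving the `getvalue()` call out of the loop is usually enough.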

Python read part of a pdf page

落爺英雄遲暮 submitted on 2019-12-24 16:53:19
Question: I'm trying to read a PDF file where each page is divided into 3x3 blocks of information of the form
    A | B | C
    D | E | F
    G | H | I
Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots. I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.
    import PyPDF2
    from StringIO import StringIO
    def getPDFContent(path):
        content = ""
        p =
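If the extraction tool can report coordinates for each piece of text (pdfminer's layout objects carry bounding boxes, for example), one possible approach is to bucket every item into its grid cell by position. The sketch below shows only that bucketing step, in plain Python. The function name `grid_cell`, the even 3x3 split, and the US Letter page size of 612 x 792 points are all assumptions for illustration.

```python
def grid_cell(x, y, page_width, page_height, rows=3, cols=3):
    """Map a point in PDF coordinates to a cell label 'A'..'I'.

    PDF coordinates have their origin at the bottom-left corner, so the
    top row (A, B, C) holds the largest y values.
    """
    col = int(x / (page_width / cols))
    row_from_top = int((page_height - y) / (page_height / rows))
    # Clamp so points on the far edges still fall inside the grid.
    col = max(0, min(col, cols - 1))
    row_from_top = max(0, min(row_from_top, rows - 1))
    return chr(ord("A") + row_from_top * cols + col)


if __name__ == "__main__":
    width, height = 612, 792  # assumed US Letter page, in points
    for point in [(100, 700), (300, 400), (600, 10)]:
        print(point, grid_cell(*point, width, height))
```

Grouping the text items by the returned label, then sorting each group by descending y, would give the lines of each entry in reading order.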

Converting PDF to txt

大兔子大兔子 submitted on 2019-12-22 12:25:33
    #Author:Alex.Zhang
    import pyocr
    import importlib
    import sys
    import time
    importlib.reload(sys)
    time1 = time.time()
    # print("Start time:", time1)
    import os.path
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LTTextBoxHorizontal, LAParams
    from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
    text_path = r'parameters in cryo-EM.pdf'
    # text_path = r'photo-words.pdf'
    def parse():
        '''Parse the PDF text and save it to a TXT file'''
        fp = open(text_path, 'rb')  #
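The imports above match an older PDFMiner layout, where `PDFDocument` was imported from `pdfminer.pdfparser`. With a current pdfminer.six install, the same PDF-to-TXT goal can be sketched with its high-level `extract_text` function. This is an alternative under that assumption, not a completion of the truncated script, and the helper names `txt_path_for` and `pdf_to_txt` are hypothetical.

```python
from pathlib import Path


def txt_path_for(pdf_path: str) -> Path:
    """Return the output path: same name as the PDF, with a .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path: str) -> Path:
    """Extract all text from pdf_path and write it next to the PDF."""
    # Third-party dependency: pip install pdfminer.six
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    print(pdf_to_txt(r"parameters in cryo-EM.pdf"))
```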

struct.error: unpack requires a string argument of length 16

我们两清 submitted on 2019-12-22 04:43:19
Question: While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:
    pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
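The traceback is cut off before the failing call, so the root cause inside pdfminer is not visible here. In general, this `struct.error` means `struct.unpack` received a byte string whose length does not match what the format string describes. That usually points to truncated or malformed binary data in the file. The sketch below illustrates the mechanism with the standard library only. The format `">4I"` (four big-endian unsigned 32-bit integers, 16 bytes) and the helper name `unpack_header` are hypothetical, chosen to match the length in the error message.

```python
import struct

# Hypothetical record layout: four big-endian unsigned 32-bit integers.
HEADER_FORMAT = ">4I"


def unpack_header(data: bytes):
    """Unpack a fixed-size header, or return None if the data is too short."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        return None  # struct.unpack would raise struct.error here
    return struct.unpack(HEADER_FORMAT, data[:size])


if __name__ == "__main__":
    print(struct.calcsize(HEADER_FORMAT))
    print(unpack_header(b"\x00" * 16))
    print(unpack_header(b"\x00" * 8))
```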