pdfminer

Python text extraction does not work on some pdfs

大兔子大兔子 submitted on 2019-12-03 21:25:44
I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

    url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
    #url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
    f = urlopen(Request(url)).read()
    fileInput = StringIO(f)
    pdf = PyPDF2.PdfFileReader(fileInput)
    print pdf.getNumPages()
    print pdf.getDocumentInfo()
    print pdf.getPage(1).extractText()

I am able to successfully extract text for the first link. But if I use the same program for the second PDF, I do

How does one obtain the location of text in a PDF with PDFMiner?

╄→尐↘猪︶ㄣ submitted on 2019-12-03 18:40:38
Question: PDFMiner's documentation says: "PDFMiner allows one to obtain the exact location of text in a page". However, I have not been able to find how to do this. PDFMiner's 'documentation' is rather sparse, so I have not understood how to do it.

Answer 1: You are looking for the bbox property on every layout object. There is a little bit of information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything. Here's an example:

    from pdfminer.pdfdocument import
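
As a rough illustration of the bbox property mentioned in the answer, here is a minimal sketch. It assumes pdfminer.six is installed and uses its extract_pages helper from pdfminer.high_level, which is not part of the original answer; the function names describe_text_boxes and pdf_text_boxes are made up for this example. Each bbox is an (x0, y0, x1, y1) tuple in PDF points, measured from the bottom-left corner of the page.

```python
def describe_text_boxes(layout_objects):
    """Return a list of (bbox, text) pairs for objects that carry text.

    Works on anything exposing `bbox` and `get_text()`, which is how
    PDFMiner's text layout objects (for example LTTextBox) behave.
    """
    boxes = []
    for obj in layout_objects:
        if hasattr(obj, "bbox") and hasattr(obj, "get_text"):
            x0, y0, x1, y1 = obj.bbox
            boxes.append(((x0, y0, x1, y1), obj.get_text().strip()))
    return boxes


def pdf_text_boxes(path):
    """Return one list of (bbox, text) pairs per page of the PDF at `path`.

    Assumption: pdfminer.six is installed. It is imported lazily so the
    helper above can be used and tested without it.
    """
    from pdfminer.high_level import extract_pages

    # Each page layout is iterable and yields its top-level layout objects.
    return [describe_text_boxes(page_layout) for page_layout in extract_pages(path)]
```

Calling pdf_text_boxes("some.pdf") (a hypothetical file name) would give, per page, the position and text of each top-level text box; lines, rectangles, and figures are skipped because they have no get_text method.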

Extracting tables from a pdf

痴心易碎 submitted on 2019-12-03 16:03:07
I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck, but I can't really get the data from the tables. This is what one of the tables looks like: As you can see, some columns are marked with an 'x'. I'm trying to turn this table into a list of objects. This is the code so far; I'm using pdfminer now.

    # pdfminer test
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice, TagExtractor
    from pdfminer.pdfpage

struct.error: unpack requires a string argument of length 16

Anonymous (unverified) submitted on 2019-12-03 08:44:33
Question: While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:

    pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer

How to unlock a “secured” (read-protected) PDF in Python?

依然范特西╮ submitted on 2019-12-03 05:09:16
Question: In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying:

    File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
      raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this

Highlight text in a PDF with Python [closed]

喜夏-厌秋 submitted on 2019-12-03 03:32:08
I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now I'd like to add a highlighting feature on the original PDF for the lines where the search terms appeared. Yes, I want to modify the PDF files if necessary. Is there any way to highlight text inside a PDF file? Are PDFMiner, PyPDF2, or any other Python libraries able to do that? ... Or can you recommend another, maybe external, service for it? spacevillain

Extracting PDF text content with Python

Anonymous (unverified) submitted on 2019-12-02 22:56:40
Installation: pip install pdfminer

Classes used to parse a PDF file:
PDFParser: fetches data from a file
PDFDocument: stores the fetched data; it and PDFParser are associated with each other
PDFPageInterpreter: processes the page content
PDFDevice: translates it into the format you need
PDFResourceManager: stores shared resources such as fonts or images

Diagram of the relationships between PDFMiner's classes: LTCurve: represents a generic Bezier curve

Method 1

    # -*- coding:utf-8 -*-
    import time,os.path,requests,re
    time1=time.time()
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams,LTTextBoxHorizontal
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed,PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer

How to unlock a “secured” (read-protected) PDF in Python?

倾然丶 夕夏残阳落幕 submitted on 2019-12-02 20:52:23
In Python I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying:

    File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
      raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
    PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro it turns out it is secured (or "read protected"). From this link, however, I read that there's a multitude of services which can disable this read-protection

Redirect output of a function that converts pdf to txt files to a new folder in python

丶灬走出姿态 submitted on 2019-12-02 19:04:20
Question: I am using Python 3. My code uses pdfminer to convert PDF files to text. I want the output files written to a new folder. Currently they end up in the same folder that the PDFs are read from. How do I redirect the output to a different folder? I want the output in a folder called "D:\extracted_text". Code so far:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import
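
One way to approach the folder question is to build each output path from the target folder plus the PDF's base name, and to create the folder before writing. The sketch below is only an illustration: the names output_path_for and convert_folder are invented here, and by default it assumes a recent pdfminer.six whose pdfminer.high_level module provides extract_text, rather than the converter classes used in the question.

```python
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Map e.g. 'reports/a.pdf' to '<out_dir>/a.txt'."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def convert_folder(src_dir, out_dir, extract_text=None):
    """Write one .txt file into out_dir for every .pdf found in src_dir.

    `extract_text` is any callable taking a PDF path and returning a string.
    Assumption: when it is omitted, pdfminer.six is installed and its
    high-level extract_text function is used.
    """
    if extract_text is None:
        from pdfminer.high_level import extract_text

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if missing

    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        target = output_path_for(pdf_path, out_dir)
        target.write_text(extract_text(str(pdf_path)), encoding="utf-8")
        written.append(target)
    return written
```

With the folder from the question, a call might look like convert_folder(r"D:\pdfs", r"D:\extracted_text"), where D:\pdfs is a made-up source folder. Passing the extractor as a parameter keeps the path handling independent of whichever pdfminer code does the conversion.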

How to use PDFminer.six with python 3?

强颜欢笑 submitted on 2019-12-02 10:38:11
Question: I want to use pdfminer.six, which is for Python 3, to extract PDF content. The problem is that there is no good documentation and no source code example showing how to use it. I have already tried some code from Stack Overflow, but it did not work. My code is as below.

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
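
The excerpt is cut off, so the actual failure is unknown. One thing visible in it is that PDFResourceManager is used without being imported (it lives in pdfminer.pdfinterp, alongside PDFPageInterpreter). The following is a minimal sketch of how such a function could be completed under Python 3, assuming pdfminer.six is installed; the early file check and the lazy imports are choices made for this example, not part of the question.

```python
from io import StringIO
from pathlib import Path


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")

    # Assumption: pdfminer.six is installed. Imported here so the path
    # check above works even when the library is missing.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # text buffer, so the converter writes str, not bytes
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with pdf_path.open("rb") as fp:  # PDFs must be opened in binary mode
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)

    device.close()
    return retstr.getvalue()
```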