pdfminer

decode CID font codes to equivalent ASCII characters

拈花ヽ惹草 posted on 2019-12-07 12:03:14
Question: I'm trying to mine some text from a bunch of PDFs, and a few of them have embedded CID fonts that show up as raw codes in the output: (cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3) (cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92) (cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87) (cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3) (cid:90)(cid:76)(cid:87)(cid:75)(cid:3)
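
One way to approach the sample above, as a rough sketch rather than a general solution: the letter codes in this particular output appear to sit a fixed 29 below their ASCII values (for example, 80 + 29 = 109, which is "m"). The offset of 29 and the helper name decode_cids are assumptions made for illustration; other fonts can use entirely different mappings, and codes that do not land in printable ASCII (such as 177 or 191 here) are left untouched.

```python
import re

# Matches tokens such as "(cid:80)" and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def decode_cids(text, offset=29):
    """Replace (cid:N) tokens with chr(N + offset) when that is printable ASCII.

    The offset of 29 is only an observation about this sample, not a rule.
    """
    def replace(match):
        code = int(match.group(1)) + offset
        if 32 <= code <= 126:
            return chr(code)
        # Leave anything outside printable ASCII as the original token.
        return match.group(0)

    return CID_PATTERN.sub(replace, text)


sample = (
    "(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)"
    "(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)"
)
print(decode_cids(sample))  # prints "metacities " (with a trailing space)
```

Tokens that survive this pass are the ones that need a real lookup in the font's own mapping tables.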

How can I fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?

邮差的信 posted on 2019-12-06 11:37:54
I get a UnicodeDecodeError when using pdfminer (the latest version from git), installed via pip install git+https://github.com/pdfminer/pdfminer.six.git:

    Traceback (most recent call last):
      File "pdfminer_sample3.py", line 34, in <module>
        print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
      File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt
        text = retstr.getvalue()
      File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
        self.buf += ''.join(self.buflist)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How can I fix that?
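
The traceback points at a Python 2 StringIO buffer joining its pieces, which suggests (this is an assumption, since the script itself is not shown) that encoded byte strings and unicode strings ended up in the same buffer. A common remedy is to keep the buffer's type consistent: collect bytes in io.BytesIO and decode once at the end. The sketch below uses only the standard library; collect_text and the sample chunks are hypothetical stand-ins for whatever the converter writes.

```python
import io


def collect_text(chunks, encoding="utf-8"):
    """Write already-encoded byte chunks to a bytes buffer, then decode once."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue().decode(encoding)


# An en dash (U+2013) encodes to bytes starting with 0xe2 in UTF-8,
# the same leading byte the traceback complains about.
chunks = [b"price ", u"\u2013".encode("utf-8"), b" 10"]
print(collect_text(chunks))
```

With a converter that is given a codec, the same idea would mean passing it a bytes buffer instead of a text one.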

pdfminer3k has no method named create_pages in PDFPage

喜夏-厌秋 posted on 2019-12-06 04:01:35
Question: Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have made, and I had no success in parsing a PDF with pdfminer3k. For example: they have moved PDFDocument into pdfparser (sorry if I have misspelled it). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? It

read pdf file horizontally with pdfminer

别来无恙 posted on 2019-12-06 02:35:24
I would like to extract a PDF with pdfminer (version 20140328). This is the code to extract the PDF:

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    import urllib2

    def pdf_to_string(data):
        fp = StringIO(data)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF

pdfminer - import error

柔情痞子 posted on 2019-12-06 01:26:23
I'm new to Python and programming in general. I am trying to install PDFMiner. I have Windows 7 with Python 2.7 installed. I followed the installation instructions (downloaded the PDFMiner source, unpacked it, and ran setup.py to install; it was installed in C:\pdfminer) with no errors. It created build, build\lib, and build\lib\pdfminer, then copied files to build\lib\pdfminer. It created build\scripts-2.7 and copied tools\pdf2txt.py and dumppdf.py to scripts, then wrote install-egg-info, again with no errors. When I try running the test document from the command line: C:\pdfminer

Python text extraction does not work on some pdfs

梦想与她 posted on 2019-12-05 07:39:25
Question: I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF. My code looks like this:

    url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
    #url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
    f = urlopen(Request(url)).read()
    fileInput = StringIO(f)
    pdf = PyPDF2.PdfFileReader(fileInput)
    print pdf.getNumPages()
    print pdf.getDocumentInfo()
    print pdf.getPage(1).extractText()

I am

Reading PDF file contents with Python

走远了吗. posted on 2019-12-04 11:32:35

    import os
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import *
    from pdfminer.converter import PDFPageAggregator

    import pdb

    #inputFile = r'D:\用户目录\桌面\340xxxxxxxxxxxxxxxxxx0.pdf'

    def decode_text(s):
        """
        Decodes a PDFDocEncoding string to

pdfminer3k has no method named create_pages in PDFPage

爱⌒轻易说出口 posted on 2019-12-04 07:46:46
Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have made, and I had no success in parsing a PDF with pdfminer3k. For example: they have moved PDFDocument into pdfparser (sorry if I have misspelled it). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? It seems like there is no new documentation to reflect any of the changes.
CPB If you are interested in

PDFminer: extract text with its font information

孤街醉人 posted on 2019-12-04 03:52:53
I found this question, but it uses the command line, and I do not want to call a Python script from the command line via subprocess and then parse HTML files to get the font information. I want to use PDFMiner as a library. I also found this question, but the answers are all about extracting plain text, without other information such as font name, font size, and so on.

    #!/usr/bin/env python
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import
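
As a hedged sketch of the library-only approach the question asks for: in pdfminer.six, per-character layout objects (LTChar) expose fontname and size attributes, so one option is to walk the layout tree and tally them. The helper names iter_chars, font_usage, and font_usage_in_pdf are made up for this example, and the walker is duck-typed so that it does not depend on a particular pdfminer version.

```python
from collections import Counter


def iter_chars(layout_obj):
    """Recursively yield leaf objects that carry fontname and size attributes."""
    if hasattr(layout_obj, "fontname") and hasattr(layout_obj, "size"):
        yield layout_obj
        return
    try:
        children = iter(layout_obj)
    except TypeError:
        # Not a container and not a character (for example, a rectangle).
        return
    for child in children:
        for char in iter_chars(child):
            yield char


def font_usage(layout_objs):
    """Count characters per (font name, rounded size) pair."""
    counts = Counter()
    for obj in layout_objs:
        for char in iter_chars(obj):
            counts[(char.fontname, round(char.size, 1))] += 1
    return counts


def font_usage_in_pdf(path):
    """Assumes pdfminer.six is installed; extract_pages yields page layouts."""
    from pdfminer.high_level import extract_pages
    return font_usage(extract_pages(path))
```

Calling font_usage_in_pdf("some.pdf") (a placeholder path) would return a Counter keyed by font name and size, which can then be used to tell headings from body text.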

What to do with CIDs in text extracted by PDFMiner?

笑着哭i posted on 2019-12-03 23:07:35
Question: I have some PDFs which are in Hindi and have extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like: As one can see, a number of characters are converted into the form "(cid:number)". On further analysis, I found that a PDF contains CMaps which map character codes to glyph indices. So a CID is a character identifier for the glyph it maps to inside the CMap table. But how are these character codes related to Unicode values?
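
For context on the closing question: when a PDF font carries a ToUnicode CMap, that stream is what ties character codes to Unicode, using entries such as bfchar pairs of the form <source code> <UTF-16BE text>. The sketch below parses only bfchar sections from already-decompressed CMap text; the sample CMap is invented for illustration, parse_bfchar is a hypothetical helper, and bfrange entries are not handled. A font with no ToUnicode CMap has no such table to read, which is one reason extractors fall back to printing (cid:number).

```python
import re

# A bfchar section looks like: "2 beginbfchar <0003> <0020> ... endbfchar".
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: unicode string} from bfchar sections only."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(block):
            # Destination strings are UTF-16BE encoded, written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# Invented sample, not taken from any real document.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0050> <006D>
endbfchar
"""

print(parse_bfchar(SAMPLE_CMAP))
```

With such a table in hand, each (cid:number) token can be looked up by its number instead of being guessed at.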