pdfminer | 易学教程

Extracting text from PDF page's certain areas?

Question: I am trying to parse a PDF book, but I only need the main body, WITHOUT footers, headers, or footnotes. I looked through the pdfminer documentation but haven't succeeded yet. Here is the code I use for getting the text: from pdfminer.converter import TextConverter from pdfminer.pdfinterp import PDFPageInterpreter from pdfminer.pdfinterp import PDFResourceManager from pdfminer.pdfpage import PDFPage with open(pdfname, 'rb') as fh: for page in PDFPage.get_pages(fh, caching=True, check_extractable
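One way to approach the question above is to keep only the text boxes whose bounding box lies between a header band and a footer band. The sketch below assumes the high-level `extract_pages` API of pdfminer.six is available; the 72-point margins and the helper names `in_body` and `body_text` are illustrative assumptions, not part of the original post. Footnotes that sit above the bottom margin would need an extra rule, such as a font-size check.

```python
def in_body(bbox, page_height, top_margin=72.0, bottom_margin=72.0):
    """Return True if a box lies fully between the header and footer bands.

    bbox is (x0, y0, x1, y1) in PDF points, with the origin at the
    bottom-left corner of the page (pdfminer's convention).
    """
    _x0, y0, _x1, y1 = bbox
    return y0 >= bottom_margin and y1 <= page_height - top_margin


def body_text(pdf_path, top_margin=72.0, bottom_margin=72.0):
    """Collect text from the boxes inside the body band of every page."""
    # Imported lazily so in_body() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    parts = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer) and in_body(
                element.bbox, page_layout.height, top_margin, bottom_margin
            ):
                parts.append(element.get_text())
    return "".join(parts)
```

The margins are hypothetical defaults and would need tuning per book, for example by printing each box's `bbox` for a few pages first.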

Notes on a Shell Session to Fix Reading PDF Files in Python

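Contents: 1. Background; 2. Problem; 3. Solution; 4. Analysis and shell operations; 5. Follow-up. 1. Background: I wanted to convert a PDF file to a Word document, and a web search turned up plenty of conversion software. Some of it is free and some is paid; I had no idea which worked well, and I would have had to install and try each one. Never mind whether any of them would solve the problem, just the thought of installing and trialing them gave me a headache. Then I remembered "the Python way", searched for a few blog posts that looked reasonably complete, copied and pasted without a second thought, tweaked the code, and tried running it. The environment was python3.6 + pdfminer3k; I won't reproduce the code here. 2. Problem: No luck: the first run failed with WARNING:root:GBK-EUC-H. I searched again and found a post about the same error, but it was not much use; it only told me that the relevant font file was missing. Following one of its links, I tracked down the font file listing page on GitHub: https://github.com/euske/pdfminer/tree/f1d5d681b6d2ab0ddeaea925ba784ebb94f6d509/pdfminer/cmap 3. Solution: I downloaded the file matching the error, GBK-EUC-H.pickle.gz, decompressed it, and placed it under Lib\site-packages\pdfminer\cmap in the Python installation directory. Running again produced another error: "pdfminer.converter:undefined: <PDFCIDFont: basefont=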

pdfminer3 extracts text from pdf without spaces

Question: pdfminer3 is a simple tool for extracting text from PDFs. While browsing the site for a minimal reproducible example, I ran into the problem of spaces missing from the extracted text. Answer 1: The solution is to specify laparams as follows: from pdfminer3.layout import LAParams converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams()) Source: https://stackoverflow.com/questions/58889337/pdfminer3-extracts-text-from-pdf-without-spaces
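To see why passing `LAParams()` can restore the spaces, it may help to picture what layout analysis does with the gaps between glyphs. The toy function below is a simplified illustration of gap-based space insertion, not pdfminer's actual implementation; the `(text, x0, x1)` glyph tuples and the name `join_chars` are assumptions made for this sketch. It borrows the idea behind the `word_margin` parameter of `LAParams`: a gap wider than `word_margin` times the character width is treated as a word break.

```python
def join_chars(chars, word_margin=0.1):
    """Join (text, x0, x1) glyphs on one line, adding spaces at wide gaps."""
    out = []
    prev_x1 = None
    # Process glyphs left to right by their starting x coordinate.
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        width = x1 - x0
        # A gap wider than word_margin * width counts as a word boundary.
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


glyphs = [("H", 0, 10), ("i", 10, 14), ("y", 20, 30), ("o", 30, 40), ("u", 40, 50)]
print(join_chars(glyphs))  # prints "Hi you"
```

Without any such analysis the glyphs would simply be concatenated, which matches the symptom described in the question.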

Python pdfminer extract image produces multiple images per page (should be single image)

Question: I am attempting to extract images that are in a PDF. The file I am working with has 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin). I am able to parse the text out of page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving it difficult. Additionally, trying to save each page's 3 pictures as a single img provides

Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

Question: I need to extract text from PDF files and have used pdfminer.six successfully, extracting both text paragraphs and tables. But now I get an error related to the line from pdfminer.pdfparser import PDFParser, PDFDocument: ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py) I'm using Anaconda Jupyter, Python 3.7.3, and the package pdfminer.six-20181108. The code I'm using is based on this: How to read pdf file
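In pdfminer.six, `PDFDocument` is defined in `pdfminer.pdfdocument` rather than `pdfminer.pdfparser`, so the usual fix is to split the import into `from pdfminer.pdfparser import PDFParser` and `from pdfminer.pdfdocument import PDFDocument`. For code that has to run against more than one package layout, a small fallback helper is one option; the sketch below is an assumption-laden illustration, and the name `import_first` is hypothetical.

```python
import importlib


def import_first(name, module_names):
    """Return attribute `name` from the first module that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module missing in this environment: try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {module_names}")
```

With pdfminer.six installed, this could be called as `PDFDocument = import_first("PDFDocument", ["pdfminer.pdfdocument", "pdfminer.pdfparser"])`. A plain two-line import is simpler when only one library version needs to be supported.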

PDFminer: extract text with its font information

Question: I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and then parse HTML files to get the font information. I want to use PDFminer as a library. I also found this question, but the answers are all about extracting plain text, without other information such as font name, font size, and so on. Answer 1: #!/usr/bin/env python from pdfminer.pdfparser import PDFParser from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfpage import
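The answer above is cut off, so the following is not a reconstruction of it. As a separate sketch, assuming the high-level `extract_pages` API of pdfminer.six, font information can be read from the `fontname` and `size` attributes of each `LTChar`. The helper names `iter_chars` and `font_runs` are hypothetical and exist only for this illustration.

```python
def iter_chars(pdf_path):
    """Yield (text, fontname, size) for every character in the PDF."""
    # Imported lazily so font_runs() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, obj.size


def font_runs(chars):
    """Group consecutive characters that share (fontname, size) into runs."""
    runs = []
    for text, fontname, size in chars:
        key = (fontname, round(size, 1))
        if runs and runs[-1][0] == key:
            runs[-1][1].append(text)
        else:
            runs.append((key, [text]))
    return [(font, size, "".join(texts)) for (font, size), texts in runs]
```

Calling `font_runs(iter_chars("some.pdf"))` would then return a list of `(fontname, size, text)` tuples; the file name here is a placeholder.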

Highlight text in a PDF with Python [closed]

Question: I'm working on a custom search engine for my PDF data corpus. I have a transformation layer that can dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view that returns the search results listing. Now I'd like to add a highlighting feature on the original PDF for the lines

How can I fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?

Question: I get a UnicodeEncodeError when using pdfminer (the latest version from git) installed via pip install git+https://github.com/pdfminer/pdfminer.six.git: Traceback (most recent call last): File "pdfminer_sample3.py", line 34, in <module> print(convert_pdf_to_txt("samples/numbers-test-document.pdf")) File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt text = retstr.getvalue() File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue self.buf += ''.join(self.buflist)

pdfminer - import error

Question: I’m new to Python and programming in general. I am trying to install PDFMiner. I have Windows 7 with Python 2.7 installed. I followed the installation instructions (downloaded the PDFMiner source, unpacked it, and ran setup.py to install; it was installed in C:\pdfminer) with no errors. It created build, build\lib, and build\lib\pdfminer, then copied files to build\lib\pdfminer. It created build\scripts-2.7 and copied tools\pdf2txt.py and dumppdf.py to scripts, then wrote install-egg-info

read pdf file horizontally with pdfminer

Question: I would like to extract text from a PDF with pdfminer (version 20140328). This is the code to extract the PDF: import sys from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.pdfpage import PDFPage from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter from pdfminer.layout import LAParams from cStringIO import StringIO import urllib2 def pdf_to_string(data): fp = StringIO(data) rsrcmgr = PDFResourceManager() retstr = StringIO() codec = 'utf-8' laparams