pdfminer

Extracting text from PDF page's certain areas?

好久不见. Posted on 2019-12-13 03:10:19
Question: I am trying to parse a PDF book, but I only need the main body, without headers, footers, or footnotes. I looked through the pdfminer documentation but haven't succeeded yet. Here is the code I use to get the text:
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    with open(pdfname, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable

Notes on the Shell operations used to get Python to read a PDF file

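旧时模样 Posted on 2019-12-12 23:18:06
Contents: 1. Background; 2. Problem; 3. Solution; 4. Analysis and Shell operations; 5. Follow-up
1. Background
I wanted to convert a PDF file into a Word document. A quick web search turned up plenty of conversion tools, some free and some paid, but I had no idea which ones worked well and would have had to install and try them one by one. Leaving aside whether any of them would solve the problem, the thought of all that installing and testing was enough to give me a headache. Then I remembered the "Python way": I found a few blog posts that looked reasonably complete, copied and pasted the code without a second thought, tweaked it, and ran it. The environment was python3.6 + pdfminer3k; I won't reproduce the code here.
2. Problem
No luck: the very first run reported WARNING:root:GBK-EUC-H. Another search found a post about the same error, but it was not much help. All I learned from it was that a related font file was missing. Following a link in that post, I eventually reached the font file listing on GitHub: https://github.com/euske/pdfminer/tree/f1d5d681b6d2ab0ddeaea925ba784ebb94f6d509/pdfminer/cmap
3. Solution
I downloaded the file named in the error, GBK-EUC-H.pickle.gz, decompressed it, and placed it under Lib\site-packages\pdfminer\cmap in the Python installation directory. Running again produced another error: "pdfminer.converter:undefined: <PDFCIDFont: basefont=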

pdfminer3 extracts text from pdf without spaces

徘徊边缘 Posted on 2019-12-11 04:25:51
Question: pdfminer3 is a simple tool for extracting text from PDFs. While browsing the site for a minimal reproducible example, I ran into the problem of spaces missing from the extracted text.
Answer 1: The solution is to pass laparams as follows:
    from pdfminer3.layout import LAParams
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
Source: https://stackoverflow.com/questions/58889337/pdfminer3-extracts-text-from-pdf-without-spaces
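For context, here is a minimal sketch of where that converter sits in a full extraction loop. It assumes the pdfminer3 package exposes the same module layout as pdfminer.six (converter, layout, pdfinterp, pdfpage); the function name extract_text_with_spaces is hypothetical and is not part of the library.

```python
import io


def extract_text_with_spaces(pdf_path):
    """Return the text of a PDF with layout analysis enabled."""
    # Imported lazily so this module can be loaded without pdfminer3 installed.
    # Assumption: pdfminer3 mirrors the pdfminer.six module layout.
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    # Passing LAParams() enables layout analysis, which groups characters
    # into words and lines instead of writing them out back to back.
    converter = TextConverter(
        resource_manager, fake_file_handle, laparams=LAParams()
    )
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    try:
        with open(pdf_path, "rb") as fh:
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                page_interpreter.process_page(page)
        return fake_file_handle.getvalue()
    finally:
        converter.close()
        fake_file_handle.close()
```

Calling extract_text_with_spaces("some.pdf") on a real file should then return text with word spacing restored; if words still run together, the spacing thresholds on LAParams (for example word_margin) are the likely place to look.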

Python pdfminer extract image produces multiple images per page (should be single image)

徘徊边缘 Posted on 2019-12-11 02:19:26
Question: I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin). I am able to parse the text out from page 1, but when I try to get the images I am getting 3 images per image page. I cannot determine the image type, which makes saving it difficult. Additionally, trying to save each page's 3 pictures as a single image provides

Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

我们两清 Posted on 2019-12-10 23:38:57
Question: I need to extract text from PDF files and have used pdfminer.six with success, extracting both text paragraphs and tables. But now I get an error related to the line from pdfminer.pdfparser import PDFParser, PDFDocument:
    ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py)
I'm using Anaconda Jupyter with Python 3.7.3 and the package pdfminer.six-20181108. The code I'm using is based on this: How to read pdf file
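One likely cause, sketched under assumptions: in pdfminer.six the PDFDocument class lives in pdfminer.pdfdocument, while older code written for pdfminer3k imported it from pdfminer.pdfparser. The helper names load_pdfminer_classes and open_document below are hypothetical, and open_document assumes the newer constructor style.

```python
def load_pdfminer_classes():
    """Return (PDFParser, PDFDocument) from whichever module layout is installed."""
    from pdfminer.pdfparser import PDFParser
    try:
        # Layout used by pdfminer.six and newer PDFMiner releases.
        from pdfminer.pdfdocument import PDFDocument
    except ImportError:
        # Older layout (for example pdfminer3k), where both classes
        # were defined in pdfparser.
        from pdfminer.pdfparser import PDFDocument
    return PDFParser, PDFDocument


def open_document(fh):
    """Build a PDFDocument from an open binary file handle (newer API only)."""
    PDFParser, PDFDocument = load_pdfminer_classes()
    parser = PDFParser(fh)
    # In the newer API the parser is passed to the constructor; the older
    # API instead wired the two together with set_document / set_parser
    # and a separate initialize() call, so this line would not work there.
    return PDFDocument(parser)
```

If only pdfminer.six is in use, the fallback is unnecessary and the two imports can simply be written on separate lines: PDFParser from pdfminer.pdfparser and PDFDocument from pdfminer.pdfdocument.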

PDFminer: extract text with its font information

人走茶凉 Posted on 2019-12-09 15:55:53
Question: I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and then parse HTML files to get the font information. I want to use PDFMiner as a library. I also found this question, but the answers are all about extracting plain text, without other information such as font name, font size, and so on.
Answer 1:
    #!/usr/bin/env python
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import
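As a separate, minimal sketch of reading font data through the library rather than the command line: pdfminer's layout tree contains LTChar objects that carry fontname and size attributes. The code below assumes a pdfminer.six release recent enough to provide pdfminer.high_level.extract_pages; the helper names iter_font_info and pdf_font_info are hypothetical, and the walker is duck-typed so it accepts any nested iterable of character-like objects.

```python
from collections.abc import Iterable


def iter_font_info(layout_obj):
    """Yield (text, fontname, size) for each character-like object found."""
    if hasattr(layout_obj, "fontname") and hasattr(layout_obj, "size"):
        # Character-level object (LTChar in pdfminer): report its font data.
        yield layout_obj.get_text(), layout_obj.fontname, layout_obj.size
    elif isinstance(layout_obj, Iterable):
        # Container (page, text box, text line, figure): recurse into children.
        for child in layout_obj:
            yield from iter_font_info(child)
    # Anything else (for example LTAnno spacing markers) is skipped.


def pdf_font_info(pdf_path):
    """Yield (text, fontname, size) for every character in a PDF file."""
    # Imported lazily so the walker above works without pdfminer installed.
    # Assumption: pdfminer.six with high_level.extract_pages is available.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(pdf_path):
        yield from iter_font_info(page_layout)
```

Consecutive characters that share the same fontname and size can then be grouped into runs if per-character output is too fine-grained.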

Highlight text in a PDF with Python [closed]

倖福魔咒の Posted on 2019-12-09 04:53:18
Question: I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now, I'd like to add a highlighting feature on the original PDF for the lines

How can I fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?

◇◆丶佛笑我妖孽 Posted on 2019-12-08 01:38:57
Question: I get a UnicodeDecodeError when using pdfminer (the latest version from git), installed via pip install git+https://github.com/pdfminer/pdfminer.six.git:
    Traceback (most recent call last):
      File "pdfminer_sample3.py", line 34, in <module>
        print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
      File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt
        text = retstr.getvalue()
      File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
        self.buf += ''.join(self.buflist)

pdfminer - import error

一世执手 Posted on 2019-12-07 16:55:51
Question: I'm new to Python and programming in general. I am trying to install PDFMiner. I have Windows 7 with Python 2.7 installed. I followed the installation instructions (downloaded the PDFMiner source, unpacked it, and ran setup.py to install; it was installed in C:\pdfminer) with no errors. It created build, build\lib, and build\lib\pdfminer, then copied files to build\lib\pdfminer. It created build\scripts-2.7 and copied tools\pdf2txt.py and dumppdf.py to scripts, then wrote install-egg-info

read pdf file horizontally with pdfminer

半城伤御伤魂 Posted on 2019-12-07 15:06:47
Question: I would like to extract a PDF with pdfminer (version 20140328). This is the code to extract the PDF:
    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    import urllib2
    def pdf_to_string(data):
        fp = StringIO(data)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams