Whitespace gone from PDF extraction, and strange word interpretation

Posted by 余生颓废 on 2019-11-26 16:31:57

Question


Using the snippet below, I've attempted to extract the text data from this PDF file.

import pyPdf

def get_text(path):
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    content = ""
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text, which is my ultimate goal here.

Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

Does anybody know why this might be happening? I don't even know where to start!


Answer 1:


Your PDF file doesn't contain printable space characters; it simply positions the words where they need to go. You'll have to do extra work to figure out where the spaces belong, perhaps by assuming that multi-character runs are words and putting spaces between them.

If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.
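
The idea of inferring spaces from positions can be sketched roughly as follows. This is only an illustration, not what any particular PDF library does: the `Fragment` tuple, the `join_fragments` helper, and the `min_gap` threshold are all hypothetical names, and the sketch assumes your PDF library can already give you the horizontal start and end coordinate of each text run on a line.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A run of text with its horizontal extent on the page (hypothetical structure)."""
    x_start: float
    x_end: float
    text: str


def join_fragments(fragments: List[Fragment], min_gap: float = 1.0) -> str:
    """Join the fragments of one line, inserting a space wherever the horizontal
    gap between neighbours exceeds min_gap (an assumed, tunable threshold)."""
    ordered = sorted(fragments, key=lambda f: f.x_start)
    pieces = []
    previous_end = None
    for frag in ordered:
        # A visible gap between the previous run and this one is treated as a word break
        if previous_end is not None and frag.x_start - previous_end > min_gap:
            pieces.append(" ")
        pieces.append(frag.text)
        previous_end = frag.x_end
    return "".join(pieces)


# Made-up coordinates, purely for illustration
line = [
    Fragment(10.0, 52.0, "spontaneous"),
    Fragment(55.0, 78.0, "finger"),
    Fragment(81.0, 120.0, "movements"),
]
print(join_fragments(line))
```

In practice the threshold would probably need to scale with the font size, since a fixed gap that works for body text may be too small for headings.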
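
"fi" is a typographic ligature, stored in the PDF as a single character (U+FB01 in Unicode) rather than as the two letters "f" and "i". You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute the plain letters "fi" for the ligature character.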

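A minimal sketch of that string replacement, assuming the extractor emits the standard Unicode ligature code points (if your output contains some other stand-in character, the mapping would need adjusting). The `LIGATURES` table and `expand_ligatures` helper are names made up for this example.

```python
# Latin ligature code points from the Unicode "Alphabetic Presentation Forms" block
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter spelling."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


print(expand_ligatures("spontaneous \ufb01nger movements"))
```

`unicodedata.normalize("NFKC", text)` from the standard library also expands these ligatures, but it rewrites many other compatibility characters as well, so the explicit table is the more conservative choice.
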



Answer 2:


Instead of PyPDF2, use the PDFMiner library, which offers the same functionality, as shown below. I got the code from this and edited it to suit my needs; it gives me a text file with whitespace between the words. I work with Anaconda and Python 3.6. To install PDFMiner for Python 3.6, you can use this link.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO


class PdfConverter:

    def __init__(self, file_path):
        self.file_path = file_path

    # Convert the PDF file to a string that has spaces between words
    def convert_pdf_to_txt(self):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'  # 'utf16', 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages
        # The with block closes the file even if a page fails to parse
        with open(self.file_path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        text = retstr.getvalue()  # renamed from str to avoid shadowing the built-in
        retstr.close()
        return text

    # Convert the PDF text to a string and save it as text_pdf.txt
    def save_convert_pdf_to_txt(self):
        content = self.convert_pdf_to_txt()
        with open('text_pdf.txt', 'wb') as txt_pdf:
            txt_pdf.write(content.encode('utf-8'))


if __name__ == '__main__':
    pdfConverter = PdfConverter(file_path='sample.pdf')
    print(pdfConverter.convert_pdf_to_txt())



Answer 3:


As an alternative to PyPDF2, I suggest pdftotext:

#!/usr/bin/env python

"""Use pdftotext to extract text from PDFs."""

import pdftotext

with open("foobar.pdf", "rb") as f:  # pdftotext needs the file opened in binary mode
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)



Answer 4:


PDFBox is a pretty good tool for extracting text from PDF files using Java. Text extraction is its strength; if you want to modify/annotate or view PDF files, another tool might serve you better. It has code for identifying spaces in files.

It also has code for handling ligatures, but you need to have a certain internationalization library on the classpath for that to work -- Icu4j.

You could call the PDFBox text extractor from Python as a command-line program, without writing any Java code.
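
A rough sketch of calling it from Python, assuming the PDFBox 2.x standalone "app" jar and a Java runtime on the PATH. The `ExtractText` command and its `-console` option belong to that release line; other PDFBox versions may name the command differently, so check the documentation for the version you install. The helper names and the jar and PDF paths are placeholders invented for this example.

```python
import subprocess
from typing import List


def build_extract_command(jar_path: str, pdf_path: str) -> List[str]:
    """Build the command line for PDFBox's ExtractText tool, printing to stdout."""
    return ["java", "-jar", jar_path, "ExtractText", "-console", pdf_path]


def extract_text(jar_path: str, pdf_path: str) -> str:
    """Run PDFBox and return the extracted text.

    Requires Java and the PDFBox app jar; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_extract_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# "pdfbox-app.jar" and "sample.pdf" are placeholder paths
print(build_extract_command("pdfbox-app.jar", "sample.pdf"))
```

Calling `extract_text("pdfbox-app.jar", "sample.pdf")` would then return the text as a Python string, ready for the same whitespace clean-up used in the question.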




Answer 5:


PyPDF doesn't read newline characters.

So use PyPDF4 instead.

Install it using

pip install PyPDF4

and use this code as an example

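import PyPDF4
import re

# Open the PDF in binary mode and extract the text of one page
with open(r'3134.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(1)  # page indices are zero-based, so this is the second page
    pages_text = pageObj.extractText()

# Print the extracted text line by line
for line in pages_text.split('\n'):
    #if re.match(r"^PDF", line):
    print(line)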



Answer 6:


I solved this issue by using R:

library(pdftools)
pdf_file <- "xxx/untitled.pdf"
text <- pdf_text(pdf_file)
cat(text[1])


Source: https://stackoverflow.com/questions/11087795/whitespace-gone-from-pdf-extraction-and-strange-word-interpretation
