Question
I am trying to extract this text:
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
from this PDF file. I was able to extract some text between two references using the code below:
import PyPDF2

pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
pageObj1 = pdfReader.getPage(0)
pagecontent = pageObj1.extractText()

def between(value, a, b):
    # Find and validate before-part.
    pos_a = value.find(a)
    if pos_a == -1: return ""
    # Find and validate after part.
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    # Return middle part.
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

desired = between(pagecontent, "5. ", "8. ")
print(desired)
The code above outputs this:
20
REQUEST FOR QUOTATIONSTHIS RFQ IS IS NOT A SMALL BUSINESS SET-ASIDE 4. CERT.FOR NAT. DEF. UNDER BDSA REG. 2 AND/OR DMS REG. 15. ISSUED BY7. DELIVERY 9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE ISSUING OFFICE IN BLOCK 5 ON OR BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information, and quotations furnished are not offers. If you are unable to quote, please so indicate on this form and return it to the address in Block 5. This request does not commit the Government to pay any costs incurred in the preparation of the submission of this quotation or to contract for supplies or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by the quoter.11. SCHEDULE (See Continuation Sheets) 12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE NOTE: Additional provisions and representations are are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18 (REV. 6-95) Prescribed by GSA-FAR (48 CFR) 53.215-1(a) SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY (Date)8. TO: c. CITYd. STATE b. STREET ADDRESS a. NAME OF CONSIGNEEe. ZIP CODE a. 10 CALENDAR DAYS (%)b. 20 CALENDAR DAYS (%) c. 30 CALENDAR DAYS (%)15. Date of Quotationa. NAME (Type or Print)
AREA CODEc. TITLE (Type or Print)d. CITY c. COUNTY b. STREET ADDRESSe. STATE f. ZIP CODESee Schedule2018 MAY 10NUMBERFOB DESTINATIONOTHER (See Schedule)CAGE b. TELEPHONE PAGE OF PAGES1
POC INFORMATION:
WHEN TECHNICAL DATA IS PROVIDED IT MUST BE OBTAINED AT:https://pcf1x.bsm.dla.mil/cfolders. DISCREPANCIES FOUND IN TECHNICAL DATA SHOULD SUBMIT
REQUEST TO THE DLA CUSTOMER SERVICE WEBSITE:https://www.pdmd.dla.mil/cs/
ALL OTHER QUESTIONS (SOLICITATION REQUIREMENTS, ITEM DESCRIPTION, AWARD CHOICE, ETC.), PLEASE CONTACT THE BUYER SHOWN ABOVE.
QUESTIONS REGARDING OPERATION OF THE DLA-BSM INTERNET BID BOARD SYSTEM SHOULD BE E-MAILED TO: DibbsBSM@dla.mil
FOR IMMEDIATE ASSISTANCE, PLEASE REFER TO THE FREQUENTLY ASKED QUESTIONS (FAQS) ON BSM DIBBS AT:
https://www.dibbs.bsm.dla.mil/Refs/help/DIBBSHelp.htm OR PHONE 1-855-DLA-0001 (1-855-352-0001).
MASTER SOLICITATION
THIS SOLICITATION INCORPORATES THE TERMS AND CONDITIONS SET FORTH IN THE DLA MASTER SOLICITATION FOR AUTOMATED SIMPLIFIED
ACQUISITIONS REVISION 46 (FEBRURARY 7, 2018) WHICH CAN BE FOUND ON THE WEB AT:
http://www.dla.mil/Portals/104/Documents/J7Acquisition/Master%20Solicitation%20Rev-46%20February-7-2018.pdf?ver=2018-02-08-063754-70
This solicitation incorporates technical/quality requirements (‚R™ or ‚I™ number in section B). The full text is in the DLA Technical and Quality Master List of Requirements at:
http://www.dla.mil/HQ/Acquisition/Offers/eprocurement.aspx The revisionof the TQ Master in effect on the award date controls.14. SIGNATURE OF PERSON AUTHORIZED TO SIGN QUOTATION 1 20
###################
ISSUED BY7. DELIVERY 9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE ISSUING OFFICE IN BLOCK 5 ON OR BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information, and quotations furnished are not offers. If you are unable to quote, please so indicate on this form and return it to the address in Block 5. This request does not commit the Government to pay any costs incurred in the preparation of the submission of this quotation or to contract for supplies or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by the quoter.11. SCHEDULE (See Continuation Sheets) 12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE NOTE: Additional provisions and representations are are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18 (REV. 6-95) Prescribed by GSA-FAR (48 CFR) 53.215-1(a) SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY (Date)
How can I extract the text below from the PDF file?
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
Answer 1:
That PDF reader doesn't give much scope for interacting with the structure of the returned data. It is, though, possible to add a new function to it that returns each text element as a separate item in a list. You would then at least be able to extract the data between two items. The approach is still not foolproof, as you need to decide which elements mark the start and the end of the section you want:
import itertools

import PyPDF2
# Written against the legacy PyPDF2 1.x API used in the question.
from PyPDF2.pdf import PageObject, ContentStream, b_, TextStringObject


def extractTextList(self):
    # Collect the operand of every text-showing operator as a separate item.
    text_list = []
    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    for operands, operator in content.operations:
        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text.strip()):
                text_list.append(_text.strip())
        elif operator == b_("T*"):
            pass
        elif operator == b_("'"):
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject) and len(i):
                    text_list.append(i)
    return text_list


# Attach the helper to PageObject so that every page gains the new method.
PageObject.extractTextList = extractTextList


def between(text_elements, drop_while, take_while):
    # Skip elements up to the start marker, keep them up to the end marker,
    # then drop the start marker itself with [1:].
    return list(itertools.takewhile(take_while, itertools.dropwhile(drop_while, text_elements)))[1:]


pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page0 = pdfReader.getPage(0)
text_elements = page0.extractTextList()
lines = between(text_elements, lambda x: x != 'RATING', lambda x: 'DAYS' not in x)
print('\n'.join(lines))
This would give you the lines you want, which are then joined into a single output as follows:
DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil
As the new function extractTextList() returns a list of the text elements found in the page, I use itertools.dropwhile() and itertools.takewhile() to process the returned list.
The between() function works in two stages. First, dropwhile() reads the list of strings one at a time and discards them until one fails the first test (that is, until it finds RATING). It then starts passing elements to takewhile(), which keeps taking elements until it spots the word DAYS in one of them. list() is used to create the filtered list. I then drop the first element (as it is the word RATING).
In effect this is an iterative way of doing a slice on the list.
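As a minimal sketch of that two-stage filter without needing the PDF, the list below is a hand-written stand-in for what extractTextList() might return (the values are illustrative, not real output from the file). It also builds the same result with an ordinary index-based slice for comparison:

```python
import itertools


def between(text_elements, drop_while, take_while):
    # Skip elements while drop_while is true, then keep elements while
    # take_while is true; [1:] drops the start marker itself.
    remaining = itertools.dropwhile(drop_while, text_elements)
    return list(itertools.takewhile(take_while, remaining))[1:]


# Hypothetical stand-in for the list returned by extractTextList().
sample = [
    "3. REQUISITION/PURCHASE REQUEST NO.",
    "RATING",
    "DLA LAND AND MARITIME",
    "PO BOX 3990",
    "USA",
    "175 DAYS ADO",
    "6. DELIVER BY (Date)",
]

lines = between(sample, lambda x: x != "RATING", lambda x: "DAYS" not in x)

# The same selection written as a plain slice between two indexes.
start = sample.index("RATING") + 1
stop = next(i for i, x in enumerate(sample) if i >= start and "DAYS" in x)
sliced = sample[start:stop]

print(lines)
print(lines == sliced)
```

The itertools version stops reading as soon as the end marker is seen, whereas the slice version needs both indexes to be located first.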
Note: lambda is just another way of defining a function. In this case each lambda takes a text element called x and returns True or False: the one for dropwhile() returns True while x is not RATING, and the one for takewhile() returns True while the word DAYS is not somewhere inside x. The two itertools functions call these lambda functions for each element in the list.
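To make the lambda point concrete, here is a small sketch in which the two tests are written as ordinary named functions instead. The function names and the sample list are made up for illustration:

```python
import itertools


def is_not_rating(x):
    # Equivalent to: lambda x: x != 'RATING'
    return x != "RATING"


def has_no_days(x):
    # Equivalent to: lambda x: 'DAYS' not in x
    return "DAYS" not in x


# Hypothetical text elements, not real output from the PDF.
elements = ["header", "RATING", "USA", "175 DAYS ADO", "footer"]

kept = list(itertools.takewhile(has_no_days, itertools.dropwhile(is_not_rating, elements)))[1:]
print(kept)
```

Named functions behave the same way here; the lambdas are only a more compact way to pass a one-line test.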
Source: https://stackoverflow.com/questions/50116318/how-extract-extract-specific-text-from-pdf-file-python