Question
I am writing a script that takes scanned PDF files and converts them into lines of text to enter into a database. I use re.findall with a list of regular expressions to pull certain values out of the strings that Tesseract extracts. My problem is that when a regular expression can't find a match, I want it to return "Error" so I can see that there is a problem.
I have tried a handful of if/else statements but I can't seem to get any to notice the None value.
from wand.image import Image as Img
import ghostscript
from PIL import Image
import pytesseract
import re
import os
def get_text_from_pdf(pendingpdf,pendingimg):
    with Img(filename=pendingpdf, resolution=300) as img:
        img.compression_quality = 99
        img.save(filename=pendingimg)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    extractedtext = pytesseract.image_to_string(Image.open(pendingimg))
    os.unlink(pendingimg)
    return extractedtext
def get_results(vendor,extracted_string,results):
    for v in vendor:
        pattern = re.compile(v)
        for match in re.findall(pattern,extracted_string):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results
pendingpdf = r'J:\TBHscan07022019090315001.pdf'
pendingimg = 'Test1.jpg'
aggind = ["^(\w+)(?:.+)\n+3600",
          "Ticket: (nonsensewordstothrowerror)",
          "Ticket: \d+\s([0-9|/]+)",
          "Product: (\w+.+)\n",
          "Quantity: ([\d\.]+)",
          "Truck (\w+)"]
vendor = aggind
extracted_string = get_text_from_pdf(pendingpdf,pendingimg)
results = []
print(get_results(vendor,get_text_from_pdf(pendingpdf,pendingimg),results))
Answer 1:
You could do this in a single line:
results += re.findall(pattern, extracted_string) or ["Error"]
BTW, you get no benefit from compiling the pattern inside the vendor loop because you're only using it once.
Your function could also return the whole search result using a single list comprehension:
return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]
It is a bit weird that you would actually want to modify AND return the results list being passed as parameter. This may produce some unexpected side effects when you use the function.
Your "Error" flag may appear several times in the result list, and given that each pattern may return multiple matches, it will be hard to determine which pattern failed to find a value.
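One way around that ambiguity is to key the results by pattern, so that a failed pattern is identifiable. The following is only a sketch: the function name get_results_by_pattern and the sample text are hypothetical and not part of the original script.

```python
import re

def get_results_by_pattern(patterns, text):
    """Map each pattern to its list of matches, or to "Error" if it found none."""
    results = {}
    for pattern in patterns:
        matches = re.findall(pattern, text)
        # re.findall returns an empty list (falsy) when nothing matches,
        # so `or` substitutes the "Error" marker in that case.
        results[pattern] = matches or "Error"
    return results

# Hypothetical sample text standing in for the OCR output.
sample_text = "Ticket: 12345 07/02/2019\nQuantity: 21.5\n"
sample_patterns = [r"Quantity: ([\d.]+)", r"Truck (\w+)"]
by_pattern = get_results_by_pattern(sample_patterns, sample_text)
```

Here by_pattern maps the Quantity pattern to its list of matches and the Truck pattern to "Error", so it is clear which pattern failed.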
If you only want to signal an error when none of the vendor patterns match, you could use the or ["Error"] trick on the whole result:
return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]
Answer 2:
With an approach like for match in re.findall(pattern,extracted_string): the for loop won't even run if re.findall(...) finds no matches.
Save the result of the match in a variable first, then check it with a condition:
...
matches = re.findall(pattern, extracted_string)
if not matches:
    results.append("Error")
else:
    for match in matches:
        results.append(match)
Note that when iterating through the results of re.findall(...), the check if type(match) is str: doesn't make sense, because with single-group patterns like these each matched item is a string anyway (otherwise a more sophisticated analysis of the string's content would be implied).
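To illustrate what re.findall hands back, here is a small sketch using a made-up sample string (not real OCR output). The item type depends on how many capturing groups the pattern has: strings for zero or one group, tuples of strings for two or more.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Ticket: 12345 07/02/2019"

# No capturing group: each item is the whole match, as a string.
no_group = re.findall(r"\d+", text)

# One capturing group: each item is that group's text, as a string.
one_group = re.findall(r"Ticket: (\d+)", text)

# Two capturing groups: each item is a tuple of strings.
two_groups = re.findall(r"Ticket: (\d+)\s([\d/]+)", text)
```

So a type check would only ever distinguish strings from tuples; it never sees a missing match, because a pattern that matches nothing produces an empty list with no items to iterate over.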
Answer 3:
re.findall returns an empty list when there are no matches. So it should be as simple as:
result = re.findall(my_pattern, my_text)
if result:
    # Successful logic here
else:
    return "Error"
Answer 4:
You have
for match in re.findall(pattern,extracted_string):
    if type(match) is str:
        results.append(match)
    else:
        results.append("Error")
but re.findall() returns an empty list (not None) when it doesn't find anything, so the body of
for match in re.findall(pattern,extracted_string):
never runs and the else branch is never reached.
You need to check whether the result is empty outside of the for loop.
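A quick check of what the two functions return when nothing matches: re.findall gives an empty list, whereas re.search gives None. The sketch below uses a made-up sample string, and the helper first_or_error is a hypothetical name that assumes the pattern contains at least one capturing group.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Ticket: 12345"

found_all = re.findall(r"Truck (\w+)", text)  # empty list, not None
found_one = re.search(r"Truck (\w+)", text)   # None when nothing matches

def first_or_error(pattern, text):
    """Return the first capture group of the first match, or "Error"."""
    match = re.search(pattern, text)
    # Assumes the pattern has at least one capturing group.
    return match.group(1) if match is not None else "Error"
```

A test against None is therefore only meaningful with re.search or re.match; with re.findall, test whether the list is empty.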
Source: https://stackoverflow.com/questions/56855558/how-to-return-a-string-if-a-re-findall-finds-no-match