Question
import re
import pdfplumber

lines = []
total_check = 0
# `file` is the path to the PDF being read
with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)
output data:
Totaalbedrag excl. btw € 25,00
When I try to retrieve VAT from data:
KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(data).group(0)
output: AttributeError: 'NoneType' object has no attribute 'group'
KVK_re = re.compile(r'(excl. btw .+)')
KVK_re.search(r'excl. btw € 25,00').group(0)
output: 'excl. btw € 25,00'
How is it possible that the search finds € 25,00 when I paste the literal output into it, but fails when I pass the data variable?
Please help me!
Answer 1:
In most cases, when a pattern contains a literal space and there is no match, the cause is invisible characters or non-breaking spaces in the input.
When the input contains non-breaking spaces (\xA0), you can simply replace the literal spaces in the pattern with \s to match any whitespace, or with [ \xA0] to match either kind of space.
In this case there appears to be a combination of spaces and some invisible characters, so you can use \W to match any non-word characters instead of a literal space:
r'excl\.\W+btw\W.+'
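A minimal sketch of the diagnosis, assuming the extracted line contains a non-breaking space. The real contents of data were not shown in the question, so the sample string below is hypothetical. ascii() (like repr(), but also escaping non-ASCII characters) makes the invisible character visible, and the \W-based pattern matches where the literal-space pattern does not:

```python
import re

# Hypothetical sample: same visible text as the PDF output, but with a
# non-breaking space (\xa0) between "excl." and "btw".
data = "Totaalbedrag excl.\xa0btw € 25,00"

# ascii() exposes characters that print() renders as ordinary spaces.
print(ascii(data))

literal_re = re.compile(r'excl. btw .+')        # literal spaces only
tolerant_re = re.compile(r'excl\.\W+btw\W.+')   # any non-word characters

literal_match = literal_re.search(data)
tolerant_match = tolerant_re.search(data)

print(literal_match)                    # None: a literal space does not match \xa0
print(ascii(tolerant_match.group(0)))   # the match, with \xa0 shown escaped
```

If the escaped output of the real data shows something other than \xa0, the character class in the pattern can be adjusted accordingly.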
Answer 2:
You didn't show the contents of the data object, but the error message just means that the regex found no match, so search returned None. You're probably calling search on data that doesn't contain that exact string.
>>> KVK_re = re.compile(r'(excl. btw .+)')
>>> KVK_re.search('test').group(0)
AttributeError: 'NoneType' object has no attribute 'group'
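Because search returns None when nothing matches, it is safer to check the result before calling .group(). A small sketch of that guard; the helper name find_excl_btw is hypothetical and not part of the question's code:

```python
import re

KVK_re = re.compile(r'(excl. btw .+)')

def find_excl_btw(text):
    """Return the matched text, or None when the pattern is absent."""
    match = KVK_re.search(text)
    # Only call .group() when search actually found something.
    return match.group(0) if match else None

print(find_excl_btw('test'))  # None, instead of raising AttributeError
found = find_excl_btw('Totaalbedrag excl. btw € 25,00')
```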
Source: https://stackoverflow.com/questions/64607452/why-cant-i-find-this-string-in-regex