Question
Background
I'm using pdfquery to parse multiple files like this one.
Problem
I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because the filter body refers to the name this, which pyquery supplies to the callback, I can't simply add parameters to the function. I thought I could get around this by supplying a partial function built with functools.partial (as seen below).
Input
import pdfquery
import functools

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = 'Path to the file mentioned above'
pdf = load_file(file_with_table)

def elements_in_range(x1_range):
    return in_range(x1_range[0], x1_range[1], float(this.get('x1',0)))

x1_part = functools.partial(elements_in_range, (95,350))
pdf.pq('LTPage[page_index="0"] *').filter(x1_part)
But when I do that, I get the following AttributeError:
Output
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
597 if len(args) == 1:
--> 598 func_globals(selector)['this'] = this
599 if callback(selector, i, this):
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
28 def func_globals(f):
---> 29 return f.__globals__ if PY3k else f.func_globals
30
AttributeError: 'functools.partial' object has no attribute '__globals__'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-74-d75c2c19f74b> in <module>()
15 x1_part = functools.partial(elements_in_range, (95,350))
16
---> 17 pdf.pq('LTPage[page_index="0"] *').filter(x1_part)
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in filter(self, selector)
600 elements.append(this)
601 finally:
--> 602 f_globals = func_globals(selector)
603 if 'this' in f_globals:
604 del f_globals['this']
C:\Anaconda3\lib\site-packages\pyquery\pyquery.py in func_globals(f)
27
28 def func_globals(f):
---> 29 return f.__globals__ if PY3k else f.func_globals
30
31
AttributeError: 'functools.partial' object has no attribute '__globals__'
Is there any way to get around this? Or possibly some other way to write a custom selector for pdfquery that can take arguments?
Answer 1:
What about just using a function to return a new function (similar to a functools.partial in a way), but using a closure instead?
import pdfquery

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = './RG234621_90110.pdf'
pdf = load_file(file_with_table)

def in_range(x1, x2, sample):
    return x1 <= sample <= x2

def in_x_range(bounds):
    def wrapped(*args, **kwargs):
        x = float(this.get('x1', 0))
        return in_range(bounds[0], bounds[1], x)
    return wrapped

def in_y_range(bounds):
    def wrapped(*args, **kwargs):
        y = float(this.get('y1', 0))
        return in_range(bounds[0], bounds[1], y)
    return wrapped

print(len(pdf.pq('LTPage[page_index="0"] *').filter(in_x_range((95, 350))).filter(in_y_range((60, 100)))))

# Or, perhaps easier to read
x_check = in_x_range((95, 350))
y_check = in_y_range((60, 100))
print(len(pdf.pq('LTPage[page_index="0"] *').filter(x_check).filter(y_check)))
OUTPUT
1
16 # <-- bounds check is larger for y in this test
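To see why the closure works where functools.partial does not, here is a small standalone sketch (no pdfquery needed) that checks for the __globals__ attribute pyquery's filter reads in the traceback above. The names takes_bounds and make_check are hypothetical, used only for illustration.

```python
import functools

def takes_bounds(bounds):
    # Stand-in for the question's elements_in_range (hypothetical name).
    return bounds

def make_check(bounds):
    # Returns a real function object that closes over `bounds`.
    def wrapped(*args, **kwargs):
        return bounds
    return wrapped

partial_check = functools.partial(takes_bounds, (95, 350))
closure_check = make_check((95, 350))

# A partial object is not a function, so it has no __globals__;
# the closure is an ordinary function and does.
partial_has_globals = hasattr(partial_check, '__globals__')
closure_has_globals = hasattr(closure_check, '__globals__')
print(partial_has_globals)  # False
print(closure_has_globals)  # True
```

Both objects carry the bounds with them; only the closure exposes the attribute that pyquery needs in order to inject this.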
You could even parameterize the property you are comparing:
import pdfquery

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load()
    return pdf

file_with_table = './RG234621_90110.pdf'
pdf = load_file(file_with_table)

def in_range(prop, bounds):
    def wrapped(*args, **kwargs):
        n = float(this.get(prop, 0))
        return bounds[0] <= n <= bounds[1]
    return wrapped

print(len(pdf.pq('LTPage[page_index="0"] *').filter(in_range('x1', (95, 350))).filter(in_range('y1', (60, 100)))))

x_check = in_range('x1', (95, 350))
y_check = in_range('y1', (40, 100))
print(len(pdf.pq('LTPage[page_index="0"] *').filter(x_check).filter(y_check)))
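The reason a bare this resolves inside wrapped can be read off the traceback in the question: pyquery's filter writes this into the callback's __globals__, calls the callback, and deletes the name afterwards. Below is a rough standalone sketch of that mechanism. fake_filter and the dict elements are simplified stand-ins of my own, not pyquery's actual code, so treat it as an illustration only.

```python
def in_range(prop, bounds):
    def wrapped(*args, **kwargs):
        # `this` is not defined here; fake_filter injects it as a global.
        n = float(this.get(prop, 0))
        return bounds[0] <= n <= bounds[1]
    return wrapped

def fake_filter(elements, func):
    # Hypothetical stand-in for pyquery's filter: expose each element
    # to the callback under the global name `this`, then clean up.
    kept = []
    f_globals = func.__globals__
    try:
        for element in elements:
            f_globals['this'] = element
            if func():
                kept.append(element)
    finally:
        f_globals.pop('this', None)
    return kept

# Plain dicts stand in for the parsed PDF elements (made-up values).
elements = [
    {'x1': '100.0', 'y1': '80.0'},
    {'x1': '400.0', 'y1': '80.0'},
    {'x1': '120.0', 'y1': '20.0'},
]

x_hits = fake_filter(elements, in_range('x1', (95, 350)))
xy_hits = fake_filter(x_hits, in_range('y1', (60, 100)))
print(len(x_hits), len(xy_hits))  # 2 1
```

Because wrapped is an ordinary function, func.__globals__ exists, which is exactly the lookup that fails for a functools.partial object.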
I would also suggest using the parse_tree_cacher argument, as caching the parse tree sped up my iterations while I worked out a solution (though you may not need to reprocess the file as often as I did).
import pdfquery
from pdfquery.cache import FileCache

def load_file(PDF_FILE):
    pdf = pdfquery.PDFQuery(PDF_FILE, parse_tree_cacher=FileCache("/tmp/"))
    pdf.load()
    return pdf
Answer 2:
Although I like the closure approach, I should mention that you can also copy attributes from the wrapped function onto the wrapper, so that the partial object exposes the __globals__ attribute pyquery looks for.
from functools import partial, update_wrapper

# Copy __globals__ and __code__ from the original function onto the partial.
custom_filter = update_wrapper(
    partial(
        elements_in_range, (95, 20)
    ),
    wrapped=elements_in_range,
    assigned=('__globals__', '__code__')
)
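As a quick check of the attribute copying itself, here is a standalone sketch that runs without pdfquery. It reuses the question's elements_in_range with the range test inlined and the question's bounds of (95, 350). The manual injection of this and the dict element are my own stand-ins for what pyquery's filter does. Whether pyquery then calls the partial with arguments it accepts may depend on the pyquery version, so this only demonstrates that __globals__ becomes available.

```python
from functools import partial, update_wrapper

def elements_in_range(x1_range):
    # `this` is expected to be injected as a global before the call.
    return x1_range[0] <= float(this.get('x1', 0)) <= x1_range[1]

plain = partial(elements_in_range, (95, 350))
had_globals_before = hasattr(plain, '__globals__')

# update_wrapper sets the listed attributes on the partial and returns it.
custom_filter = update_wrapper(
    plain,
    wrapped=elements_in_range,
    assigned=('__globals__', '__code__'),
)
shares_globals = custom_filter.__globals__ is elements_in_range.__globals__

# Simulate the injection step with a made-up element, then clean up.
custom_filter.__globals__['this'] = {'x1': '120.0'}
try:
    result = custom_filter()
finally:
    del custom_filter.__globals__['this']

print(had_globals_before, shares_globals, result)  # False True True
```

The copied __globals__ is the same dictionary the original function uses, so a name injected through the partial is visible inside elements_in_range.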
Source: https://stackoverflow.com/questions/45868809/using-functools-partial-to-make-custom-filters-for-pdfquery-getting-attribute-er