Question
I'm trying to do a find all from a Word document for <v:imagedata r:id="rId7" o:title="1-REN"/>
with namespace xmlns:v="urn:schemas-microsoft-com:vml"
and I cannot figure out what on earth the syntax is.
The docs only cover the very straightforward case, and with the URN and VML combo thrown in I can't seem to get any of the examples I've seen online to work. Does anyone happen to know what it is?
I'm trying to do something like this:
import xml.etree.ElementTree as ET

namespace = {'v': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("imagedata", namespace)
for image_id in results:
    print(image_id)
Edit: What @aneroid wrote is 1000% the right answer and super helpful. You should upvote it. That said, after understanding all that - I went with the BS4 answer because it does the entire job in two lines exactly how I need it to 😂. If you don't actually care about the namespaces it seems waaaaaaay easier.
Answer 1:
ET.findall() vs BS4.find_all():
- ElementTree's findall() is not recursive by default*. It only finds direct children of the node provided. So in your case, it only searches for imagedata nodes directly under the root element.
  - * As per mzjn's comment below, prefixing the match argument (tag or path) with ".//" will search for that node anywhere in the tree, since it supports XPath.
- BeautifulSoup's find_all() searches all descendants. So it searches for 'imagedata' nodes anywhere in the tree.
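To make the direct-children point concrete, here is a minimal sketch. The XML is a made-up stand-in (not a real Word document) and the id attributes are only there for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one imagedata directly under the root, one nested deeper.
xml = """<root xmlns:v="urn:schemas-microsoft-com:vml">
  <v:imagedata id="top"/>
  <body>
    <v:shape>
      <v:imagedata id="nested"/>
    </v:shape>
  </body>
</root>"""

namespace = {"v": "urn:schemas-microsoft-com:vml"}
root = ET.fromstring(xml)

# Plain tag: only direct children of root are considered.
direct = root.findall("v:imagedata", namespace)
# ".//" prefix: every descendant is considered.
anywhere = root.findall(".//v:imagedata", namespace)

direct_ids = [el.get("id") for el in direct]
anywhere_ids = [el.get("id") for el in anywhere]
print(direct_ids)
print(anywhere_ids)
```

In a real Word document the imagedata elements sit several levels below the root, which is presumably why the plain-tag search comes back empty.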
However, ElementTree.iter() does search all descendants. Using the 'working with namespaces' example in the docs:
>>> for char in root.iter('{http://characters.example.com}character'):
...     print(' |-->', char.text)
...
 |--> Lancelot
 |--> Archie Leach
 |--> Sir Robin
 |--> Gunther
 |--> Commander Clement
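Applied to the VML namespace from the question, the same iter() approach might look like the sketch below. The sample XML and the id attribute are assumptions for illustration; note that iter() takes the full '{uri}tag' form rather than a prefix plus a namespace dict:

```python
import xml.etree.ElementTree as ET

VML = "urn:schemas-microsoft-com:vml"

# Hypothetical sample with the imagedata element nested below the root.
xml = """<root xmlns:v="urn:schemas-microsoft-com:vml">
  <body><v:shape><v:imagedata id="nested"/></v:shape></body>
</root>"""

root = ET.fromstring(xml)

# Build the fully qualified tag name: '{namespace-uri}localname'.
tag = "{" + VML + "}imagedata"

# iter() walks all descendants (and the element itself) in document order.
found = [el.get("id") for el in root.iter(tag)]
print(tag)
print(found)
```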
- Sadly, ET.iterfind(), which works with namespaces as a dict (like ET.findall), also does not search descendants, only direct children by default*. Just like ET.findall. Apart from how empty strings '' in the tags are treated wrt the namespace, and that one returns a list while the other returns an iterator, I can't say there's a meaningful difference between ET.findall and ET.iterfind.
  - * As above for ET.findall(), prefixing ".//" makes it search the entire tree (matches with any node).
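The list-versus-iterator difference can be seen in a short sketch (the sample XML and id values are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two nested imagedata elements.
xml = """<root xmlns:v="urn:schemas-microsoft-com:vml">
  <v:shape><v:imagedata id="a"/></v:shape>
  <v:shape><v:imagedata id="b"/></v:shape>
</root>"""

namespace = {"v": "urn:schemas-microsoft-com:vml"}
root = ET.fromstring(xml)

as_list = root.findall(".//v:imagedata", namespace)   # a list, reusable
as_iter = root.iterfind(".//v:imagedata", namespace)  # a lazy iterator

is_list = isinstance(as_list, list)
iter_is_list = isinstance(as_iter, list)

# Consuming the iterator once yields the same elements as the list...
ids_from_iter = [el.get("id") for el in as_iter]
# ...but a second pass finds nothing, because it is already exhausted.
leftover = list(as_iter)
```

So iterfind() is the lazier choice if you only loop once, while findall() is more convenient if you need len() or a second pass.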
When you use namespaces with ET, you still need the namespace prefix on the tag. The results line should be:
namespace = {'v': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("v:imagedata", namespace) # note the 'v:'
Also, the 'v' doesn't need to be a 'v'; you could change it to something more meaningful if needed:
namespace = {'image': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("image:imagedata", namespace)
Of course, without a path prefix this still won't get you the imagedata elements that aren't direct children of the root. A hand-written recursive function is one way around that (see this answer on SO for how), but it is likely to hit Python's recursion limit if the descendant depth is too deep, and ElementTree's path syntax makes it unnecessary.
To get all the imagedata elements anywhere in the tree, use the ".//" prefix:
results = ET.fromstring(xml).findall(".//v:imagedata", namespace)
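Putting it together, here is a rough end-to-end sketch that also reads the r:id and o:title attributes from the element in the question. The XML is a hypothetical, heavily trimmed stand-in for word/document.xml, and the R_ID and O_TITLE names are my own. The one thing to watch is that namespaced attributes are keyed as '{uri}name', not 'r:id':

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for word/document.xml.
xml = """
<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office">
  <w:body>
    <w:p>
      <v:shape>
        <v:imagedata r:id="rId7" o:title="1-REN"/>
      </v:shape>
    </w:p>
  </w:body>
</w:document>
"""

namespace = {"v": "urn:schemas-microsoft-com:vml"}

# Attribute names are stored as '{namespace-uri}localname'.
R_ID = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"
O_TITLE = "{urn:schemas-microsoft-com:office:office}title"

root = ET.fromstring(xml)
images = [
    (el.get(R_ID), el.get(O_TITLE))
    for el in root.findall(".//v:imagedata", namespace)
]
print(images)
```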
Answer 2:
I'm going to leave the question open, but the workaround I'm currently using is BeautifulSoup, which happily accepts the v: syntax.
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml, "lxml")
results = soup.find_all("v:imagedata")
Answer 3:
With ElementTree in Python 3.8 or later, you can simply use a wildcard ({*}) for the namespace:
results = ET.fromstring(xml).findall(".//{*}imagedata")
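One thing to keep in mind is that the wildcard matches the local name in any namespace, not just VML. A small sketch, assuming Python 3.8+ and a made-up second namespace (urn:example:other) purely for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two imagedata elements in two different namespaces.
xml = """<root xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:x="urn:example:other">
  <body>
    <v:imagedata/>
    <x:imagedata/>
  </body>
</root>"""

root = ET.fromstring(xml)

# '{*}' matches the tag name regardless of namespace (Python 3.8+).
results = root.findall(".//{*}imagedata")
tags = [el.tag for el in results]
print(tags)
```

If the document can contain same-named elements from other vocabularies, the explicit namespace dict is the safer choice.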
Note the .// part, which means that the whole document (all descendants) is searched.
Source: https://stackoverflow.com/questions/62110439/how-to-use-python-xml-findall-to-find-vimagedata-rid-rid7-otitle-1-ren