Extract keywords from images using Python

Question

I'm still learning Python. I am currently working on Python code that extracts metadata (user-made keywords) from images. I already tried Pillow and exif, but these exclude the user-made tags or keywords. With applist, I managed to extract the metadata including my keywords, but when I tried to parse it with ElementTree to extract the parts of interest, I obtained only empty data.

My XML file looks like this (after some manipulation):

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:description>
            <rdf:Seq>
               <rdf:li xml:lang="x-default">South Carolina, Olivyana, Kumasi</rdf:li>
            </rdf:Seq>
         </dc:description>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>Kumasi</rdf:li>
               <rdf:li>Summer 2016</rdf:li>
               <rdf:li>Charlestone</rdf:li>
               <rdf:li>SC</rdf:li>
               <rdf:li>Beach</rdf:li>
               <rdf:li>Olivjana</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:title>
            <rdf:Seq>
               <rdf:li xml:lang="x-default">P1050365</rdf:li>
            </rdf:Seq>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:aux="http://ns.adobe.com/exif/1.0/aux/">
         <aux:SerialNumber>F360908190331</aux:SerialNumber>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

My code looks like this:

import xml.etree.ElementTree as ET
from PIL import Image, ExifTags
with Image.open("myfile.jpg") as im:
    for segment, content in im.applist:
        marker, body = content.split(b'\x00', 1)
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('utf-8')
print(data)

At this point it wasn't possible to pass this to the parser, as there is an empty line, and it returns an error:

tree = ET.parse(data)

ValueError: embedded null byte

So after removing it, I saved the data in an XML file (the XML data above) and passed it to the parser, but obtained no data:

tree = ET.parse('mytags.xml')
tags = tree.findall('xmpmeta/RDF/Description/subject/Bags')
print (type(tags))
print (len(tags))

<class 'list'>
0

Interestingly, if I use the tags in the form they have in the XML file (i.e. 'x:xmpmeta'), I receive the following error:

SyntaxError: prefix 'x' not found in prefix map

Thanks for your help.

Fabio

Answer 1:

Focusing only on your XML parsing and not on the PIL metadata work, three issues are causing your problem:

1. You need to define the namespace prefixes when using findall, which you can do with the namespaces argument. Your XPath must then include the prefixes.
2. When using findall on the tree, do not include the root element in the path: the root is the starting point, so the path begins at its children.
3. There is no element with the plural local name Bags, only Bag, and matching it would return a list of length one. If you want its children, go one level deeper to rdf:li.

Consider this adjusted script:

import xml.etree.ElementTree as ET

tree = ET.parse('mytags.xml')

# Map each prefix used in the XPath to the namespace URI declared in the XML
nmspdict = {'x': 'adobe:ns:meta/',
            'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
            'dc': 'http://purl.org/dc/elements/1.1/'}

tags = tree.findall('rdf:RDF/rdf:Description/dc:subject/rdf:Bag/rdf:li',
                    namespaces=nmspdict)

print(type(tags))
print(len(tags))

# <class 'list'>
# 6

for i in tags:
    print(i.text)
# Kumasi
# Summer 2016
# Charlestone
# SC
# Beach
# Olivjana
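
As a side note on the ValueError in the question: ET.parse expects a file name or file object, so a string of XML passed to it is treated as a path, which is the likely reason for the "embedded null byte" message. ET.fromstring parses a string directly, so saving to a file should not be necessary. Below is a minimal sketch of that idea. The helper name extract_keywords and the sample string are made up for illustration, and it assumes the packet uses the x:xmpmeta prefix as in your file.

```python
import xml.etree.ElementTree as ET

# Prefixes used in the XPath, mapped to the URIs declared in the XMP packet.
NAMESPACES = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

CLOSING_TAG = '</x:xmpmeta>'


def extract_keywords(xmp_text):
    """Return the dc:subject keywords from an XMP packet given as a string.

    Hypothetical helper, not part of Pillow or ElementTree.
    """
    # Keep only the <x:xmpmeta> element so that anything around it
    # (null bytes, padding, wrapper lines) never reaches the parser.
    start = xmp_text.find('<x:xmpmeta')
    end = xmp_text.find(CLOSING_TAG)
    if start == -1 or end == -1:
        return []
    # fromstring parses a string; parse would treat it as a file name.
    root = ET.fromstring(xmp_text[start:end + len(CLOSING_TAG)])
    # root is the xmpmeta element, so the path starts at its child rdf:RDF.
    items = root.findall(
        'rdf:RDF/rdf:Description/dc:subject/rdf:Bag/rdf:li', NAMESPACES)
    return [item.text for item in items]


# Made-up sample with a leading null byte, similar in shape to the question's data.
sample = (
    '\x00\n'
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:subject><rdf:Bag>'
    '<rdf:li>Kumasi</rdf:li><rdf:li>Beach</rdf:li>'
    '</rdf:Bag></dc:subject>'
    '</rdf:Description></rdf:RDF></x:xmpmeta>\n'
)

print(extract_keywords(sample))
# ['Kumasi', 'Beach']
```

In your script, the string to pass in would be the data variable you decode from the APP1 segment.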

Source: https://stackoverflow.com/questions/42892405/extract-keywords-form-images-using-python

Tags

python

xml

image

metafile