How to parse data-uri in python?

逝去的感伤 2020-12-04 01:59

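HTML image elements have this simplified format:

    <img src="something">

That something can be a data URI, for example:

    <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==">
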
6 answers
  • 2020-12-04 02:19

    w3lib (a library used by Scrapy) has a function to parse data uris:

    >>> from w3lib.url import parse_data_uri
    >>> parse_data_uri('data:image/png;base64,iVBORw0KGg==')
    ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')
    
  • 2020-12-04 02:20
    from urllib import request

    def download(data_uri, name):
        # urlopen handles both http(s) URLs and data URIs (Python 3.4+)
        with request.urlopen(data_uri) as response:
            data = response.read()

        # Write the raw bytes to the target file
        with open(name, "wb") as f:
            f.write(data)

    en = "https://encrypted-tbn0.gstatic.com/images..."
    src = "data:image/png;base64,..."

    download(en, "en")
    download(src, "src")
    
  • 2020-12-04 02:27

    Python has supported data URIs since version 3.4. Under the hood, urlopen uses urllib.request.DataHandler.

    from urllib.request import urlopen
    
    with urlopen(data_uri) as response:
        data = response.read()
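
As a small illustration of the urlopen approach, the sketch below decodes the "Red dot" PNG that appears in another answer on this page and also reads the media type from the response headers. It assumes Python 3.4 or later, and the variable names are illustrative only.

```python
from urllib.request import urlopen

# Sample data URI: the "Red dot" PNG quoted in another answer on this page.
data_uri = (
    "data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAAUA"
    "AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO"
    "9TXL0Y4OHwAAAABJRU5ErkJggg=="
)

with urlopen(data_uri) as response:
    # DataHandler exposes the declared media type as a Content-type header.
    media_type = response.headers.get_content_type()
    data = response.read()

print(media_type)  # image/png
print(data[:8])    # the 8-byte PNG signature
```

Reading the media type this way avoids parsing the header portion of the URI by hand.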
    
  • 2020-12-04 02:29

    Correcting JRodDynamite's post:

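    from base64 import b64decode

    png_arr = "..."
    # Keep only the base64 payload after the "data:...;base64," header
    png_arr = png_arr.split(",", 1)[1]

    # b64decode replaces decodestring, which was removed in Python 3.9
    with open("imageToSave.png", "wb") as fh:
        fh.write(b64decode(png_arr))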
    
  • 2020-12-04 02:38

    Split the data URI on the first comma to separate the header from the base64-encoded data. Call base64.b64decode to decode that data to bytes. Finally, write the bytes to a file.

    from base64 import b64decode
    
    data_uri = "..."
    
    # Python 2 and Python < 3.4
    header, encoded = data_uri.split(",", 1)
    data = b64decode(encoded)
    
    # Python 3.4+
    # from urllib import request
    # with request.urlopen(data_uri) as response:
    #     data = response.read()
    
    with open("image.png", "wb") as f:
        f.write(data)
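
The split-and-decode approach above assumes a base64 payload. Data URIs can also carry percent-encoded data (no ;base64 marker), so a slightly more general version might look like the sketch below. The function name parse_data_uri_simple and its return shape are hypothetical, chosen for illustration. It expects a lowercase "data:" prefix and does not validate the media type or cover every edge case of the data URI syntax.

```python
from base64 import b64decode
from urllib.parse import unquote_to_bytes


def parse_data_uri_simple(data_uri):
    """Return (media_type, data_bytes) for a data URI (illustrative helper)."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, payload = data_uri[len("data:"):].split(",", 1)
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    # An empty media type defaults to text/plain;charset=US-ASCII (RFC 2397).
    media_type = header or "text/plain;charset=US-ASCII"
    # Undo any percent-encoding first, then base64-decode if requested.
    raw = unquote_to_bytes(payload)
    return media_type, b64decode(raw) if is_base64 else raw


print(parse_data_uri_simple("data:text/plain;base64,SGVsbG8="))
print(parse_data_uri_simple("data:,Hello%2C%20World!"))
```

The first call yields the bytes for "Hello" with media type text/plain. The second has no media type and no base64 marker, so the payload is only percent-decoded.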
    
  • This may help:

    import re
    from base64 import b64decode
    from lxml import html

    BASE_NAME = "image_"

    source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
    AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
    9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
    <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

    tree = html.fromstring(source_code)

    # Select the src attribute of every <img> whose source is an image data URI
    for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
        image_type, image_content = image.split(',', 1)
        image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
        with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
            # b64decode skips the line breaks embedded in the attribute value
            f.write(b64decode(image_content))
        print("[*] '{}' image found with content: {}\n".format(image_type, image_content))
    

    Output:

    [*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
    AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
    9TXL0Y4OHwAAAABJRU5ErkJggg==
    
    [*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=
    

    This saves every base64 image found in <img> tags, each with its own file extension.

    Each file name is BASE_NAME followed by the index supplied by enumerate and the image extension, for example image_0.png and image_1.gif.
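
If lxml is not available, a similar extraction can be sketched with the standard library's html.parser. This swaps in a different parser than the one this answer uses. The class name DataURICollector and the sample markup are hypothetical, for illustration only, and the sketch collects the decoded bytes in memory rather than writing files.

```python
from base64 import b64decode
from html.parser import HTMLParser


class DataURICollector(HTMLParser):
    """Collect (subtype, bytes) for each <img> with a base64 image data URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None for valueless attributes.
        src = dict(attrs).get("src") or ""
        if tag != "img" or not src.startswith("data:image/"):
            return
        header, encoded = src.split(",", 1)
        if header.endswith(";base64"):
            # "data:image/gif;base64" -> "gif"
            subtype = header[len("data:image/"):-len(";base64")]
            self.images.append((subtype, b64decode(encoded)))


collector = DataURICollector()
collector.feed(
    '<img src="data:image/gif;base64,'
    'R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />'
)
for subtype, data in collector.images:
    print(subtype, data[:6])
```

Each collected pair can then be written to disk with the same naming scheme described above.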
