How to parse data-uri in python?

后端未结

关注

 6  1217

HTML image elements have this simplified format:

That something can be data-uri, for example:


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  太阳男子        
                
              
                            
                2020-12-04 02:19
              
            
            
                                                                       
w3lib (a library used by Scrapy) has a function to parse data uris:

>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('data:image/png;base64,iVBORw0KGg==')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2020-12-04 02:20
              
            
            
                                                                       
from urllib import request

def download(data_uri,name):

    with request.urlopen(data_uri) as response:
         data = response.read()

    with open(name, "wb") as f:
        f.write(data)

en="https://encrypted-tbn0.gstatic.com/images..."

src="data:image/png;base64,..."

download(en,"en")

download(src,"src")

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  青春惊慌失措        
                
              
                            
                2020-12-04 02:27
              
            
            
                                                                       
Python since 3.4 have support for data-uri. Under hood using urllib.request.DataHandler.

from urllib.request import urlopen

with urlopen(data_uri) as response:
    data = response.read()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不知归路        
                
              
                            
                2020-12-04 02:29
              
            
            
                                                                       
Correcting JRodDynamite's post:

from base64 import decodestring

png_arr= "data:image/png;base64,iVBORw0KGg..."
png_arr = png_arr.split(",")
png_arr = png_arr[1]

fh = open("imageToSave.png", "wb")
fh.write(decodestring(png_arr))
fh.close()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  误落风尘        
                
              
                            
                2020-12-04 02:38
              
            
            
                                                                       
Split the data URI on the comma to get the base64 encoded data without the header. Call base64.b64decode to decode that to bytes. Last, write the bytes to a file.
from base64 import b64decode

data_uri = "data:image/png;base64,iVBORw0KGg..."

# Python 2 and <Python 3.4
header, encoded = data_uri.split(",", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不要未来只要你来        
                
              
                            
                2020-12-04 02:43
              
            
            
                                                                       
This may help:

import re
from lxml import html

BASE_NAME = "image_"

source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)

for i,image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    image_type, image_content = image.split(',', 1)
    image_type = re.findall('data:image\/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        f.write(image_content.decode('base64'))
    print "[*] '{}' image found with content: {}\n".format(image_type, image_content)


Output:

[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==

[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=


It will save every base64 image within <img> tags, with their respective file extension:

Prefixed by BASE_NAME + auto-increment digit(s) provided by enumerate + image_extension


                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复