First, I'm pretty new to Python, so please leave a comment if you consider downvoting.
I have a url such as
http://example.com/here/there/
You may also want to look into restricted characters in filenames.
I would use a typical folder structure for this task. If you use it with a lot of URLs, it will sooner or later become a mess, and you will run into filesystem performance issues or limits as well.
This is a bad idea, as you will hit the 255-byte limit for filenames: URLs tend to be very long, and even longer when base64-encoded!
You can compress and b64 encode but it won't get you very far:
from base64 import b64encode
import zlib
import bz2
from urllib.parse import quote
def url_strategies(url):
    url = url.encode('utf8')
    print(url.decode())
    print(f'normal  : {len(url)}')
    print(f'quoted  : {len(quote(url, ""))}')
    b64url = b64encode(url)
    print(f'b64     : {len(b64url)}')
    print(f'b64+zlib: {len(b64encode(zlib.compress(b64url)))}')
    print(f'b64+bz2 : {len(b64encode(bz2.compress(b64url)))}')
Here's an average url I've found on angel.co:
URL = 'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'
And even with b64+zlib it doesn't fit into 255 limit:
normal : 316
quoted : 414
b64 : 424
b64+zlib: 304
b64+bz2 : 396
Even with the best strategy of zlib compression and b64encode you'd still be in trouble.
Alternatively, what you should do is hash the URL and attach the URL itself as a file attribute:
import os
from hashlib import sha256
def save_file(url, content, char_limit=13):
    # hash the url as sha256, truncated to a 13-character filename
    hash = sha256(url.encode()).hexdigest()[:char_limit]
    filename = f'{hash}.html'
    # e.g. 93fb17b5fb81b.html
    with open(filename, 'w') as f:
        f.write(content)
    # store the original url as an extended file attribute
    os.setxattr(filename, 'user.url', url.encode())
and then you can retrieve the url attribute:
print(os.getxattr(filename, 'user.url').decode())
'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'
Note: setxattr and getxattr require the user. prefix for attribute names in Python.
For more on file attributes in Python, see the related answer here: https://stackoverflow.com/a/56399698/3737009
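Since the filename is derived deterministically from the URL, you can also look a saved page up later simply by rehashing. A minimal sketch, assuming the same save_file scheme and 13-character truncation as above (load_file is a hypothetical helper, not part of the original answer):

```python
import os
from hashlib import sha256

def load_file(url, char_limit=13):
    # recompute the same truncated sha256 hash to locate the file
    filename = f'{sha256(url.encode()).hexdigest()[:char_limit]}.html'
    with open(filename) as f:
        return f.read()
```

This avoids any need to scan extended attributes when you already know the URL; the xattr is only needed for the reverse lookup (file to URL).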
Using urllib.request.urlretrieve (note: urllib.URLopener is Python 2 only; in Python 3, use urlretrieve directly):
import urllib.request
urllib.request.urlretrieve("http://example.com/here/there/index.html", "/tmp/index.txt")
You have several problems. One of them is that Unix shell abbreviations (~) are not going to be auto-interpreted by Python, as they are in Unix shells. The second is that you're not going to have good luck writing a file path in Unix that has embedded slashes. You will need to convert them to something else if you're going to have any chance of retrieving the files later. You could do that with something as simple as response.url.replace('/', '_'), but that will leave you with many other characters that are also potentially problematic. You may wish to "sanitize" all of them in one shot. For example:
import os
from urllib.parse import quote

def write_response(response, filedir='~'):
    filedir = os.path.expanduser(filedir)
    # safe='' percent-encodes every reserved character, including '/'
    filename = quote(response.url, '')
    filepath = os.path.join(filedir, filename)
    with open(filepath, "w") as f:
        f.write(response.body)
This uses os.path functions to clean up the file paths, and urllib.parse.quote to sanitize the URL into something that can work as a file name. There is a corresponding unquote to reverse that process.
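To illustrate the round trip, here is a minimal sketch using the standard urllib.parse functions:

```python
from urllib.parse import quote, unquote

url = 'http://example.com/here/there/index.html'
# safe='' means even '/' gets percent-encoded, so the result
# contains no path separators and is usable as a single filename
filename = quote(url, safe='')
print(filename)  # http%3A%2F%2Fexample.com%2Fhere%2Fthere%2Findex.html

# unquote reverses the encoding exactly
assert unquote(filename) == url
```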
Finally, when you write to a file, you may need to tweak this a bit depending on what the responses are and how you want them written. If you want them written in binary, you'll need "wb", not just "w", as the file mode. Or if it's text, it might need some sort of encoding first (e.g., to utf-8). It depends on what your responses are and how they are encoded.
You could use the reversible base64 encoding.
>>> import base64
>>> base64.b64encode(b'http://example.com/here/there/index.html')
b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA=='
>>> base64.b64decode(b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA==')
b'http://example.com/here/there/index.html'
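One caveat: the standard base64 alphabet includes / (and +), so a plain b64encode result is not always safe as a filename. base64.urlsafe_b64encode substitutes - and _ instead, which a quick sketch can demonstrate:

```python
import base64

url = b'http://example.com/here/there/index.html'
name = base64.urlsafe_b64encode(url)
# '-' and '_' replace '+' and '/', so the result contains no path separators
assert b'/' not in name
# the encoding remains fully reversible
assert base64.urlsafe_b64decode(name) == url
```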
or perhaps binascii:
>>> import binascii
>>> binascii.hexlify(b'http://example.com/here/there/index.html')
b'687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c'
>>> binascii.unhexlify('687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c')
b'http://example.com/here/there/index.html'