Create (sane/safe) filename from any (unsafe) string

醉酒成梦 2020-12-28 12:45

I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).


11 Answers
  • 2020-12-28 13:39

    No solutions here, only problems that you must consider:

    • what is the smallest maximum filename length among the systems you target? (e.g. DOS supports only 8.3 names, i.e. 8 to 11 characters; most file systems don't support names longer than 255 characters)

    • what filenames are forbidden in some context? (Windows still doesn't support saving a file as CON.TXT -- see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)

    • remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.

    • is there a risk that filenames will collide -- either due to removal of characters or the same filename being used multiple times?
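    The second and third points can be checked mechanically. A minimal sketch, assuming the commonly documented Windows device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9); the helper name is_reserved_name is hypothetical and not from any library:

```python
# Windows device names; they stay reserved even with an extension (CON.TXT).
RESERVED_STEMS = {"CON", "PRN", "AUX", "NUL"}
RESERVED_STEMS |= {f"COM{i}" for i in range(1, 10)}
RESERVED_STEMS |= {f"LPT{i}" for i in range(1, 10)}


def is_reserved_name(name: str) -> bool:
    """Return True for '.', '..', and Windows device names (hypothetical helper)."""
    if name in {".", ".."}:
        return True
    # Compare only the part before the first dot, ignoring case.
    stem = name.split(".", 1)[0].upper()
    return stem in RESERVED_STEMS
```

    A caller could prefix or suffix the name when this returns True, rather than rejecting it outright.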

    Consider just hashing the data and using the hexdump of that as a filename?
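    A minimal sketch of the hashing idea, assuming SHA-256 is an acceptable hash and that the data is available as bytes; the helper name hashed_filename and the optional extension parameter are illustrative assumptions:

```python
import hashlib


def hashed_filename(data: bytes, extension: str = "") -> str:
    """Name a file after the SHA-256 digest of its data (hypothetical helper)."""
    digest = hashlib.sha256(data).hexdigest()  # 64 hex characters, ASCII only
    return digest + extension
```

    The result has a fixed length and contains only the characters 0-9 and a-f, which sidesteps the length, forbidden-name, and character questions above, at the cost of readability. To hash an unsafe string instead of file content, it can be passed in as title.encode("utf-8").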

  • 2020-12-28 13:41

    If you don't mind importing another package, Werkzeug has a function for sanitizing filenames:

    >>> from werkzeug.utils import secure_filename
    >>> secure_filename("hello.exe")
    'hello.exe'
    >>> secure_filename("/../../.ssh")
    'ssh'
    >>> secure_filename("DROP TABLE")
    'DROP_TABLE'
    >>> # Fork bomb on Linux
    >>> secure_filename(": () {: |: &} ;:")
    ''
    >>> # Delete all files on Windows
    >>> secure_filename("del*.*")
    'del'
    

    https://pypi.org/project/Werkzeug/
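    As the fork bomb example shows, the sanitized result can be an empty string, so the caller needs a fallback. A minimal sketch, assuming a random UUID is an acceptable replacement name; safe_or_fallback and its prefix parameter are hypothetical, and the sanitizer is passed in as an argument so the block runs without Werkzeug installed:

```python
import uuid
from typing import Callable


def safe_or_fallback(raw: str, sanitize: Callable[[str], str], prefix: str = "file") -> str:
    """Return sanitize(raw), or a generated name if that is empty (hypothetical helper)."""
    cleaned = sanitize(raw)
    if cleaned:
        return cleaned
    # Nothing usable survived sanitizing: fall back to a random, unique name.
    return f"{prefix}-{uuid.uuid4().hex}"
```

    With Werkzeug available, the call would look like safe_or_fallback(user_input, secure_filename).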

  • 2020-12-28 13:42

    More or less what has been mentioned here with regexp, but in reverse (replace any NOT listed):

    >>> import re
    >>> filename = "ad\nbla'{-+\\)(ç1?"
    >>> # re.ASCII limits \w to [a-zA-Z0-9_]; without it, Python 3 would keep 'ç'.
    >>> re.sub(r'[^\w-]', '_', filename, flags=re.ASCII)
    'ad_bla__-_____1_'
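    A possible refinement, not part of the answer above: decompose accented characters first so that "ç" becomes "c" instead of "_". This is a sketch assuming NFKD normalization followed by dropping all non-ASCII code points is acceptable for your input; the name ascii_fold is hypothetical:

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Fold accents to ASCII, then replace anything outside [A-Za-z0-9_-] (hypothetical helper)."""
    # NFKD splits 'ç' into 'c' plus a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points removes the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^\w-]", "_", ascii_only)
```

    Characters with no ASCII decomposition (CJK text, for example) are dropped entirely, so this only suits mostly Latin input.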
    
  • 2020-12-28 13:42

    The problem with many of the solutions here is that they only cover character substitution but not other issues.

    Here is a more comprehensive solution that should cover most of the bases. It handles several types of issues for you, including (but not limited to) character substitution.

    Works in Windows, *nix, and almost every other file system. Allows printable characters only.

    import re
    
    def txt2filename(txt, chr_set='normal'):
        """Converts txt to a valid Windows/*nix filename with printable characters only.
    
        args:
            txt: The str to convert.
            chr_set: 'normal', 'universal', or 'extended'.
                'universal':    ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
                'normal':       Every printable character except those disallowed on Windows/*nix.
                'extended':     All 'normal' characters plus the extended character codes 128-255
        """
    
        FILLER = '-'
    
        # Step 1: Remove excluded characters.
        if chr_set == 'universal':
            # Lookups in a set are O(1) on average vs O(n) for a str.
            printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
        else:
            if chr_set == 'normal':
                max_chr = 127
            elif chr_set == 'extended':
                max_chr = 256
            else:
                raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set=}')
            EXCLUDED_CHRS = set(r'<>:"/\|?*')               # Illegal characters in Windows filenames.
            EXCLUDED_CHRS.update(chr(127))                  # DEL (non-printable).
            printables = set(chr(x)
                             for x in range(32, max_chr)
                             if chr(x) not in EXCLUDED_CHRS)
        result = ''.join(x if x in printables else FILLER   # Allow printable characters only.
                         for x in txt)
    
        # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
        DEVICE_NAMES = 'CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,' \
                       'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,' \
                       'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,' \
                       'CONIN$,CONOUT$,..,.'.split(',')     # Split on commas; searching this list is O(n).
        if result.upper() in DEVICE_NAMES:                  # Windows matches device names case-insensitively.
            result = f'-{result}-'
    
        # Step 3: Limit the name to 255 characters (NTFS allows 255 characters, most Linux file systems 255 bytes).
        result = result[:255]
    
        # Step 4: Windows does not allow filenames to end with '.' or ' '; a leading ' ' or '.' is replaced as well.
        result = re.sub(r'^[. ]', FILLER, result)
        result = re.sub(r'[. ]$', FILLER, result)
    
        return result
    

    This solution needs no external libraries. It substitutes non-printable characters too, because they are not always simple to deal with.
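    One caveat on the length limit: slicing a str counts characters, while many file systems limit bytes, and the two differ as soon as non-ASCII characters are allowed. A minimal sketch of a byte-aware cut, assuming names are stored as UTF-8; truncate_utf8 is a hypothetical helper, not part of the function above:

```python
def truncate_utf8(name: str, max_bytes: int = 255) -> str:
    """Cut name to at most max_bytes of UTF-8 without splitting a character."""
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # errors="ignore" drops the incomplete multi-byte sequence left at the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

    Truncating can cut off a file extension, so a caller that cares about extensions may want to split it off first and shorten only the stem.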

  • 2020-12-28 13:44

    Python:

    "".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()
    

    This accepts Unicode characters but removes line breaks, etc.

    Example:

    filename = "ad\nbla'{-+\\)(ç?"
    

    gives: adblaç

    Edit: str.isalnum() checks for alphanumeric characters in one step (from a comment by queueoverflow). danodonovan suggested also keeping the dot:

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
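    The same filter wrapped in a function so it can be reused and tested. This is only a sketch; the name keep_alnum and the keep parameter are illustrative assumptions:

```python
def keep_alnum(filename: str, keep: str = " ._") -> str:
    """Keep alphanumeric characters (any script) plus those listed in keep."""
    kept = "".join(c for c in filename if c.isalnum() or c in keep)
    # Trailing whitespace is stripped, as in the one-liner above.
    return kept.rstrip()
```

    Note that a name made only of dots still passes this filter, so the '.' and '..' cases need a separate check.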
    