I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).
No solutions here, only problems that you must consider:
what is your minimum maximum filename length? (e.g. DOS supports only 8-11 characters; most OSes don't support more than 255 characters)
what filenames are forbidden in some contexts? (Windows still doesn't support saving a file as CON.TXT -- see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)
remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.
is there a risk that filenames will collide -- either due to removal of characters or the same filename being used multiple times?
Consider just hashing the data and using the hexdump of that as a filename?
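A minimal sketch of the hashing idea, using only the standard library. The function names (hashed_filename, truncate_with_hash), the choice of SHA-256, and the 255-byte default are assumptions for illustration, not something the points above prescribe. The second helper keeps a readable prefix and always appends a short hash of the full name, so two long names that differ only after the cut still get different results. Neither helper removes unsafe characters; that has to be done separately.

```python
import hashlib

def hashed_filename(text: str, ext: str = "") -> str:
    """Name a file after the SHA-256 hex digest of text (64 hex characters)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest() + ext

def truncate_with_hash(name: str, max_bytes: int = 255, hash_len: int = 8) -> str:
    """Cut name to fit max_bytes of UTF-8 and append a short hash of the full name."""
    suffix = "-" + hashlib.sha256(name.encode("utf-8")).hexdigest()[:hash_len]
    budget = max_bytes - len(suffix)  # the suffix is pure ASCII, so len() equals its byte count
    # Slicing bytes may split a multi-byte character; errors="ignore" drops the fragment.
    head = name.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return head + suffix

print(hashed_filename("some random Unicode string: ç/\\?*", ".txt"))
print(truncate_with_hash("ç" * 300))
```

The full digest is unreadable but cannot contain problematic characters; the truncated variant trades some collision resistance (only hash_len hex digits) for readability.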
If you don't mind importing another package, werkzeug has a function for sanitizing filenames:
>>> from werkzeug.utils import secure_filename
>>> secure_filename("hello.exe")
'hello.exe'
>>> secure_filename("/../../.ssh")
'ssh'
>>> secure_filename("DROP TABLE")
'DROP_TABLE'
>>> # fork bomb on Linux
>>> secure_filename(": () {: |: &} ;:")
''
>>> # delete all system files on Windows
>>> secure_filename("del*.*")
'del'
https://pypi.org/project/Werkzeug/
This is more or less the regexp approach already mentioned here, but in reverse (replace any character NOT listed):
>>> import re
>>> filename = u"ad\nbla'{-+\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'
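The output above is what Python 2 produces. In Python 3, \w is Unicode-aware by default for str patterns, so ç would be kept unless re.ASCII is passed. A small sketch of the difference (the variable names are just for illustration, and \d is dropped because \w already covers digits):

```python
import re

filename = "ad\nbla'{-+\\)(ç1?"

# Python 3 default: \w matches Unicode word characters, so 'ç' survives.
unicode_aware = re.sub(r"[^\w-]", "_", filename)

# With re.ASCII, \w means [a-zA-Z0-9_] only, so 'ç' is replaced too.
ascii_only = re.sub(r"[^\w-]", "_", filename, flags=re.ASCII)

print(unicode_aware)  # ad_bla__-____ç1_
print(ascii_only)     # ad_bla__-_____1_
```

Which one you want depends on whether the target file system and tools handle non-ASCII names well.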
The problem with many of the solutions here is that they only cover character substitution and not the other issues.
Here is a more comprehensive solution that should cover most of the bases. It handles several types of issues for you, including (but not limited to) character substitution.
Works in Windows, *nix, and almost every other file system. Allows printable characters only.
import re

def txt2filename(txt, chr_set='normal'):
    """Converts txt to a valid Windows/*nix filename with printable characters only.

    args:
        txt: The str to convert.
        chr_set: 'normal', 'universal', or 'extended'.
            'universal': ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
            'normal': Every printable character except those disallowed on Windows/*nix.
            'extended': All 'normal' characters plus the extended character codes 128-255.
    """
    FILLER = '-'

    # Step 1: Remove excluded characters.
    if chr_set == 'universal':
        # Lookups in a set are O(1) on average vs O(n) for a str.
        printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
    else:
        if chr_set == 'normal':
            max_chr = 127
        elif chr_set == 'extended':
            max_chr = 256
        else:
            raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set=}')
        EXCLUDED_CHRS = set(r'<>:"/\|?*')  # Illegal characters in Windows filenames.
        EXCLUDED_CHRS.update(chr(127))  # DEL (non-printable).
        printables = set(chr(x)
                         for x in range(32, max_chr)
                         if chr(x) not in EXCLUDED_CHRS)
    result = ''.join(x if x in printables else FILLER  # Allow printable characters only.
                     for x in txt)

    # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
    DEVICE_NAMES = 'CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,' \
                   'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,' \
                   'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,' \
                   'CONIN$,CONOUT$,..,.'.split(',')  # Membership tests on this list are O(n).
    if result.upper() in DEVICE_NAMES:  # Windows matches device names case-insensitively.
        result = f'-{result}-'

    # Step 3: Maximum length of filename is 255 bytes in Windows and Linux (other *nix flavors may allow longer names).
    # Note: this slices characters, not bytes; 'extended' characters may take more than one byte each.
    result = result[:255]

    # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
    result = re.sub(r'^ ', FILLER, result)
    result = re.sub(r'[. ]$', FILLER, result)

    return result
This solution needs no external libraries. It also substitutes non-printable characters, because they are not always simple to deal with.
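One case worth checking separately: the device-name test in the function above compares the whole filename against the list, so a reserved name followed by an extension (such as CON.TXT, which Windows also rejects) passes through. A rough sketch of a stricter check follows; the helper name is_reserved is hypothetical, and treating everything before the first dot as the base name is an assumption about how Windows applies the rule.

```python
# Reserved names taken from the list in the function above.
RESERVED = {"CON", "PRN", "AUX", "NUL", "CONIN$", "CONOUT$"}
RESERVED |= {f"COM{i}" for i in range(1, 10)}
RESERVED |= {f"LPT{i}" for i in range(1, 10)}

def is_reserved(name: str) -> bool:
    """Return True for '.', '..', or a device name with or without an extension."""
    if name in {".", ".."}:
        return True
    base = name.split(".", 1)[0]  # text before the first dot (assumed base name)
    # Compare case-insensitively and ignore trailing spaces in the base name.
    return base.rstrip(" ").upper() in RESERVED

for candidate in ["CON.TXT", "con", "console.txt", "COM1.tar.gz", ".."]:
    print(candidate, is_reserved(candidate))
```

A caller could prefix or wrap the name with a filler character whenever is_reserved returns True, in the same way Step 2 does.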
Python:
"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()
This accepts Unicode characters but removes line breaks, etc.
Example:
filename = u"ad\nbla'{-+\)(ç?"
gives: adblaç
Edit: str.isalnum() checks for alphanumeric characters in one step (per a comment from queueoverflow). danodonovan suggested also keeping the dot:
keepcharacters = (' ','.','_')
"".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()