What is the most correct regular expression (regex) for a UNIX file path?
For example, to detect something like this:
/usr/lib/libgccpp.so.1.0.2
If you don't mind false positives when identifying paths, then you really just need to ensure the path doesn't contain a NUL character; everything else is permitted (in particular, / is the name-separator character). The better approach would be to resolve the given path using the appropriate file I/O function (e.g. File.exists() or File.getCanonicalFile() in Java).
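As a rough illustration of both ideas, here is a minimal Python sketch (Python rather than the Java calls mentioned above; the function names are made up for this example). It first applies the permissive "no NUL" rule and then asks the filesystem to resolve the path.

```python
import os


def looks_like_unix_path(path: str) -> bool:
    """Permissive check: non-empty and free of NUL characters."""
    return len(path) > 0 and "\0" not in path


def resolve_if_exists(path: str):
    """Return the canonical form of `path` if it exists, otherwise None."""
    if not looks_like_unix_path(path):
        return None
    if not os.path.exists(path):
        return None
    # realpath() resolves symlinks and "." / ".." components.
    return os.path.realpath(path)
```

The first function accepts plenty of strings that are not paths to anything; the second only answers for the machine the code runs on.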
Long answer:
This is both operating system and file system dependent. For example, the Wikipedia comparison of file systems notes that, besides the limits imposed by the file system:
MS-DOS, Microsoft Windows, and OS/2 disallow the characters \ / : ? * " > < | and NUL in file and directory names across all filesystems. Unices and Linux disallow the characters / and NUL in file and directory names across all filesystems.
In Windows, the following reserved device names are also not permitted as filenames:
CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5,
COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4,
LPT5, LPT6, LPT7, LPT8, LPT9
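To make the Windows rules above concrete, here is a small Python sketch that checks a single file or directory name (not a whole path) against the forbidden characters and reserved device names listed. The function name is hypothetical, and comparing the reserved names case-insensitively is my assumption; the sketch covers only the rules quoted in this answer.

```python
# Characters disallowed in names on MS-DOS, Windows, and OS/2 (from the list above).
FORBIDDEN_CHARS = set('\\/:?*"><|\0')

# Reserved device names (from the list above).
RESERVED_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_valid_windows_name(name: str) -> bool:
    """Check one path component against the rules quoted above."""
    if not name:
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    # Assumption: reserved names are matched regardless of case.
    return name.upper() not in RESERVED_NAMES
```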
I'm not sure how common a regex check for this is across systems, but most programming languages (especially the cross-platform ones) provide a "file exists" check that takes this kind of thing into account.
Out of curiosity, where are these paths being input? Could you control that to a greater degree, to the point where you won't have to check the individual pieces of the path? For example, by using a file chooser dialog?
^(/)?([^/\0]+(/)?)+$
This will accept every path that is legal in filesystems such as extX and reiserfs.
It rejects only path names that contain NUL or two or more consecutive slashes. Everything else should be legal according to the Unix spec (I'm surprised by this outcome too).
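Here is a quick way to try the pattern above, sketched with Python's re module (the names UNIX_PATH and is_legal_path are just for this example). Python's regex syntax accepts the pattern unchanged, with \0 standing for NUL inside the character class.

```python
import re

# The pattern from this answer, unchanged.
UNIX_PATH = re.compile(r"^(/)?([^/\0]+(/)?)+$")


def is_legal_path(path: str) -> bool:
    """True if `path` matches the pattern above."""
    return UNIX_PATH.match(path) is not None


for candidate in ["/usr/lib/libgccpp.so.1.0.2", "relative/dir/", "/usr//lib", "/"]:
    print(repr(candidate), is_legal_path(candidate))
```

Two things worth noticing when experimenting: a lone "/" is not matched, because the pattern requires at least one non-slash character, and the nested quantifiers may backtrack heavily on long inputs that do not match.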
Question already answered here: https://stackoverflow.com/a/42036026/1951947
To others who have answered this question: it's important to note that some applications require a slightly different regex, depending on how escape characters work in the program you're writing. If you were writing a shell, for example, and wanted commands to be separated by spaces and other special characters, you would have to modify your regex so that it only accepts words containing special characters if those characters are escaped.
So, for example, a valid path would be
/usr/bin/program\ with\ space
as opposed to
/usr/bin/program with space
which would refer to "/usr/bin/program" with the arguments "with" and "space".
A regex for the above example could be "([^\0 ]\|\\ )*"
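The regex above appears to use grep/sed-style syntax, where \| means alternation. As a sketch, the same idea in Python's re syntax might look like the following (the names are hypothetical). I put the escaped-space alternative first so that a prefix match consumes "\ " as a unit instead of stopping right after the backslash.

```python
import re

# Either an escaped space, or any character that is neither NUL nor a space.
ESCAPED_WORD = re.compile(r"(?:\\ |[^\0 ])*")


def first_word(command_line: str) -> str:
    """Return the leading word, treating backslash-space as part of the word."""
    return ESCAPED_WORD.match(command_line).group(0)


print(first_word(r"/usr/bin/program\ with\ space"))
print(first_word("/usr/bin/program with space"))
```

With the escaped form the whole string is one word; with the unescaped form the word ends at the first space, leaving "with" and "space" as separate arguments.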
The regex that I've been working on is (newline separated for 'readability'):
"\(                                     # Either
    [^\0 !$`&*()+]                      # A normal (non-special) character
\|                                      # Or
    \\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)    # An escaped special character
\)\+"                                   # Repeated >= 1 times
Which translates to
"\([^\0 !$`&*()+]\|\\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)\)\+"
Creating your own specific regex should be relatively simple, as well.
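For anyone adapting this to another regex flavor, here is one possible reading of the longer pattern in Python's re syntax. This is a sketch, not a verified translation: I am assuming the inner group is meant as "a backslash followed by one of the special characters", and the name SHELL_WORD is made up for the example.

```python
import re

# One or more of: an escaped special character, or a normal character.
# The special characters are the ones listed in the regex above: space ! $ ` & * ( ) +
SHELL_WORD = re.compile(r"(?:\\[ !$`&*()+]|[^\0 !$`&*()+])+")


def is_single_word(text: str) -> bool:
    """True if `text` is one word with every special character escaped."""
    return SHELL_WORD.fullmatch(text) is not None


print(is_single_word(r"/usr/bin/program\ with\ space"))
print(is_single_word("/usr/bin/program with space"))
```

To change which characters count as special, edit both character classes so they stay in sync.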
The proper regular expression to match all UNIX paths is: [^\0]+
That is, one or more characters that are not a NUL.