Which characters make a URL invalid?
Are these valid URLs?
example.com/file[/].html
http://example.com/file[/].html
<
Several of Unicode character ranges are valid HTML5, although it might still not be a good idea to use them.
E.g., href
docs say http://www.w3.org/TR/html5/links.html#attr-hyperlink-href:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:
Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.
That document defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in the statement:
If c is not a URL code point and not "%", parse error.
in a several parts of the parsing algorithm, including the schema, authority, relative path, query and fragment states: so basically the entire URL.
Also, the validator http://validator.w3.org/ passes for URLs like "你好"
, and does not pass for URLs with characters like spaces "a b"
Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since class "URL code points" is used on key points of the algorithm, it that gives a good idea of what you can use or not.
See also: Unicode characters in URLs
Not really an answer to your question but validating url's is really a serious p.i.t.a You're probably just better off validating the domainname and leave query part of the url be. That is my experience. You could also resort to pinging the url and seeing if it results in a valid response but that might be too much for such a simple task.
Regular expressions to detect url's are abundant, google it :)
I need to select character to split urls in string, so I decided to create list of characters which could not be found in URL by myself:
>>> allowed = "-_.~!*'();:@&=+$,/?%#[]?@ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
>>> from string import printable
>>> ''.join(set(printable).difference(set(allowed)))
'`" <\x0b\n\r\x0c\\\t{^}|>'
So, the possible choices are the newline, tab, space, backslash and "<>{}^|
. I guess I'll go with the space or newline. :)
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases, but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note the newer RFC-3986 (update to RFC-1738) defines the construction of what characters are allowed in a given context but the older spec offers a simpler and more general description of which characters are not allowed with the following rules.
Excluded US-ASCII Characters disallowed within the URI syntax:
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.
List of unwise characters are allowed but may cause problems:
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Characters that are reserved within a query component and/or have special meaning within a URI/URL:
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://user@hostname/
where the '@' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg
Some of the character restrictions for URIs/URLs are programming language dependent. For example, the '|' (0x7C) character although only marked as "unwise" in the URI spec will throw a URISyntaxException in the Java java.net.URI constructor so a URL like http://api.google.com/q?exp=a|b
is not allowed and must be encoded instead as http://api.google.com/q?exp=a%7Cb
if using Java with a URI object instance.
I am implementing old http (0.9, 1.0, 1.1) request and response reader/writer. Request URI is the most problematic place.
You can't just use RFC 1738, 2396 or 3986 as it is. There are many old HTTP clients and servers that allows more characters. So I've made research based on accidentally published webserver access logs: "GET URI HTTP/1.0" 200
.
I've found that the following non-standard characters are often used in URI:
\ { } < > | ` ^ "
These characters were described in RFC 1738 as unsafe.
If you want to be compatible with all old HTTP clients and servers - you have to allow these characters in request URI.
Please read more information about this research in oghttp-request-collector.
In your supplementary question you asked if www.example.com/file[/].html
is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http:
(see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html
is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar
(i.e. an IPv6 literal instead of a host name)
It's worth reading RFC 3986 carefully if you want to understand the issue fully.