Which characters make a URL invalid?

小蘑菇 2020-11-21 05:03

Are these valid URLs?

  • example.com/file[/].html
  • http://example.com/file[/].html
10 answers
  • 2020-11-21 05:50

    All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.

    All other characters can be used in a URL provided that they are "URL encoded" (percent-encoded) first. This means replacing each invalid character with a specific code: a percent sign (%) followed by two hexadecimal digits for each byte of the character.

    This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
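As a rough illustration of the encoding described above, here is a minimal sketch using Python's standard `urllib.parse` module. The sample strings are taken from this thread; the default "safe" set used by `quote()` is Python's own choice and is not identical to any one rule in RFC 3986.

```python
from urllib.parse import quote, unquote

# quote() percent-encodes the UTF-8 bytes of every character outside its
# safe set. By default that set is ASCII letters, digits, "_.-~" and "/".
encoded = quote("file[/].html")
print(encoded)                  # file%5B/%5D.html

# A non-ASCII character becomes one %hh triplet per UTF-8 byte.
print(quote("Möbius_strip"))    # M%C3%B6bius_strip

# unquote() reverses the encoding.
print(unquote(encoded))         # file[/].html
```

Which characters are left alone can be tuned with the `safe` argument of `quote()`, since different parts of a URL reserve different characters.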

  • 2020-11-21 05:53

    In general URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=
    

    Note that this list doesn't state where in the URI these characters may occur.

    Any other character needs to be percent-encoded (%hh). Each part of the URI has further restrictions on which characters must be percent-encoded.
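A small sketch of how the 84-character list could be used as a first-pass filter. The helper name `chars_needing_encoding` is made up for this example, and the check deliberately ignores where in the URI a character appears.

```python
import string

# The 84 characters quoted above, grouped the way RFC 3986 groups them.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_CHARS = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS)

def chars_needing_encoding(text):
    """Return the characters of `text` that fall outside the 84-character set.

    "%" is tolerated so that existing %hh sequences are not flagged; this
    sketch does not verify that each "%" is followed by two hex digits.
    """
    return [c for c in text if c not in URI_CHARS and c != "%"]

print(len(URI_CHARS))                                       # 84
print(chars_needing_encoding("http://example.com/a b<c>"))  # [' ', '<', '>']
```

An empty result only means every character is allowed somewhere in a URI, not that the string as a whole is a valid URI.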

  • 2020-11-21 05:55

    Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:

    • https://en.wikipedia.org/wiki/Möbius_strip or
    • https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.

    First, a digression into terminology. What are these addresses? Are they valid URLs?

    Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since a URL is a type of URI). Per the terminology of 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.
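The IRI-to-URI conversion mentioned above can be sketched in a few lines. The function name `iri_to_uri` is hypothetical, and the sketch assumes the host part is already ASCII; it makes no attempt to handle internationalized domain names.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character as its UTF-8 bytes.

    Assumption: the host is plain ASCII. ASCII characters, including any
    existing %hh sequences, are passed through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

print(iri_to_uri("https://en.wikipedia.org/wiki/Möbius_strip"))
# https://en.wikipedia.org/wiki/M%C3%B6bius_strip
```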

    Per modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.

    What characters are allowed under the WHATWG Living Standard?

    Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are

    URL code points and percent-encoded bytes.

    What are "URL code points"?

    The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.

    (Note that the list of "URL code points" doesn't include %, but % is allowed in "URL units" when it's part of a percent-encoded byte.)

    The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.
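To make the quoted definition concrete, here is a sketch of a membership test for "URL code points". The function names are invented for this example; the ranges come straight from the definition quoted above, with noncharacters taken to be U+FDD0 to U+FDEF plus the last two code points of every plane.

```python
import string

# The ASCII part of the definition: alphanumerics plus 19 punctuation marks.
ASCII_URL_CODE_POINTS = set(
    string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~"
)

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_url_code_point(ch):
    cp = ord(ch)
    if cp < 0x80:
        return ch in ASCII_URL_CODE_POINTS
    if not 0xA0 <= cp <= 0x10FFFD:
        return False              # C1 controls and the last two code points
    if 0xD800 <= cp <= 0xDFFF:
        return False              # surrogates
    return not is_noncharacter(cp)

print(is_url_code_point("ö"), is_url_code_point("["), is_url_code_point("%"))
# True False False
```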

    What characters were allowed under the old RFCs?

    For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.

    First of all, we have two types of RFC 3986 reserved characters:

    • :/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
    • !$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).

    Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)

    RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:

    • abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~

    Finally, the % character itself is allowed for percent-encodings.

    That leaves only the following ASCII characters that are forbidden from appearing in a URL:

    • The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
    • The space character and "<>\^`{|}

    Every other character from ASCII can legally feature in a URL.
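The forbidden set can be double-checked by subtracting the allowed characters from printable ASCII. This is only a sketch of that arithmetic, with variable names chosen for the example; the result is the space character plus the nine punctuation characters.

```python
import string

# Unreserved + both groups of reserved characters + "%" for percent-encodings.
allowed = set(
    string.ascii_letters + string.digits + "-._~"   # unreserved
    + ":/?#[]@"                                     # generic-syntax delimiters
    + "!$&'()*+,;="                                 # scheme-specific delimiters
    + "%"
)

# Printable ASCII runs from U+0020 (space) to U+007E (~).
printable_ascii = [chr(cp) for cp in range(0x20, 0x7F)]
forbidden = "".join(c for c in printable_ascii if c not in allowed)

print(len(forbidden))   # 10
print(list(forbidden))
```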

    Then RFC 3987 extends that set of unreserved characters with the following Unicode character ranges:

      %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
    / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
    / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
    / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
    / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
    / %xD0000-DFFFD / %xE1000-EFFFD
    

    These range choices from the old spec look bizarre and arbitrary next to the latest Unicode block definitions; this is probably because blocks have been added in the years since RFC 3987 was written.
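For anyone who wants to test a character against the RFC 3987 ranges listed above, a minimal sketch follows. The names `UCSCHAR_RANGES` and `is_ucschar` are made up here; the numbers are copied from the list.

```python
# The three BMP ranges, then planes 1 through 13 (0x1..0xD), then plane 14's tail.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)
]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

def is_ucschar(ch):
    """True if `ch` falls in one of the ranges RFC 3987 adds to 'unreserved'."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)

print(len(UCSCHAR_RANGES))                    # 17
print(is_ucschar("ö"), is_ucschar("\ue000"))  # True False
```

U+E000 is a private-use code point, which is one of the gaps the ranges leave out.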


    Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
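The position-dependence of [ and ] can be demonstrated with a small check. This sketch leans on Python's `urllib.parse.urlsplit`, which is a generic splitter rather than a validator, and the helper name `brackets_outside_host` is invented for the example. It only looks at the path, query, and fragment.

```python
from urllib.parse import urlsplit

def brackets_outside_host(url):
    """Return True if "[" or "]" appears in the path, query, or fragment."""
    parts = urlsplit(url)
    rest = parts.path + parts.query + parts.fragment
    return any(c in "[]" for c in rest)

print(brackets_outside_host("http://[1080::8:800:200C:417A]/foo"))  # False
print(brackets_outside_host("http://example.com/file[/].html"))     # True
```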

  • 2020-11-21 05:56

    I came up with a couple of regular expressions for PHP that convert URLs in text to anchor tags. (First it converts all www. URLs to http://, then it converts all URLs matching https?:// to <a href=...> HTML links.)

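    // Step 1: prefix bare www. URLs that follow whitespace with http://
    $string = preg_replace('/(\s)(www\.[!#$&-;=?\-\[\]_a-z~%]+)/i', '$1http://$2', $string);
    // Step 2: wrap every http:// or https:// URL in an anchor tag
    $string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/i', '<a href="$1$2">$2</a>', $string);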
