I am using Apache HttpClient 4 for all of my web access. This means that every request I make has to pass the URI syntax checks. One of the sites that I am tr
(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)
It doesn't really. That's not URL-encoding, and the sequence %u is invalid in a URL.
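As a rough illustration of that point, the sketch below runs both forms of the parameter through the JDK's own java.net.URI parser rather than HttpClient itself. The host example.com and the helper name isValidUri are placeholders invented for this example, not anything from the original site.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PercentUCheck {
    // Hypothetical helper: reports whether java.net.URI accepts the string.
    static boolean isValidUri(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            // A '%' that is not followed by two hex digits is a malformed escape.
            return false;
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder host.
        System.out.println(isValidUri("http://example.com/search?srh_txt=%u05E0%u05D9%u05D1")); // false
        System.out.println(isValidUri("http://example.com/search?srh_txt=%D7%A0%D7%99%D7%91")); // true
    }
}
```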
%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. The escape function is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.
(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
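If you are handed a string that already contains %u#### escapes, one possible way to recover standard encoding on the Java side is to decode the escapes back into characters and then percent-encode those as UTF-8. The following is only a sketch: the class and method names are made up for illustration, and URLEncoder applies form encoding (a space becomes +), which is normally acceptable for query parameter values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PercentUEscapes {
    private static final Pattern PERCENT_U = Pattern.compile("%u([0-9A-Fa-f]{4})");

    // Replaces each %uXXXX escape with the UTF-16 code unit it names;
    // everything else is copied through unchanged.
    static String decodePercentU(String input) {
        Matcher m = PERCENT_U.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Percent-encodes a parameter value as UTF-8 bytes.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decodePercentU("%u05E0%u05D9%u05D1");
        System.out.println(encodeUtf8(decoded)); // %D7%A0%D7%99%D7%91
    }
}
```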
If a site requires %u#### sequences in its query string, it is very badly broken.
Is there any way of creating URI in non UTF-8 encoding?
Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.
So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
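A minimal sketch of producing such byte-level escapes in Java follows. It assumes the JDK in use ships the windows-1255 charset (it belongs to the extended charsets, which standard JDK builds include), and percentEncode is a hypothetical helper written for this example, not an HttpClient API. It is only meant for ASCII-compatible encodings such as UTF-8 or code page 1255.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyCharsetEncoding {
    // Hypothetical helper: converts the text to bytes in the given charset,
    // keeps RFC 3986 unreserved ASCII as-is, and writes every other byte as %XX.
    static String percentEncode(String value, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(charset)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String term = "\u05E0\u05D9\u05D1"; // the Hebrew search term from the question
        // Assumption: windows-1255 is available in this JDK.
        Charset cp1255 = Charset.forName("windows-1255");
        System.out.println("srh_txt=" + percentEncode(term, cp1255));                 // srh_txt=%F0%E9%E1
        System.out.println("srh_txt=" + percentEncode(term, StandardCharsets.UTF_8)); // srh_txt=%D7%A0%D7%99%D7%91
    }
}
```

The resulting query string contains only ordinary %XX escapes, so it should pass URI syntax checks; whether the server interprets the bytes as intended still depends entirely on the site.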