URI encoding in UNICODE for apache httpclient 4

温柔的废话 2021-01-23 09:47

I am working with apache http client 4 for all of my web accesses. This means that every query that I need to do has to pass the URI syntax checks. One of the sites that I am tr

1 answer
  •  清歌不尽
    2021-01-23 10:45

    (the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

    It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
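
    As a rough illustration of why the URI syntax check fails, the sketch below feeds both forms of the query to java.net.URI, the class HttpClient 4 requests are built on. The host example.com and the class name PercentUCheck are placeholders, not part of the original question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PercentUCheck {
    /** Returns true if java.net.URI accepts the string, false if parsing fails. */
    static boolean parses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            // A '%' that is not followed by two hex digits is a malformed escape.
            return false;
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder host.
        String percentU = "http://example.com/search?srh_txt=%u05E0%u05D9%u05D1";
        String utf8 = "http://example.com/search?srh_txt=%D7%A0%D7%99%D7%91";

        System.out.println(parses(percentU)); // false
        System.out.println(parses(utf8));     // true
    }
}
```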

    %u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape() syntax. escape() is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.

    (One should, in general, never use escape(). Using encodeURIComponent() instead produces the correct URL-encoded UTF-8: ניב becomes %D7%A0%D7%99%D7%91.)
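
    On the Java side, a minimal sketch of the same UTF-8 encoding could use java.net.URLEncoder. Note that URLEncoder performs form encoding (a space becomes +), so it is not an exact equivalent of encodeURIComponent; for this value the result is the same. The class name Utf8QueryValue is made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class Utf8QueryValue {
    /** Percent-encodes a query value as UTF-8 bytes (form-encoding rules). */
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every JVM must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three Hebrew letters nun, yod, bet, written as Unicode escapes
        // so the source file encoding does not matter.
        String value = "\u05E0\u05D9\u05D1";
        System.out.println("srh_txt=" + encodeUtf8(value));
        // prints srh_txt=%D7%A0%D7%99%D7%91
    }
}
```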

    If a site requires %u#### sequences in its query string, it is very badly broken.

    Is there any way of creating URI in non UTF-8 encoding?

    Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.

    So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
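
    A sketch of producing that code page 1255 form in Java and wrapping it in a java.net.URI that HttpClient 4 could take (for instance via new HttpGet(uri)). It assumes the JVM ships the windows-1255 charset, which is an extended charset rather than a guaranteed one; the host example.com, the class name LegacyCharsetQuery, and the method names are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

final class LegacyCharsetQuery {
    /** Percent-encodes a value using the bytes of the named charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // windows-1255 is not guaranteed to exist on every JVM.
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    /** Builds a URI whose query is already percent-encoded, so it is kept as-is. */
    static URI buildSearchUri(String base, String param, String value, String charsetName) {
        return URI.create(base + "?" + param + "=" + encode(value, charsetName));
    }

    public static void main(String[] args) {
        String value = "\u05E0\u05D9\u05D1"; // nun, yod, bet
        // example.com is a placeholder for the real site.
        URI uri = buildSearchUri("http://example.com/search", "srh_txt", value, "windows-1255");
        System.out.println(uri.getRawQuery());
        // prints srh_txt=%F0%E9%E1
    }
}
```

    Because the percent-escapes are already in place when the URI is parsed, the bytes pass through untouched; the charset choice lives entirely in the encode step.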
