Unicode vs UTF-8 confusion in Python / Django?

前端 未结 5 1922
渐次进展
渐次进展 2020-12-14 00:53

I stumbled over this passage in the Django tutorial:

Django models have a default str() method that calls unicode()

相关标签:
5条回答
  • 2020-12-14 01:10

    Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

    0 讨论(0)
  • 2020-12-14 01:16

    so what is a "Unicode string" in Python?

    Python 'knows' that your string is Unicode. Hence if you do regex on it, it will know which is character and which is not etc, which is really helpful. If you did a strlen it will also give the correct result. As an example if you did string count on Hello, you will get 5 (even if it's Unicode). But if you did a string count of a foreign word and that string was not a Unicode string than you will have much larger result. Pythong uses the information form the Unicode Character Database to identify each character in the Unicode String. Hope that helps.

    0 讨论(0)
  • 2020-12-14 01:20

    From Wikipedia on UTF-8:

    UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

    So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.

    From Wikipedia on Unicode:

    In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

    So it's able to represent most (but not all) of the world's writing systems.

    I hope this helps :)

    0 讨论(0)
  • 2020-12-14 01:23

    what is a "Unicode string" in Python? Does that mean UCS-2?

    Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.

    You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.

    plain wrong, or is it?

    Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

    There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

    0 讨论(0)
  • 2020-12-14 01:26

    Meanwhile, I did a refined research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

    0 讨论(0)
提交回复
热议问题