Could you explain in detail what the difference is between byte string and Unicode string in Python. I have read this:
Byte code is simply the convert
Here's an attempt at a simple explanation that only applies to Python 3. I hope that coming from a lay person, it would help to clear some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point it out.
Suppose you create a string using Python 3 in the usual way:
stringobject = 'ant'
stringobject
would be a unicode string.
A unicode string is made up of unicode characters. In stringobject
above, the unicode characters are the individual letters, e.g. a, n, t
Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a'
is equivalent to '\u0061'
, and 'ant' is equivalent to '\u0061\u006E\u0074'
.
So you will find that if you type in,
stringobject = '\u0061\u006E\u0074'
stringobject
You will also get the output 'ant'
.
Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.
How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.
For instance, the ASCII encoding system uses only 8 bits (1 byte) per character. Thus it can only encode unicode characters with code points up to two hex digits long (i.e. 256 different unicode characters). The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode unicode characters with code points up to 8 hex digits long, i.e. everything.
Running the following code:
byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)
converts a unicode string into a byte string using the utf-8 encoding system, and returns b'ant', bytes'
.
Note that if you used 'ASCII' as the encoding system, you wouldn't run into any problems since all code points in 'ant' can be expressed with 1 byte. But if you had a unicode string containing characters with code points longer than two hex digits, you would get a UnicodeEncodeError
.
Similarly,
stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)
gives you 'ant', str
.
No python does not use its own encoding. It will use any encoding that it has access to and that you specify. A character in a str
represents one unicode character. However to represent more than 256 characters, individual unicode encodings use more than one byte per character to represent many characters. bytearray
objects give you access to the underlaying bytes. str
objects have the encode
method that takes a string representing an encoding and returns the bytearray
object that represents the string in that encoding. bytearray
objects have the decode
method that takes a string representing an encoding and returns the str
that results from interpreting the bytearray
as a string encoded in the the given encoding. Here's an example.
>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'
We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac to represent two characters. After the Spolsky article that Ignacio Vazquez-Abrams referred to, I would read the Python Unicode Howto.