What\'s a Python bytestring?
All I can find are topics on how to encode to bytestring or decode to ascii
or utf-8
. I\'m trying to understa
Python does not know how to represent a bytestring. That's the point.
When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files too don't have an encoding. They're just a series of bytes. These bytes get translated into letters by the text editor but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
It is a common misconception that text is ascii or utf8 or cp1252, and therefore bytes are text.
Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes: Jpeg, png, svg, and likewise many ways to encode text, ascii, utf8 or cp1252.
Once encoding has happened, bytes are just bytes. Bytes are not images anymore, they have forgotten the colors they mean; although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember wether they were images or text at all. Only out of band knowledge (filename, media headers, etcetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption)
so, in python (py3), we have two types for things that might otherwise look similar; For text, we have str
, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytestring
, which doesn't know if it's text or images or any other kind of data.
The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of is quite different.
Implementationally, str
is stored in memory as UCS-?
where the ? is implementation defined, it may be UCS4, UCS2 or UCS1, depending on compile time options and which codepoints are present in the represented string.
edit "but why"?
Some things that look like text are actually defined in other terms. A really good example of this are the many internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFC's. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:
2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
This distinction is important, because it's not possible to send text over the internet, the only thing you can do is send bytes. saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers need to now somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked
, so long as both the server and the browser understand it correctly.
By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts"
the distinct types thus give you a way to say "this value 'means' text" or "bytes".
As the name implies, a Python3 bytestring
(or simply a str
in
Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.
It is distinct from a Python3
str
(or, more descriptively, a unicode
in Python 2.7) which is a
string of abstract unicode characters (a.k.a. UTF-32, though Python3 adds fancy compression under the hood to reduce the actual memory footprint similar to UTF-8, perhaps even in a more general way).
There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:
>>> ord(b'Hello'[0]) # Python 2.7 str
72
>>> b'Hello'[0] # Python 3 bytestring
72
Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:
>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H
As Jack hinted, in this latter case it is your terminal interpreting the character, not Python.
Finally, as you have seen in your own research, you can also get Python to interpret a bytestring
. For example, you can construct an abstract unicode
object like this in Python 2.7:
>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>
Or like this in Python 3:
>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1
(and I am sure that the amount of syntax churn between Python2.7 and Python3 around bystestring, strings, and unicode had something to do with the continued popularity of Python2.7. I suppose that when Python3 was invented they didn't yet realize that everything would become UTF-8 and therefore all the fuss about abstraction was unnecessary)
But unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring
is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:
>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0]
'\xe1'
and this works in both Python2.7 and Python3, with the difference being that in Python2.7 you have str
, while in Python3 you have bytestring
.
You can also do other wonderful things with bytestring
s, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length
field correctly, and avoiding Python Bug 8260. In short, use bytestring
s when your data is processed and stored in bytes.
Bytes objects are immutable sequences of single bytes. The docs have a very good explanation of what they are and how to use them.