Question
This seems to be the type of question that should have a lot of duplicates and plenty of answers, but my searches have led only to frustration and no useable solutions.
In Python (preferably 3.x), I would like to know how I can open a file of an arbitrary type, read the bytes that are stored on disk, and present those bytes in their most 'native', 'original', 'raw' form, before any encoding is done on them.
If the file is stored on disk as a stream of 00010100 10000100 ... then that's what I would like to have presented on the screen.
These sorts of questions usually elicit the response 'why do you want to know' and 'what's the use case'. I'm curious, that's my use case.
Before you mark this as duplicate, please be sure that the answer you have in mind does indeed answer the question (rather than merely discuss encodings, etc.). Thank you!
EDIT AFTER FIRST THREE ANSWERS:
Thanks to the three responders up to this point, and especially to J.F. Sebastian for the extended discussion. It appears from what has been said that my question boils down to how bytes in files are physically recorded to disk and how they can be read and presented. At this point it doesn't seem possible in Python to obtain a view onto the bytes in their raw form, but they are available in various representations: integers, hex values, ASCII, etc. As the matter isn't settled, I will leave the question open for more input.
Answer 1:
'rb' mode enables you to read raw binary data from a file in Python:
with open(filename, 'rb') as file:
    raw_binary_data = file.read()
Here type(raw_binary_data) == bytes; bytes is an immutable sequence of bytes in Python.
Don't confuse bytes with their text representation: print(raw_binary_data) would show you the text representation of the data. For example, a byte 127 (base 10: decimal), which you can also represent as bin(127) == '0b1111111' (base 2: binary) or as hex(127) == '0x7f' (base 16: hexadecimal), is shown as b'\x7f' (seven ASCII characters are printed). Bytes from the printable ASCII range are represented as the corresponding ASCII characters, e.g., b'\x41' is shown as b'A' (65 == 0x41 == 0b1000001).
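As a rough illustration of the point above (a sketch, not part of the original answer), the snippet below takes one byte with the value 127 and renders it in several text forms. The variable names are made up for the example; only built-ins are used.

```python
# One byte, value 127, viewed through several text representations.
data = bytes([127])

as_repr = repr(data)              # the Python literal form, as print() would show it
as_int = data[0]                  # indexing a bytes object yields an int
as_bin = format(data[0], '08b')   # zero-padded base 2
as_hex = format(data[0], '02x')   # zero-padded base 16

print(as_repr, as_int, as_bin, as_hex)
print(len(data), len(as_repr))    # 1 byte of data vs. 7 characters of text
print(bytes([0x41]))              # printable ASCII range is shown as the character
```

The data itself has length 1 in every case; only the text used to display it changes.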
A 0x7f byte is not stored on disk as seven ASCII binary digits 1111111, nor as two ASCII hex digits 7F, nor as three literal decimal digits 127. b'\x7f' is a text representation of the byte that may be used to specify it in Python source code (you won't find the seven literal ASCII characters b'\x7f' on disk either).
This code writes a single byte to disk:
with open('output.bin', 'wb') as file:
    file.write(b'\x7f')
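One way to convince yourself that only a single byte lands on disk is to check the file size afterwards. The sketch below is an assumption-laden variant of the code above: it writes to a temporary directory (so it does not touch your working directory) and the variable names are hypothetical.

```python
import os
import tempfile

# Write a single byte to a file in a temporary directory, then inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.bin')
    with open(path, 'wb') as file:
        written = file.write(b'\x7f')   # write() returns the number of bytes written

    size_on_disk = os.path.getsize(path)  # file size in bytes
    with open(path, 'rb') as file:
        read_back = file.read()

print(written, size_on_disk, read_back == b'\x7f')
```

If the seven characters of the literal were what got stored, the size would be 7 rather than 1.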
"Some kind of characters must be used to represent the bytes, what are they?"
OS interfaces (the way you access hardware such as disks) are defined in terms of bytes, e.g., POSIX read(2). In other words, the byte is the fundamental unit here: you can read and write bytes directly -- you don't need any intermediate representation. Watch "Richard Feynman. Why."
How bytes are represented physically is between the OS drivers and the hardware -- it may be anything -- you don't need to worry about it: it is hidden behind the uniform OS interface. See "How is data physically written, read and stored inside hard drives?"
You could call os.read() directly in Python but you don't need to; file.read() does it for you (Python 3 file objects are implemented directly on top of the POSIX interface. Python 2 I/O uses the C stdio library, which in turn uses OS interfaces to implement its functionality).
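For the curious, here is a minimal sketch of the low-level route next to the usual one. The file name and contents are invented for the example (the two bytes happen to be 00010100 10000100 in binary), and the O_BINARY flag is looked up defensively because it only exists on Windows.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.bin')
    with open(path, 'wb') as file:
        file.write(b'\x14\x84')

    # Low-level route: a file descriptor plus os.read().
    flags = os.O_RDONLY | getattr(os, 'O_BINARY', 0)
    fd = os.open(path, flags)
    try:
        low_level = os.read(fd, 1024)   # reads up to 1024 bytes
    finally:
        os.close(fd)

    # High-level route: an ordinary binary file object.
    with open(path, 'rb') as file:
        high_level = file.read()

print(low_level == high_level, type(low_level))
```

Both routes hand back the same bytes object; the file object simply saves you the descriptor bookkeeping.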
"As you point out, it's up to the OS drivers and hardware to establish how bytes are written, but the Python interpreter would then be able to read them. So it's reading something - what is that? It's not reading magnetic orientation of particles on the disk, is it? It's reading something symbolic, and I want access to it."
It's reading bytes. A hard disk is a small computer in its own right, and therefore interesting things may happen, but that does not change the fact that it's bytes all the way down (as far as "symbolic" or software is concerned).
The book "CODE The Hidden Language of Computer Hardware and Software" provides a very gentle introduction into how information is represented in computers — the word "byte" is not defined until page 180. To see through abstraction levels used in computers, the course "From NAND to Tetris" can help.
Answer 2:
If you're fine with bytes:
with open('yourfile', 'rb') as fobj:
    raw_bytes = fobj.read()
print(raw_bytes)
If you really want binary:
with open('yourfile', 'rb') as fobj:
    raw_bytes = fobj.read()
# Format each byte as eight binary digits, separated by spaces.
print(' '.join(map('{:08b}'.format, raw_bytes)))
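For large files, reading everything at once may not be desirable. A possible variation (a sketch only; the function name iter_bit_strings and the chunk size are assumptions, not something from the answer) reads the stream in chunks and yields one bit string per byte. An in-memory io.BytesIO stands in for a real file here so the example runs anywhere.

```python
import io

def iter_bit_strings(stream, chunk_size=4096):
    """Yield an 8-character bit string for each byte read from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty bytes object means end of stream
            break
        for byte in chunk:     # iterating over bytes yields ints
            yield format(byte, '08b')

# Stand-in for open('yourfile', 'rb'); chunk_size=1 forces several reads.
demo_stream = io.BytesIO(b'\x14\x84')
bit_text = ' '.join(iter_bit_strings(demo_stream, chunk_size=1))
print(bit_text)
```

With a real file you would pass the object returned by open(..., 'rb') instead of the BytesIO.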
Answer 3:
Python 3 represents file data read in binary mode as bytes. The type is basically an immutable sequence of integers from 0 to 255, i.e. a sequence of bytes. It has some convenience methods (decoding to a string, for example), and it is presented much like a string when printed.
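A small sketch of what that description means in practice (the sample value and variable names are invented for illustration):

```python
data = b'AB\x7f'

first = data[0]               # indexing gives an int
head = data[:2]               # slicing gives another bytes object
values = list(data)           # the underlying integers
text = head.decode('ascii')   # convenience method: decode to str

# bytes objects cannot be modified in place.
try:
    data[0] = 0
    mutable = True
except TypeError:
    mutable = False

print(first, head, values, text, mutable)
```

If you need a modifiable equivalent, the built-in bytearray type offers the same interface plus in-place changes.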
To get the bit-by-bit representation, you should use the b mode when opening the file. bin() will help you convert the integers to a binary representation, but you might have to strip the first two characters (the 0b prefix) and pad with leading 0s to eight digits.
with open(filename, 'rb') as my_file:
    my_bytes = my_file.read()
# Drop the '0b' prefix and left-pad each value to eight binary digits.
bin_list = [bin(i)[2:].rjust(8, '0') for i in my_bytes]
print(' '.join(bin_list))
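To check that such a bit string really is a faithful view of the data, you can convert it back and compare. This is just a sketch with made-up sample bytes, using int(..., 2) to parse each group of eight digits:

```python
original = b'\x14\x84\x7f'

# bytes -> space-separated bit string
bit_text = ' '.join(format(byte, '08b') for byte in original)

# bit string -> bytes
restored = bytes(int(bits, 2) for bits in bit_text.split())

print(bit_text)
print(restored == original)
```

The round trip loses nothing, which suggests the bit string is simply another representation of the same bytes.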
Source: https://stackoverflow.com/questions/33145337/how-to-open-and-present-raw-binary-data-in-python