I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the encoding of a text file?
If you know some of the content of the file, you can try to decode it with several encodings and see which one fits. In general there is no sure way, since a text file is just a text file and those are stupid ;)
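For example, a minimal sketch of that brute-force idea (the candidate list here is just an assumption; adjust it to the encodings you expect):

def guess_by_trial(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    # Return the first candidate that decodes without error.
    # Note: 'latin-1' maps every byte to a character and never fails,
    # so keep it last as a catch-all.
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

data = open('unknown-file', 'rb').read()
print(guess_by_trial(data))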
Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.
The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file by doing:
import magic

# Read the raw bytes and ask libmagic for the MIME encoding
blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8", "us-ascii", etc.
There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:
import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)  # ask for the MIME encoding rather than the file type
encoding = m.from_buffer(blob)
Depending on your platform, you can simply shell out to the Linux file command; I just opt for that, since I am using it in a script that exclusively runs on one of our Linux machines.
Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.
import subprocess

def file_says_utf8(path):
    # 'file' prints something like "test.txt: UTF-8 Unicode text"
    output = subprocess.check_output(['file', path]).decode('utf-8')
    # The file-type description follows the "filename: " prefix
    file_type = output.split(': ', 1)[1]
    return file_type.startswith('UTF-8')

print(file_says_utf8('test.txt'))
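If your file build supports them (GNU file does), the -b and --mime-encoding flags print just the charset, which avoids the string slicing above:

import subprocess

def detect_encoding(path):
    # -b suppresses the "filename:" prefix; --mime-encoding prints only the charset
    out = subprocess.check_output(['file', '-b', '--mime-encoding', path])
    return out.decode('ascii').strip()

print(detect_encoding('test.txt'))  # e.g. "utf-8" or "us-ascii"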
This might be helpful:

from bs4 import UnicodeDammit

with open('automate_data/billboard.csv', 'rb') as file:
    content = file.read()

suggestion = UnicodeDammit(content)
print(suggestion.original_encoding)
# 'iso-8859-1'
Correctly detecting the encoding every time is impossible.
(From the chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.
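For example, a minimal sketch of typical chardet usage ('unknown-file' is just a placeholder):

import chardet

raw = open('unknown-file', 'rb').read()
result = chardet.detect(raw)
# detect() returns a dict with the best guess and a confidence score,
# e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
print(result['encoding'], result['confidence'])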
You can also use UnicodeDammit. It will try the following methods, in order: an encoding declared in the document itself (for instance, in an XML declaration or an http-equiv META tag), an encoding sniffed from the first few bytes of the file, an encoding sniffed by the chardet library (if it is installed), UTF-8, and finally Windows-1252.
This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example; I'm on OS X.
#!/usr/bin/python
import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        # Reject surrogate code points, which are not valid in UTF-8 text
        # (Python 2's decoder let them through; Python 3's strict decoder
        # already refuses them, so this loop is a belt-and-braces check there)
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
    return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)
PS /Users/js> ./isutf8.py hi.txt
True
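The BOM-sniffing part of that page amounts to comparing the first bytes of the file against the known byte-order marks. Here's a minimal sketch of that idea using the constants from the standard codecs module (sniff_bom is my own name, not from the page):

import codecs

def sniff_bom(data):
    # Check UTF-32-LE before UTF-16-LE: its BOM (FF FE 00 00)
    # starts with the UTF-16-LE BOM (FF FE)
    boms = [(codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be')]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None  # no BOM; fall back to validation like isUTF8 above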