I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? How can I detect the enc
Here is an example that reads a file and takes chardet's encoding prediction at face value, reading only the first n_lines lines in case the file is large. chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned along with the prediction from chardet.detect(), so you could work that in somehow if you like.
def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
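
If you do want to work the confidence in, here is a minimal sketch of one way to do it. The function name predict_encoding_with_confidence, the min_confidence and fallback parameters, and the optional detect argument are all my own illustrative choices, not part of chardet. The 0.5 cutoff is arbitrary. The sketch relies only on chardet.detect() returning a dict with 'encoding' and 'confidence' keys.

```python
def predict_encoding_with_confidence(file_path, n_lines=20,
                                     min_confidence=0.5, fallback='utf-8',
                                     detect=None):
    '''Return (encoding, confidence), using a fallback for weak guesses.

    min_confidence, fallback and detect are hypothetical knobs for this
    sketch; the 0.5 threshold is an arbitrary assumption.
    '''
    if detect is None:
        # Import lazily so a different detector can be passed in instead
        import chardet
        detect = chardet.detect

    # Read only the first n_lines lines, as raw bytes
    with open(file_path, 'rb') as f:
        rawdata = b''.join(f.readline() for _ in range(n_lines))

    result = detect(rawdata)
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0

    # No guess at all (e.g. empty input) or a weak guess: use the fallback
    if encoding is None or confidence < min_confidence:
        return fallback, confidence
    return encoding, confidence
```

You could then open the file with the result, e.g. encoding, _ = predict_encoding_with_confidence(path) followed by open(path, encoding=encoding). Keep in mind the prediction is based only on the first n_lines lines, so bytes further into a large file may still fail to decode.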