as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File \"/usr/local/bin/wok\", line 4, in
Got a same error and this solved my error. Thanks! python 2 and python 3 differing in unicode handling is making pickled files quite incompatible to load. So Use python pickle's encoding argument. Link below helped me solve the similar problem when I was trying to open pickled data from my python 3.7, while my file was saved originally in python 2.x version. https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/ I copy the load_pickle function in my script and called the load_pickle(pickle_file) while loading my input_data like this:
input_data = load_pickle("my_dataset.pkl")
The load_pickle function is here:
def load_pickle(pickle_file):
try:
with open(pickle_file, 'rb') as f:
pickle_data = pickle.load(f)
except UnicodeDecodeError as e:
with open(pickle_file, 'rb') as f:
pickle_data = pickle.load(f, encoding='latin1')
except Exception as e:
print('Unable to load data ', pickle_file, ':', e)
raise
return pickle_data
This worked for me:
file = open('docs/my_messy_doc.pdf', 'rb')
This error occurs when there are some non ASCII characters in our string and we are performing any operations on that string without proper decoding. This helped me solve my problem. I am reading a CSV file with columns ID,Text and decoding characters in it as below:
train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
print("ID :" + i[0])
text = i[1].decode("utf-8",errors="ignore").strip().lower()
print("Text: " + text)
In order to resolve this on an operating system level in an Ubuntu installation check the following:
$ locale charmap
If you get
locale: Cannot set LC_CTYPE to default locale: No such file or directory
instead of
UTF-8
then set LC_CTYPE
and LC_ALL
like this:
$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
Here is my solution, just add the encoding.
with open(file, encoding='utf8') as f
And because reading glove file will take a long time, I recommend to the glove file to a numpy file. When netx time you read the embedding weights, it will save your time.
import numpy as np
from tqdm import tqdm
def load_glove(file):
"""Loads GloVe vectors in numpy array.
Args:
file (str): a path to a glove file.
Return:
dict: a dict of numpy arrays.
"""
embeddings_index = {}
with open(file, encoding='utf8') as f:
for i, line in tqdm(enumerate(f)):
values = line.split()
word = ''.join(values[:-300])
coefs = np.asarray(values[-300:], dtype='float32')
embeddings_index[word] = coefs
return embeddings_index
# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)
np.save('glove_embeddings.npy', embeddings)
Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
In some cases, when you check your default encoding (print sys.getdefaultencoding()
), it returns that you are using ASCII. If you change to UTF-8, it doesn't work, depending on the content of your variable.
I found another way:
import sys
reload(sys)
sys.setdefaultencoding('Cp1252')