How to fix: “UnicodeDecodeError: 'ascii' codec can't decode byte”

谎友^ 2020-11-22 01:21
as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
  File "/usr/local/bin/wok", line 4, in

19 answers
  •  灰色年华
    2020-11-22 02:08

    Here is my solution: pass the encoding explicitly when you open the file, e.g. with open(file, encoding='utf8') as f:

    Because reading the GloVe file takes a long time, I also recommend saving the parsed vectors to a numpy file. The next time you need the embedding weights, loading that file will save you time.
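
    To see why the explicit encoding matters, here is a minimal sketch. The sample text and the temporary file are made up for illustration; forcing the ASCII codec on UTF-8 bytes reproduces the error from the title, while naming the real encoding decodes the same bytes without trouble.

```python
import os
import tempfile

# Hypothetical sample text; any non-ASCII content behaves the same way.
text = "中文测试 GloVe"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

    # Forcing the ASCII codec on UTF-8 bytes raises UnicodeDecodeError.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Naming the file's real encoding decodes the same bytes cleanly.
    with open(path, encoding="utf8") as f:
        decoded = f.read()

print(ascii_failed, decoded == text)
```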

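    import numpy as np
    from tqdm import tqdm


    def load_glove(file):
        """Load GloVe vectors into a dict of numpy arrays.

        Args:
            file (str): path to a GloVe text file with 300-dimensional vectors.
        Returns:
            dict: maps each word to a float32 numpy array of length 300.
        """
        embeddings_index = {}
        # Name the encoding explicitly instead of relying on the platform default.
        with open(file, encoding='utf8') as f:
            for line in tqdm(f):
                values = line.split()
                # The last 300 fields are the vector; whatever precedes them is
                # the token, which may itself contain spaces, so rejoin with ' '.
                word = ' '.join(values[:-300])
                coefs = np.asarray(values[-300:], dtype='float32')
                embeddings_index[word] = coefs

        return embeddings_index

    # EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
    EMBEDDING_PATH = 'glove.840B.300d.txt'
    embeddings = load_glove(EMBEDDING_PATH)

    # np.save stores the dict as a pickled 0-d object array.
    np.save('glove_embeddings.npy', embeddings)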
    

    Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
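
    Reading the saved file back is not shown in the answer. A minimal sketch, assuming the dict was written with np.save as in the code in this answer (the two-word dict here is a made-up stand-in for the real embeddings): because np.save pickles the dict inside a 0-d object array, recent NumPy versions need allow_pickle=True on load, and .item() unwraps the array back into a dict.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for the real embeddings dict (hypothetical values).
embeddings = {
    "hello": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "world": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "glove_embeddings.npy")
    np.save(path, embeddings)  # the dict is pickled inside a 0-d object array

    # allow_pickle=True is required for object arrays in recent NumPy versions;
    # .item() unwraps the 0-d array back into the original dict.
    loaded = np.load(path, allow_pickle=True).item()

print(type(loaded).__name__, sorted(loaded))
```

    Only load pickled .npy files that you created yourself or otherwise trust, since unpickling can execute arbitrary code.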
