Glove6b50d parsing: could not convert string to float: '-'

Submitted by 左心房为你撑大大i on 2020-05-17 06:04:23


I am trying to parse the Glove6b50d data from Kaggle via Google Colab, then run it through the word2vec process (apologies for the huge URL - it's the fastest link I've found). However, I'm hitting a bug where '-' tokens are not parsed correctly, resulting in the above error.

I have attempted to handle this in a few ways. I've also looked into the load_word2vec_format method itself and tried to ignore errors, but it doesn't seem to make a difference. I've tried a map method on line two, following combinations of advice from these links: [a] and [b]. This hasn't fixed or changed the error message I receive (i.e. removing the map call leaves the error text unchanged).

gloveFile = pd.read_fwf("",
                    delimiter="\n\t\s+", header=None)

map(lambda gloveFile: gloveFile.replace(r'[^\x00-\x7F]+' , '-'), gloveFile[0])

numpy.savetxt(r'/usr/local/lib/python3.6/dist-packages/gensim/test/test_data/glove6b50d.txt', gloveFile.values, fmt="%s")

from gensim.models import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('glove6b50d.txt')

glove2word2vec(glove_file, "glove6b50d_word2vec.txt")

model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)

Per the comment below, the exact error I'm getting is as follows:

/usr/local/lib/python3.6/dist-packages/smart_open/ UserWarning: This function is deprecated, use instead. See the migration notes for details:
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
ValueError                                Traceback (most recent call last)
<ipython-input-132-6ad5a51f4fb3> in <module>()
      9 glove2word2vec(glove_file, "glove6b50d_word2vec.txt")
---> 11 model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)

2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/ in <listcomp>(.0)
    220                 if len(parts) != vector_size + 1:
    221                     raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
--> 222                 word, weights = parts[0], [datatype(x) for x in parts[1:]]
    223                 add_word(word, weights)
    224     if result.vectors.shape[0] != len(result.vocab):

ValueError: could not convert string to float: '-'

The system works fine using a text file containing only: "test -1.0 1.526 -2.55" or "- -1.0 1.526 -2.55". Additionally, searching the source text file (glove.6B.50d.txt) for occurrences of " - " comes up with no results. I'm on Windows, so I have done so by executing:

findstr /C:" - " glove.6B.50d.txt

Calling print(gloveFile) both before and after the map call produces the following output. Note that I've kept the mapping call in for completeness of my efforts, not for its effect.

0       the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.0...
1       , 0.013441 0.23682 -0.16899 0.40951 0.63812 0....
2       . 0.15164 0.30177 -0.16763 0.17684 0.31719 0.3...
3       of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.7...
4       to 0.68047 -0.039263 0.30186 -0.17792 0.42962 ...
...                                                   ...
399995  chanty 0.23204 0.025672 -0.70699 -0.045465 0.1...
399996  kronik -0.60921 -0.67218 0.23521 -0.11195 -0.4...
399997  rolonda -0.51181 0.058706 1.0913 -0.55163 -0.1...
399998  zsombor -0.75898 -0.47426 0.4737 0.7725 -0.780...
399999  andberger 0.072617 -0.51393 0.4728 -0.52202 -0...

If I print the first ten lines of the glove6b50d_word2vec.txt file, I get the following text, which matches the word2vec format. Additionally, if I count the occurrences of the string " - " in the document, I find none.

['400000 50\n', 'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581\n', ', 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392\n', '. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353 0.59868 0.28825 -0.11547 -0.041848 -0.67989 -0.25063 0.18472 0.086876 0.46582 0.015035 0.043474 -1.4671 -0.30384 -0.023441 0.30589 -0.21785 3.746 0.0042284 -0.18436 -0.46209 0.098329 -0.11907 0.23919 0.1161 0.41705 0.056763 -6.3681e-05 0.068987 0.087939 -0.10285 -0.13931 0.22314 -0.080803 -0.35652 0.016413 0.10216\n', 'of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375\n', 'to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 
-0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044\n', 'and 0.26818 0.14346 -0.27877 0.016257 0.11384 0.69923 -0.51332 -0.47368 -0.33075 -0.13834 0.2702 0.30938 -0.45012 -0.4127 -0.09932 0.038085 0.029749 0.10076 -0.25058 -0.51818 0.34558 0.44922 0.48791 -0.080866 -0.10121 -1.3777 -0.10866 -0.23201 0.012839 -0.46508 3.8463 0.31362 0.13643 -0.52244 0.3302 0.33707 -0.35601 0.32431 0.12041 0.3512 -0.069043 0.36885 0.25168 -0.24517 0.25381 0.1367 -0.31178 -0.6321 -0.25028 -0.38097\n', 'in 0.33042 0.24995 -0.60874 0.10923 0.036372 0.151 -0.55083 -0.074239 -0.092307 -0.32821 0.09598 -0.82269 -0.36717 -0.67009 0.42909 0.016496 -0.23573 0.12864 -1.0953 0.43334 0.57067 -0.1036 0.20422 0.078308 -0.42795 -1.7984 -0.27865 0.11954 -0.12689 0.031744 3.8631 -0.17786 -0.082434 -0.62698 0.26497 -0.057185 -0.073521 0.46103 0.30862 0.12498 -0.48609 -0.0080272 0.031184 -0.36576 -0.42699 0.42164 -0.11666 -0.50703 -0.027273 -0.53285\n', 'a 0.21705 0.46515 -0.46757 0.10082 1.0135 0.74845 -0.53104 -0.26256 0.16812 0.13182 -0.24909 -0.44185 -0.21739 0.51004 0.13448 -0.43141 -0.03123 0.20674 -0.78138 -0.20148 -0.097401 0.16088 -0.61836 -0.18504 -0.12461 -2.2526 -0.22321 0.5043 0.32257 0.15313 3.9636 -0.71365 -0.67012 0.28388 0.21738 0.14433 0.25926 0.23434 0.4274 -0.44451 0.13813 0.36973 -0.64289 0.024142 -0.039315 -0.26037 0.12017 -0.043782 0.41013 0.1796\n', '" 0.25769 0.45629 -0.76974 -0.37679 0.59272 -0.063527 0.20545 -0.57385 -0.29009 -0.13662 0.32728 1.4719 -0.73681 -0.12036 0.71354 -0.46098 0.65248 0.48887 -0.51558 0.039951 -0.34307 -0.014087 0.86488 0.3546 0.7999 -1.4995 -1.8153 0.41128 0.23921 -0.43139 3.6623 -0.79834 -0.54538 0.16943 -0.82017 -0.3461 0.69495 -1.2256 -0.17992 -0.057474 0.030498 -0.39543 -0.38515 -1.0002 0.087599 -0.31009 -0.34677 -0.31438 0.75004 0.97065\n']

My search methods have evidently been ineffective thus far. I would really appreciate some help.


I can't reproduce the problem running the following code (on a Linux machine, Python 3.6):

In [1]: from gensim.models import KeyedVectors 

In [2]: from gensim.scripts.glove2word2vec import glove2word2vec 

In [3]: glove2word2vec('glove.6B.50d.txt', 'glove.68.50d.w2v.txt')                        
Out[3]: (400000, 50)

In [4]: model = KeyedVectors.load_word2vec_format('glove.68.50d.w2v.txt')                                        

In [5]: len(model)                                                                                               
Out[5]: 400000

In [6]: model['the']                                                                                       

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],

Do these exact lines trigger the exact same error as originally reported for you? (If you still get an error, but the error is even the slightest bit different, can you add the updated error to your question?)
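If the error does persist, a scan along the following lines might help locate the offending row, since a search for " - " only catches a dash with a space on both sides. This is a minimal sketch, not part of gensim: `find_bad_lines` is a hypothetical helper, and it assumes a space-separated text file with one word followed by `vector_size` numbers per line, plus an optional word2vec-style header line.

```python
def find_bad_lines(path, vector_size=50, has_header=True,
                   encoding="utf-8", limit=10):
    """Return (line_no, reason) pairs for rows that are not
    'word' followed by exactly vector_size float-parsable tokens.

    Hypothetical diagnostic helper; the splitting on single spaces
    is an assumption about how the loader tokenizes each line.
    """
    bad = []
    with open(path, encoding=encoding, errors="replace") as handle:
        if has_header:
            next(handle, None)  # skip the "<vocab_size> <vector_size>" line
        first_data_line = 2 if has_header else 1
        for line_no, line in enumerate(handle, start=first_data_line):
            parts = line.rstrip().split(" ")
            if len(parts) != vector_size + 1:
                bad.append((line_no, "expected %d fields, got %d"
                            % (vector_size + 1, len(parts))))
            else:
                for token in parts[1:]:
                    try:
                        float(token)
                    except ValueError:
                        bad.append((line_no, "non-numeric value %r" % token))
                        break
            if len(bad) >= limit:
                break  # stop early; a few examples are usually enough
    return bad
```

Running it on both the file written by numpy.savetxt and the converted glove6b50d_word2vec.txt (with `has_header=False` for the former) should show at which step a stray '-' first appears.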

If you're still having a problem, my best guess is some Windows-specific default-encoding mangling during one of the steps, or that the file was opened and re-saved in some other editor.
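One way to test the encoding guess might be to skip the pandas/numpy round trip from the question and build the word2vec-format text file straight from the original download, with the encoding and line endings stated explicitly. The sketch below uses only the standard library; `glove_to_word2vec_text` is a hypothetical helper (not a gensim function), and it assumes the GloVe file is UTF-8 and space-separated.

```python
def glove_to_word2vec_text(src_path, dst_path, encoding="utf-8"):
    """Copy a GloVe text file to dst_path, prepending the
    '<vocab_size> <vector_size>' header that the word2vec text format uses.

    Hypothetical helper; assumes one 'word v1 v2 ...' entry per line.
    """
    count = 0
    dims = None
    # First pass: count rows and infer the dimensionality from row one.
    with open(src_path, encoding=encoding) as src:
        for line in src:
            if dims is None:
                dims = len(line.rstrip("\n").split(" ")) - 1
            count += 1
    if dims is None:
        raise ValueError("source file is empty: %s" % src_path)

    # Second pass: write header plus the untouched rows.
    # newline="\n" stops Windows from rewriting line endings as \r\n.
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding, newline="\n") as dst:
        dst.write("%d %d\n" % (count, dims))
        for line in src:
            dst.write(line)
    return count, dims
```

The resulting file could then be passed to KeyedVectors.load_word2vec_format(dst_path, binary=False) as in the question. If that loads cleanly while the original pipeline does not, the mangling most likely happens in the read_fwf or savetxt step.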

