Question
I am trying to parse the GloVe 6B 50d data from Kaggle via Google Colab, then run it through the word2vec process (apologies for the huge URL, it's the fastest link I've found). However, I'm hitting a bug where '-' tokens are not parsed correctly, resulting in the error shown below.
I have attempted to handle this in a few ways. I've looked into the load_word2vec_format method itself and tried to ignore errors (sketched after the code below), but it doesn't seem to make a difference. I've also tried the map call in the second statement of the code below, following combinations of advice from these links: [a] and [b]. This hasn't fixed or changed the error message received (i.e. removing it changes nothing).
import pandas as pd
import numpy

gloveFile = pd.read_fwf("https://storage.googleapis.com/kagglesdsdata/datasets/652874/1154868/glove.6B.50d.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1589683535&Signature=kaS%2FTkSmvp7lhqwLJ%2B1lyuvP76PcDpwK1dnsCZEO0AiVXqQm7jsBc1r5g9af%2BuVkOSvMgqUDXYL4O%2BN43pnL5RLs7ns%2B3w%2BEtCYDTfJz6q1O0zfPz4%2BTcD3GV7UAGgVjVNIvncC9fHWcd2YuKwiZaTvKL%2BGRnMkf9b%2BYnOweYeXEeA1sX005krj%2FLMBbVTXmDTwOtN4HwVNb3%2BrbezkWkoEC6sxLPnGcsEKaBe%2Biv%2FuVSQG5FsQlwvRgsSU%2FMgk0c4bi%2FHxF04lrQW0E0s767TIXwHeodRHYpk5KQeKmyd91uKD2Zb8v8xQcf2%2BkmSNGQHbX0mDz8HBwYEmOdV7aMQ%3D%3D&response-content-disposition=attachment%3B+filename%3Dglove.6B.50d.txt",
delimiter="\n\t\s+", header=None)
map(lambda gloveFile: gloveFile.replace(r'[^\x00-\x7F]+' , '-'), gloveFile[0])
numpy.savetxt(r'/usr/local/lib/python3.6/dist-packages/gensim/test/test_data/glove6b50d.txt', gloveFile.values, fmt="%s")
from gensim.models import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
glove_file = datapath('glove6b50d.txt')
glove2word2vec(glove_file, "glove6b50d_word2vec.txt")
model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)
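For reference, the "ignore errors" attempt mentioned above looked roughly like this (just a sketch of what I tried, relying on the encoding and unicode_errors keyword arguments that load_word2vec_format accepts):
# Sketch of the "ignore errors" attempt: relax the decoding behaviour
# instead of the default strict handling.
model = KeyedVectors.load_word2vec_format(
    "glove6b50d_word2vec.txt",
    binary=False,
    encoding="utf-8",          # be explicit about the encoding
    unicode_errors="ignore",   # drop undecodable bytes rather than raising
)
This still raised the same ValueError, which in hindsight makes sense: the traceback shows the failure happens in the float(x) conversion, not in text decoding.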
As requested in the comments, the exact error I'm getting is as follows:
/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:253: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-132-6ad5a51f4fb3> in <module>()
9 glove2word2vec(glove_file, "glove6b50d_word2vec.txt")
10
---> 11 model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)
12
2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <listcomp>(.0)
220 if len(parts) != vector_size + 1:
221 raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
--> 222 word, weights = parts[0], [datatype(x) for x in parts[1:]]
223 add_word(word, weights)
224 if result.vectors.shape[0] != len(result.vocab):
ValueError: could not convert string to float: '-'
The system works fine using a text file containing only "test -1.0 1.526 -2.55" or "- -1.0 1.526 -2.55". Additionally, searching the source text file (glove.6B.50d.txt) for occurrences of " - " comes up with no results. I'm on Windows, so I searched by executing:
findstr /C:" - " glove.6B.50d.txt
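For a more direct check than string searching, I could also scan the converted file in Python and report any line whose fields don't parse (a sketch, assuming the converted file has a word2vec header followed by one token plus 50 floats per line):
# Hypothetical diagnostic: flag lines in the converted file whose field
# count is wrong or whose vector entries cannot be parsed as floats.
with open("glove6b50d_word2vec.txt", encoding="utf-8", errors="replace") as f:
    header = f.readline()                    # e.g. "400000 50"
    expected = int(header.split()[1]) + 1    # word plus 50 weights
    for line_no, line in enumerate(f, start=2):
        parts = line.rstrip("\n").split(" ")
        if len(parts) != expected:
            print(line_no, "unexpected field count:", len(parts))
            continue
        try:
            [float(x) for x in parts[1:]]
        except ValueError as err:
            print(line_no, err, repr(line[:60]))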
Calling print(gloveFile) both before and after the map call produces the following output. Note that I've kept the map call in for completeness of my efforts, not for its effect.
0 the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.0...
1 , 0.013441 0.23682 -0.16899 0.40951 0.63812 0....
2 . 0.15164 0.30177 -0.16763 0.17684 0.31719 0.3...
3 of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.7...
4 to 0.68047 -0.039263 0.30186 -0.17792 0.42962 ...
... ...
399995 chanty 0.23204 0.025672 -0.70699 -0.045465 0.1...
399996 kronik -0.60921 -0.67218 0.23521 -0.11195 -0.4...
399997 rolonda -0.51181 0.058706 1.0913 -0.55163 -0.1...
399998 zsombor -0.75898 -0.47426 0.4737 0.7725 -0.780...
399999 andberger 0.072617 -0.51393 0.4728 -0.52202 -0...
If I print the first ten lines of the glove6b50d_word2vec.txt file, I get the following text, which matches the word2vec format. Additionally, if I count the occurrences of the string " - " in the document, I find none.
['400000 50\n', 'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581\n', ', 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392\n', '. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353 0.59868 0.28825 -0.11547 -0.041848 -0.67989 -0.25063 0.18472 0.086876 0.46582 0.015035 0.043474 -1.4671 -0.30384 -0.023441 0.30589 -0.21785 3.746 0.0042284 -0.18436 -0.46209 0.098329 -0.11907 0.23919 0.1161 0.41705 0.056763 -6.3681e-05 0.068987 0.087939 -0.10285 -0.13931 0.22314 -0.080803 -0.35652 0.016413 0.10216\n', 'of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375\n', 'to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044\n', 'and 0.26818 0.14346 -0.27877 0.016257 0.11384 0.69923 -0.51332 -0.47368 -0.33075 -0.13834 0.2702 0.30938 -0.45012 -0.4127 -0.09932 0.038085 0.029749 0.10076 -0.25058 -0.51818 0.34558 0.44922 0.48791 -0.080866 -0.10121 -1.3777 -0.10866 -0.23201 0.012839 -0.46508 3.8463 0.31362 0.13643 -0.52244 0.3302 0.33707 -0.35601 0.32431 0.12041 0.3512 -0.069043 0.36885 0.25168 -0.24517 0.25381 0.1367 -0.31178 -0.6321 -0.25028 -0.38097\n', 'in 0.33042 0.24995 -0.60874 0.10923 0.036372 0.151 -0.55083 -0.074239 -0.092307 -0.32821 0.09598 -0.82269 -0.36717 -0.67009 0.42909 0.016496 -0.23573 0.12864 -1.0953 0.43334 0.57067 -0.1036 0.20422 0.078308 -0.42795 -1.7984 -0.27865 0.11954 -0.12689 0.031744 3.8631 -0.17786 -0.082434 -0.62698 0.26497 -0.057185 -0.073521 0.46103 0.30862 0.12498 -0.48609 -0.0080272 0.031184 -0.36576 -0.42699 0.42164 -0.11666 -0.50703 -0.027273 -0.53285\n', 'a 0.21705 0.46515 -0.46757 0.10082 1.0135 0.74845 -0.53104 -0.26256 0.16812 0.13182 -0.24909 -0.44185 -0.21739 0.51004 0.13448 -0.43141 -0.03123 0.20674 -0.78138 -0.20148 -0.097401 0.16088 -0.61836 -0.18504 -0.12461 -2.2526 -0.22321 0.5043 0.32257 0.15313 3.9636 -0.71365 -0.67012 0.28388 0.21738 0.14433 0.25926 0.23434 0.4274 -0.44451 0.13813 0.36973 -0.64289 0.024142 -0.039315 -0.26037 0.12017 -0.043782 0.41013 0.1796\n', '" 0.25769 0.45629 
-0.76974 -0.37679 0.59272 -0.063527 0.20545 -0.57385 -0.29009 -0.13662 0.32728 1.4719 -0.73681 -0.12036 0.71354 -0.46098 0.65248 0.48887 -0.51558 0.039951 -0.34307 -0.014087 0.86488 0.3546 0.7999 -1.4995 -1.8153 0.41128 0.23921 -0.43139 3.6623 -0.79834 -0.54538 0.16943 -0.82017 -0.3461 0.69495 -1.2256 -0.17992 -0.057474 0.030498 -0.39543 -0.38515 -1.0002 0.087599 -0.31009 -0.34677 -0.31438 0.75004 0.97065\n']
My search methods have evidently been ineffective thus far. I would really appreciate some help.
Answer 1:
I can't reproduce the problem when running the following code (on a Linux machine, Python 3.6):
In [1]: from gensim.models import KeyedVectors
In [2]: from gensim.scripts.glove2word2vec import glove2word2vec
In [3]: glove2word2vec('glove.6B.50d.txt', 'glove.6B.50d.w2v.txt')
Out[3]: (400000, 50)
In [4]: model = KeyedVectors.load_word2vec_format('glove.6B.50d.w2v.txt')
In [5]: len(model)
Out[5]: 400000
In [6]: model['the']
Out[6]:
array([ 4.1800e-01, 2.4968e-01, -4.1242e-01, 1.2170e-01, 3.4527e-01,
-4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
2.7843e-01, -1.4767e-01, -5.5677e-01, 1.4658e-01, -9.5095e-03,
1.1658e-02, 1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
-1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
-1.8823e+00, -7.6746e-01, 9.9051e-02, -4.2125e-01, -1.9526e-01,
4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01, 5.9213e-04,
7.4449e-03, 1.7778e-01, -1.5897e-01, 1.2041e-02, -5.4223e-02,
-2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
1.8785e-01, 2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
dtype=float32)
Do these exact lines trigger the exact same error as originally reported for you? (If you still get an error, but the error is even the slightest bit different, can you add the updated error to your question?)
My best guess, if you're still having a problem, is some Windows-specific default-encoding mangling during one of the steps, or that the file was opened and re-saved in some other editor along the way.
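If encoding is the culprit, one thing to try (just a sketch, assuming the raw glove.6B.50d.txt is already on disk) is to take the pandas/numpy round trip out of the loop entirely and rewrite the file with explicit UTF-8 handling before converting it:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Minimal sketch: copy the raw GloVe file with explicit UTF-8 handling,
# so no DataFrame/savetxt round-trip can alter the encoding or line endings.
with open("glove.6B.50d.txt", encoding="utf-8") as src, \
     open("glove6b50d_clean.txt", "w", encoding="utf-8", newline="\n") as dst:
    for line in src:
        dst.write(line)

glove2word2vec("glove6b50d_clean.txt", "glove6b50d_word2vec.txt")
model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)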
Source: https://stackoverflow.com/questions/61789168/glove6b50d-parsing-could-not-convert-string-to-float