nlp

Spacy nlp = spacy.load("en_core_web_lg")

℡╲_俬逩灬. Submitted on 2021-01-21 03:48:07
Question: I already have spaCy downloaded, but every time I try the nlp = spacy.load("en_core_web_lg") command, I get this error: OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory. I already tried >>> import spacy >>> nlp = spacy.load("en_core_web_sm") and this does not work like it would on my personal computer. My question is: how do I work around this? What directory specifically do I need to drop the spacy
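One common workaround is to install the model package into the same environment the interpreter is running in, then load it. A minimal sketch, assuming network access and that pip/spacy point at this same Python environment:

```python
# Sketch of one workaround: fall back to downloading the model package on the
# first failed load. Assumes network access and that `spacy` and the model
# are installed into the same environment as this interpreter.
import spacy
from spacy.cli import download

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # Same effect as running: python -m spacy download en_core_web_lg
    download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")

doc = nlp("spaCy can now tokenize and tag this sentence.")
print([(token.text, token.pos_) for token in doc])
```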

What does the vector of a word in word2vec represent?

南笙酒味 Submitted on 2021-01-20 14:17:50
Question: word2vec is an open source tool by Google: for each word it provides a vector of float values. What exactly do they represent? There is also a paper on paragraph vectors; can anyone explain how they use word2vec in order to obtain a fixed-length vector for a paragraph? Answer 1: TL;DR: Word2Vec builds word projections (embeddings) in a latent space of N dimensions (N being the size of the word vectors obtained). The float values represent the coordinates of the words in this N
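To illustrate the answer, here is a toy gensim sketch; the corpus and hyperparameters are invented for the example, and in gensim 3.x the `vector_size` argument is named `size`:

```python
# Toy sketch with gensim: the learned vector for a word is just its coordinates
# in the N-dimensional latent space (here N=50).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "drank", "the", "milk"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

vec = model.wv["cat"]
print(vec.shape)                              # (50,) -> 50 float coordinates
print(model.wv.most_similar("cat", topn=2))   # neighbours in that latent space
```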

Fine-tune BERT for a specific domain (unsupervised)

孤人 Submitted on 2021-01-20 08:39:56
Question: I want to fine-tune BERT on texts that are related to a specific domain (in my case, related to engineering). The training should be unsupervised since I don't have any labels or anything. Is this possible? Answer 1: What you in fact want to do is continue pre-training BERT on text from your specific domain. What you do in this case is continue training the model as a masked language model, but on your domain-specific data. You can use the run_mlm.py script from Huggingface's Transformers. Source:
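For illustration, a rough in-Python sketch of continued masked-language-model pre-training, similar in spirit to what run_mlm.py does; the corpus file name and hyperparameters are placeholders, not values from the answer:

```python
# Continued MLM pre-training on domain text (no labels needed: the collator
# masks tokens on the fly). File path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One plain-text file with one document (or sentence) per line.
raw = load_dataset("text", data_files={"train": "engineering_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-engineering",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()
```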

How to use BERT for long text classification?

…衆ロ難τιáo~ Submitted on 2021-01-14 04:14:19
Question: We know that BERT has a max length limit of 512 tokens, so if an article is much longer than 512, such as 10,000 tokens of text, how can BERT be used? Answer 1: You have basically three options: You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient. You can split your text into multiple subtexts, classify each of them and
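A hedged sketch of the second option (splitting and aggregating): the whitespace chunking and the simple averaging of probabilities below are illustrative choices, not prescribed by the answer, and this checkpoint's classification head is randomly initialized, so it would need fine-tuning before the predictions mean anything:

```python
# Split a long text into pieces that fit the 512-token limit, classify each
# piece, and average the class probabilities across pieces.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def classify_long_text(text, words_per_chunk=300):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    probs = []
    for chunk in chunks:
        inputs = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs.append(model(**inputs).logits.softmax(dim=-1))
    return torch.cat(probs).mean(dim=0)   # averaged class probabilities
```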

After training word embeddings with gensim's FastText wrapper, how to embed new sentences?

浪子不回头ぞ Submitted on 2021-01-07 03:56:25
Question: After reading the tutorial in gensim's docs, I do not understand what the correct way is to generate new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus
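One common way to embed a new sentence with such a trained model is to average its per-word vectors; this is an illustrative choice, not the only one, and unseen words still get vectors because FastText composes them from character n-grams:

```python
# Assumed sketch: sentence embedding as the mean of FastText word vectors.
# Works for out-of-vocabulary words thanks to subword n-grams.
import numpy as np

def sentence_vector(model, sentence):
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(model.wv.vector_size)
    return np.mean([model.wv[token] for token in tokens], axis=0)

new_vec = sentence_vector(model_gensim, "an entirely new sentence with unseen words")
print(new_vec.shape)   # (100,) given size=100 above
```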