nlp

Spacy nlp = spacy.load("en_core_web_lg")

℡╲_俬逩灬. Submitted on 2021-01-21 03:48:07
Question: I already have spaCy downloaded, but every time I try the nlp = spacy.load("en_core_web_lg") command, I get this error: OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory. I already tried >>> import spacy >>> nlp = spacy.load("en_core_web_sm") and this does not work like it would on my personal computer. My question is: how do I work around this? What directory specifically do I need to drop the spacy
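One common workaround is to install the model package into the same environment the interpreter is running in, then load it. A minimal sketch, assuming network access and that pip/spacy point at this same Python environment:

```python
# Sketch of one workaround: fall back to downloading the model package on the
# first failed load. Assumes network access and that `spacy` and the model
# are installed into the same environment as this interpreter.
import spacy
from spacy.cli import download

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    # Same effect as running: python -m spacy download en_core_web_lg
    download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")

doc = nlp("spaCy can now tokenize and tag this sentence.")
print([(token.text, token.pos_) for token in doc])
```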

What does the vector of a word in word2vec represent?

南笙酒味 Submitted on 2021-01-20 14:17:50
Question: word2vec is an open source tool by Google: for each word it provides a vector of float values. What exactly do they represent? There is also a paper on paragraph vectors; can anyone explain how they use word2vec in order to obtain a fixed-length vector for a paragraph? Answer 1: TL;DR: Word2Vec builds word projections (embeddings) in a latent space of N dimensions (N being the size of the word vectors obtained). The float values represent the coordinates of the words in this N
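To illustrate the answer, here is a toy gensim sketch; the corpus and hyperparameters are invented for the example, and in gensim 3.x the `vector_size` argument is named `size`:

```python
# Toy sketch with gensim: the learned vector for a word is just its coordinates
# in the N-dimensional latent space (here N=50).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "drank", "the", "milk"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

vec = model.wv["cat"]
print(vec.shape)                              # (50,) -> 50 float coordinates
print(model.wv.most_similar("cat", topn=2))   # neighbours in that latent space
```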

Fine-tune BERT for a specific domain (unsupervised)

孤人 Submitted on 2021-01-20 08:39:56
Question: I want to fine-tune BERT on texts that are related to a specific domain (in my case, related to engineering). The training should be unsupervised since I don't have any labels or anything. Is this possible? Answer 1: What you in fact want to do is continue pre-training BERT on text from your specific domain. What you do in this case is continue training the model as a masked language model, but on your domain-specific data. You can use the run_mlm.py script from Huggingface's Transformers. Source:
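For illustration, a rough in-Python sketch of continued masked-language-model pre-training, similar in spirit to what run_mlm.py does; the corpus file name and hyperparameters are placeholders, not values from the answer:

```python
# Continued MLM pre-training on domain text (no labels needed: the collator
# masks tokens on the fly). File path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One plain-text file with one document (or sentence) per line.
raw = load_dataset("text", data_files={"train": "engineering_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-engineering",
                         num_train_epochs=3,
                         per_device_train_batch_size=8)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()
```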

How to use BERT for long text classification?

…衆ロ難τιáo~ Submitted on 2021-01-14 04:14:19
Question: We know that BERT has a max length limit of 512 tokens, so if an article is much longer than 512, such as 10,000 tokens of text, how can BERT be used? Answer 1: You have basically three options: You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient. You can split your text into multiple subtexts, classify each of them and
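A hedged sketch of the second option (splitting and aggregating): the whitespace chunking and the simple averaging of probabilities below are illustrative choices, not prescribed by the answer, and this checkpoint's classification head is randomly initialized, so it would need fine-tuning before the predictions mean anything:

```python
# Split a long text into pieces that fit the 512-token limit, classify each
# piece, and average the class probabilities across pieces.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def classify_long_text(text, words_per_chunk=300):
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    probs = []
    for chunk in chunks:
        inputs = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            probs.append(model(**inputs).logits.softmax(dim=-1))
    return torch.cat(probs).mean(dim=0)   # averaged class probabilities
```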

After training word embeddings with gensim's FastText wrapper, how to embed new sentences?

浪子不回头ぞ Submitted on 2021-01-07 03:56:25
Question: After reading the tutorial in gensim's docs, I do not understand what the correct way is to generate new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:
from gensim.models.fasttext import FastText as FT_gensim
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)
# train the model
model_gensim.train(
    corpus_file=corpus_file,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus
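One common way to embed a new sentence with such a trained model is to average its per-word vectors; this is an illustrative choice, not the only one, and unseen words still get vectors because FastText composes them from character n-grams:

```python
# Assumed sketch: sentence embedding as the mean of FastText word vectors.
# Works for out-of-vocabulary words thanks to subword n-grams.
import numpy as np

def sentence_vector(model, sentence):
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(model.wv.vector_size)
    return np.mean([model.wv[token] for token in tokens], axis=0)

new_vec = sentence_vector(model_gensim, "an entirely new sentence with unseen words")
print(new_vec.shape)   # (100,) given size=100 above
```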