nlp

SpaCy 'nlp.to_disk' is not saving to disk

若如初见. Submitted on 2021-01-28 12:00:42
Question: I am trying to figure out why my custom SpaCy NER model isn't saving to disk using nlp.to_disk. I am using this condition in my python script:

# save model to output directory
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

The output_dir is defined at the top of my script as:

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir
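A minimal save-and-reload sketch, run separately from the plac-driven training script, can help confirm whether to_disk itself works; this assumes spaCy 2.x and a hypothetical ./ner_model directory:

import spacy
from pathlib import Path

nlp = spacy.blank('en')               # blank English pipeline
nlp.add_pipe(nlp.create_pipe('ner'))  # empty NER component, spaCy 2.x style

output_dir = Path('./ner_model')      # hypothetical output path
if not output_dir.exists():
    output_dir.mkdir()
nlp.to_disk(output_dir)               # write the pipeline to disk
print("Saved model to", output_dir)

nlp2 = spacy.load(output_dir)         # reload to confirm the files were written
print("Reloaded pipeline:", nlp2.pipe_names)

If this round trip succeeds, the next thing to check is whether output_dir actually reaches the script as a non-None value, since the condition above silently skips saving when it is None.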

An NLP Model that Suggest a List of Words in an Incomplete Sentence

℡╲_俬逩灬. Submitted on 2021-01-28 11:58:16
Question: I have read a number of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests words for an incomplete sentence.

Example:

Incomplete sentence: I bought an ___________ because it's rainy.
Suggested words: umbrella, soup, jacket

In the journal I read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.

Example: Incomplete sentence: I'm sad because you are __________
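One possible way to get ranked suggestions for a blank (not necessarily what the cited papers do) is a masked language model; here is a minimal sketch using the Hugging Face transformers fill-mask pipeline, assuming bert-base-uncased is available for download:

from transformers import pipeline

# Masked-language-model pipeline: fills [MASK] with ranked candidate words.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for r in fill_mask("I bought an [MASK] because it is rainy."):
    print(r["token_str"], round(r["score"], 3))  # e.g. 'umbrella' plus a probability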

How to use tokenized sentence as input for Spacy's PoS tagger?

不想你离开。 Submitted on 2021-01-28 11:00:23
Question: Spacy's PoS tagger is really convenient; it can tag a raw sentence directly.

import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")

But I'm using the tokenizer from nltk. So how can I use an already tokenized sentence like ['I', 'am', 'eating'] rather than the raw string 'I am eating' with Spacy's tagger? BTW, where can I find detailed Spacy documentation? I can only find an overview on the official website. Thanks.

Answer 1: There are two options: You write a wrapper around the nltk tokenizer and use it to
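One way to feed pre-tokenized text (a sketch, assuming spaCy 2.x) is to build a Doc directly from the word list, bypassing spaCy's tokenizer, and run only the tagger on it:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

words = ['I', 'am', 'eating']      # tokens produced by an external tokenizer, e.g. nltk
doc = Doc(nlp.vocab, words=words)  # build the Doc without spaCy's own tokenizer

tagger = nlp.get_pipe('tagger')    # apply just the PoS tagger to the pre-built Doc
tagger(doc)
print([(t.text, t.pos_, t.tag_) for t in doc])

The truncated answer's other option, wrapping the nltk tokenizer, amounts to assigning a callable that returns such a Doc to nlp.tokenizer, so the rest of the pipeline runs unchanged.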

Understanding gensim word2vec's most_similar

為{幸葍}努か Submitted on 2021-01-28 10:50:30
Question: I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example: man is to king as woman is to X; find X. I thought that is what this method is for, but from the results I am getting I don't think that is true. The documentation reads: Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively. This method computes cosine similarity between a
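For the king/man/woman analogy specifically, the usual call is shown below as a sketch; it assumes a set of pre-trained vectors (glove-wiki-gigaword-50 via gensim's downloader is just one convenient choice) whose vocabulary contains the three words:

import gensim.downloader as api

# Pre-trained word vectors; any KeyedVectors containing these words would do.
kv = api.load('glove-wiki-gigaword-50')

# 'king' and 'woman' contribute positively, 'man' negatively: X ≈ king - man + woman.
print(kv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

Passing both positive and negative lists is what turns most_similar into an analogy query; calling it with a single word only ranks that word's nearest neighbours.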

Spacy tokenizer, add tokenizer exception

早过忘川 Submitted on 2021-01-28 09:55:04
Question: Hey! I am trying to add an exception when tokenizing certain tokens using spacy 2.02. I know .tokenizer.add_special_case() exists, and I am using it for some cases, but for a token like US$100, spaCy splits it into two tokens: ('US$', 'SYM'), ('100', 'NUM'). I want it split into three; instead of adding a special case for each number after the US$, I want to make an exception for every token that has the form US$NUMBER: ('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM'). I was reading
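One generic way to handle this (a sketch, assuming spaCy 2.x; the US(?=\$\d) pattern is purely illustrative) is to extend the tokenizer's prefix rules so the leading "US" is peeled off, after which the existing currency prefix should split the "$" as well:

import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.load('en_core_web_sm')

# Add a prefix rule: split a leading "US" when it is followed by a currency amount.
prefixes = list(nlp.Defaults.prefixes) + [r'US(?=\$\d)']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp(u"It costs US$100 in total.")
print([(t.text, t.tag_) for t in doc])
# Roughly expected: 'US', '$' and '100' come out as separate tokens.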

How to stop BERT from breaking apart specific words into word-piece

ぃ、小莉子 Submitted on 2021-01-28 06:06:29
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text has many specific words, and I don't want the BERT model to break them into word-pieces. Is there any solution to this? For example:

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")

creates tokens like this:

['meta', '##sta', '##sis']

However, I want to keep the whole word as one token, like this:

['metastasis']

Answer 1: You are free to add new tokens to the
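One common route, sketched here with the Hugging Face transformers API (the from_pretrained names are illustrative), is to register the domain words as new tokens so the tokenizer never splits them:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Register whole words that must not be broken into word-pieces.
tokenizer.add_tokens(['metastasis'])
# Grow the embedding matrix so the new token ids have embeddings.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("metastasis"))  # ['metastasis'] instead of ['meta', '##sta', '##sis']

Embeddings for newly added tokens start out randomly initialized, so they only become meaningful after some fine-tuning.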

Finding Similarity between 2 sentences using word2vec of sentence with python

血红的双手。 Submitted on 2021-01-27 11:52:30
Question: I want to calculate the similarity between two sentences using word2vec. I am trying to get the vector of each sentence so that I can average its word vectors and compute the cosine similarity. I have tried this code but it is not working: the output gives sentence vectors filled with ones. I want the actual sentence vectors in sentence_1_avg_vector and sentence_2_avg_vector.

Code:

# DataSet
sent1 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'],
         ['What
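A minimal sketch of the averaged-vector approach, assuming gensim and scipy and using a tiny made-up corpus in the same token-list format as above:

import numpy as np
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine

# Toy corpus of pre-tokenized sentences (purely illustrative).
sentences = [['what', 'step', 'guide', 'invest', 'share', 'market'],
             ['what', 'guide', 'invest', 'stock', 'market']]
model = Word2Vec(sentences, min_count=1)

def avg_vector(tokens, model):
    # Average the vectors of the tokens the model actually knows.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0)

sentence_1_avg_vector = avg_vector(sentences[0], model)
sentence_2_avg_vector = avg_vector(sentences[1], model)
print(1 - cosine(sentence_1_avg_vector, sentence_2_avg_vector))  # cosine similarity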

SQL word root matching

夙愿已清 Submitted on 2021-01-27 07:41:50
Question: I'm wondering whether the major SQL engines out there (MS SQL, Oracle, MySQL) can understand that two words are related because they share the same root. We know it's easy to match "networking" when searching for "network", because the latter is a substring of the former. But do SQL engines have functions that can match "network" when searching for "networking"? Thanks a lot.

Answer 1: This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.
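Outside the database, the same stemming idea can be sketched in application code; this example assumes NLTK and only illustrates the concept the answer names (an in-engine full-text index would be the database-side equivalent):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def same_root(a, b):
    # Two words "match" if they reduce to the same stem.
    return stemmer.stem(a) == stemmer.stem(b)

print(stemmer.stem('networking'))           # 'network'
print(same_root('network', 'networking'))   # True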

Sparse Efficiency Warning while changing the column

狂风中的少年 Submitted on 2021-01-27 02:57:47
Question:

def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm

I want to manually set zero weights for some terms in the tdm matrix. Using the above code I get the warning below. I don't understand why. Is there a better way to do this?

C:\Anaconda\lib\site-packages\scipy\sparse
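The warning typically comes from assigning into a CSR/CSC matrix, which changes its sparsity structure in place. One way to avoid it (a sketch, assuming tdm is a scipy sparse term-document matrix) is to do the column edits on a LIL copy and convert back:

from scipy.sparse import csr_matrix
import numpy as np

def tdm_modify(feature_names, tdm, non_useful_words):
    indexes = [feature_names.index(word) for word in non_useful_words
               if word in feature_names]
    tdm = tdm.tolil()        # LIL handles structural changes cheaply
    for index in indexes:
        tdm[:, index] = 0    # zero out the whole column
    return tdm.tocsr()       # back to CSR for fast arithmetic

# Toy example with a hypothetical three-word vocabulary.
names = ['kill', 'said', 'city']
mat = csr_matrix(np.array([[1, 2, 0], [0, 1, 3]]))
print(tdm_modify(names, mat, ['kill', 'said']).toarray())

Keeping the matrix in CSR and accepting the one-off warning also works; the conversion just sidesteps the costly structural change scipy is warning about.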