nlp

SpaCy 'nlp.to_disk' is not saving to disk

若如初见. Submitted on 2021-01-28 12:00:42
Question: I am trying to figure out why my custom SpaCy NER model isn't saving to disk using nlp.to_disk. I am using this condition in my python script:

# save model to output directory
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

The output_dir is defined at the top of my script as:

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir
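A minimal save-and-reload sketch, run separately from the plac-driven training script, can help confirm whether to_disk itself works; this assumes spaCy 2.x and a hypothetical ./ner_model directory:

import spacy
from pathlib import Path

nlp = spacy.blank('en')               # blank English pipeline
nlp.add_pipe(nlp.create_pipe('ner'))  # empty NER component, spaCy 2.x style

output_dir = Path('./ner_model')      # hypothetical output path
if not output_dir.exists():
    output_dir.mkdir()
nlp.to_disk(output_dir)               # write the pipeline to disk
print("Saved model to", output_dir)

nlp2 = spacy.load(output_dir)         # reload to confirm the files were written
print("Reloaded pipeline:", nlp2.pipe_names)

If this round trip succeeds, the next thing to check is whether output_dir actually reaches the script as a non-None value, since the condition above silently skips saving when it is None.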

An NLP Model that Suggest a List of Words in an Incomplete Sentence

℡╲_俬逩灬. Submitted on 2021-01-28 11:58:16
Question: I have read a number of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests words for an incomplete sentence.

Example:

Incomplete sentence: I bought an ___________ because it's rainy.
Suggested words: umbrella, soup, jacket

In the journal I read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.

Example: Incomplete sentence: I'm sad because you are __________
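One possible way to get ranked suggestions for a blank (not necessarily what the cited papers do) is a masked language model; here is a minimal sketch using the Hugging Face transformers fill-mask pipeline, assuming bert-base-uncased is available for download:

from transformers import pipeline

# Masked-language-model pipeline: fills [MASK] with ranked candidate words.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for r in fill_mask("I bought an [MASK] because it is rainy."):
    print(r["token_str"], round(r["score"], 3))  # e.g. 'umbrella' plus a probability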

How to use tokenized sentence as input for Spacy's PoS tagger?

不想你离开。 Submitted on 2021-01-28 11:00:23
Question: Spacy's PoS tagger is really convenient; it can tag a raw sentence directly.

import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")

But I'm using the tokenizer from nltk. So how can I use an already tokenized sentence like ['I', 'am', 'eating'] rather than the raw string 'I am eating' with Spacy's tagger? BTW, where can I find detailed Spacy documentation? I can only find an overview on the official website. Thanks.

Answer 1: There are two options: You write a wrapper around the nltk tokenizer and use it to
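One way to feed pre-tokenized text (a sketch, assuming spaCy 2.x) is to build a Doc directly from the word list, bypassing spaCy's tokenizer, and run only the tagger on it:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

words = ['I', 'am', 'eating']      # tokens produced by an external tokenizer, e.g. nltk
doc = Doc(nlp.vocab, words=words)  # build the Doc without spaCy's own tokenizer

tagger = nlp.get_pipe('tagger')    # apply just the PoS tagger to the pre-built Doc
tagger(doc)
print([(t.text, t.pos_, t.tag_) for t in doc])

The truncated answer's other option, wrapping the nltk tokenizer, amounts to assigning a callable that returns such a Doc to nlp.tokenizer, so the rest of the pipeline runs unchanged.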

Understanding gensim word2vec's most_similar

為{幸葍}努か Submitted on 2021-01-28 10:50:30
Question: I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example: man is to king as woman is to X; find X. I thought that is what this method is for, but from the results I am getting I don't think that is true. The documentation reads: Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively. This method computes cosine similarity between a
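For the king/man/woman analogy specifically, the usual call is shown below as a sketch; it assumes a set of pre-trained vectors (glove-wiki-gigaword-50 via gensim's downloader is just one convenient choice) whose vocabulary contains the three words:

import gensim.downloader as api

# Pre-trained word vectors; any KeyedVectors containing these words would do.
kv = api.load('glove-wiki-gigaword-50')

# 'king' and 'woman' contribute positively, 'man' negatively: X ≈ king - man + woman.
print(kv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

Passing both positive and negative lists is what turns most_similar into an analogy query; calling it with a single word only ranks that word's nearest neighbours.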

Spacy tokenizer, add tokenizer exception

早过忘川 Submitted on 2021-01-28 09:55:04
Question: Hey! I am trying to add an exception when tokenizing certain tokens using spacy 2.02. I know .tokenizer.add_special_case() exists, and I am using it for some cases, but for a token like US$100, spaCy splits it into two tokens: ('US$', 'SYM'), ('100', 'NUM'). I want it split into three; instead of adding a special case for each number after the US$, I want to make an exception for every token that has the form US$NUMBER: ('US', 'PROPN'), ('$', 'SYM'), ('800', 'NUM'). I was reading
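One generic way to handle this (a sketch, assuming spaCy 2.x; the US(?=\$\d) pattern is purely illustrative) is to extend the tokenizer's prefix rules so the leading "US" is peeled off, after which the existing currency prefix should split the "$" as well:

import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.load('en_core_web_sm')

# Add a prefix rule: split a leading "US" when it is followed by a currency amount.
prefixes = list(nlp.Defaults.prefixes) + [r'US(?=\$\d)']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp(u"It costs US$100 in total.")
print([(t.text, t.tag_) for t in doc])
# Roughly expected: 'US', '$' and '100' come out as separate tokens.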

How to stop BERT from breaking apart specific words into word-piece

ぃ、小莉子 Submitted on 2021-01-28 06:06:29
Question: I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text has many specific words, and I don't want the BERT model to break them into word-pieces. Is there any solution to this? For example:

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")

creates tokens like this:

['meta', '##sta', '##sis']

However, I want to keep the whole word as one token, like this:

['metastasis']

Answer 1: You are free to add new tokens to the
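One common route, sketched here with the Hugging Face transformers API (the from_pretrained names are illustrative), is to register the domain words as new tokens so the tokenizer never splits them:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Register whole words that must not be broken into word-pieces.
tokenizer.add_tokens(['metastasis'])
# Grow the embedding matrix so the new token ids have embeddings.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("metastasis"))  # ['metastasis'] instead of ['meta', '##sta', '##sis']

Embeddings for newly added tokens start out randomly initialized, so they only become meaningful after some fine-tuning.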

Finding Similarity between 2 sentences using word2vec of sentence with python

血红的双手。 Submitted on 2021-01-27 11:52:30
Question: I want to calculate the similarity between two sentences using word2vec. I am trying to get the vector of each sentence so that I can average its word vectors and compute the cosine similarity. I have tried this code but it is not working: the output gives sentence vectors filled with ones. I want the actual sentence vectors in sentence_1_avg_vector and sentence_2_avg_vector.

Code:

# DataSet
sent1 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'],
         ['What
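A minimal sketch of the averaged-vector approach, assuming gensim and scipy and using a tiny made-up corpus in the same token-list format as above:

import numpy as np
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine

# Toy corpus of pre-tokenized sentences (purely illustrative).
sentences = [['what', 'step', 'guide', 'invest', 'share', 'market'],
             ['what', 'guide', 'invest', 'stock', 'market']]
model = Word2Vec(sentences, min_count=1)

def avg_vector(tokens, model):
    # Average the vectors of the tokens the model actually knows.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0)

sentence_1_avg_vector = avg_vector(sentences[0], model)
sentence_2_avg_vector = avg_vector(sentences[1], model)
print(1 - cosine(sentence_1_avg_vector, sentence_2_avg_vector))  # cosine similarity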

SQL word root matching

夙愿已清 Submitted on 2021-01-27 07:41:50
Question: I'm wondering whether the major SQL engines out there (MS SQL, Oracle, MySQL) can understand that two words are related because they share the same root. We know it's easy to match "networking" when searching for "network", because the latter is a substring of the former. But do SQL engines have functions that can match "network" when searching for "networking"? Thanks a lot.

Answer 1: This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.
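Outside the database, the same stemming idea can be sketched in application code; this example assumes NLTK and only illustrates the concept the answer names (an in-engine full-text index would be the database-side equivalent):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def same_root(a, b):
    # Two words "match" if they reduce to the same stem.
    return stemmer.stem(a) == stemmer.stem(b)

print(stemmer.stem('networking'))           # 'network'
print(same_root('network', 'networking'))   # True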

Sparse Efficiency Warning while changing the column

狂风中的少年 Submitted on 2021-01-27 02:57:47
Question:

def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm

I want to manually set zero weights for some terms in the tdm matrix. Using the above code I get the warning below. I don't understand why. Is there a better way to do this?

C:\Anaconda\lib\site-packages\scipy\sparse
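The warning typically comes from assigning into a CSR/CSC matrix, which changes its sparsity structure in place. One way to avoid it (a sketch, assuming tdm is a scipy sparse term-document matrix) is to do the column edits on a LIL copy and convert back:

from scipy.sparse import csr_matrix
import numpy as np

def tdm_modify(feature_names, tdm, non_useful_words):
    indexes = [feature_names.index(word) for word in non_useful_words
               if word in feature_names]
    tdm = tdm.tolil()        # LIL handles structural changes cheaply
    for index in indexes:
        tdm[:, index] = 0    # zero out the whole column
    return tdm.tocsr()       # back to CSR for fast arithmetic

# Toy example with a hypothetical three-word vocabulary.
names = ['kill', 'said', 'city']
mat = csr_matrix(np.array([[1, 2, 0], [0, 1, 3]]))
print(tdm_modify(names, mat, ['kill', 'said']).toarray())

Keeping the matrix in CSR and accepting the one-off warning also works; the conversion just sidesteps the costly structural change scipy is warning about.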