nltk

With NLTK, how can I generate different forms of a word when a certain word is given?

别来无恙 submitted on 2021-02-16 14:39:06
Question: For example, suppose the word "happy" is given; I want to generate other forms of "happy", such as "happiness", "happily", etc. I have read some previous questions on Stack Overflow and the NLTK references. However, they only cover POS tagging and morphological analysis, i.e. identifying the grammatical form of words within sentences, not generating a list of derived words. Has anyone run into a similar issue? Thank you. Answer 1: This type of information is included in the Lemma class of NLTK's
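The question above asks for morphological derivation. In NLTK the WordNet route is Lemma.derivationally_related_forms(), which the answer excerpt points at; as a dependency-free illustration of the idea, here is a toy rule-based sketch for "-y" adjectives like "happy". The function name and the rules are hypothetical, not an NLTK API:

```python
def derive_forms(adjective):
    # toy derivation rules for adjectives ending in "-y", e.g. "happy"
    if adjective.endswith("y"):
        stem = adjective[:-1]
        return [stem + "ily", stem + "iness"]   # happily, happiness
    # naive default for other adjectives
    return [adjective + "ly", adjective + "ness"]

print(derive_forms("happy"))  # -> ['happily', 'happiness']
```

Real derivations are irregular, so hand-written rules like these break quickly; WordNet's lemma links are the robust route.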

Text Preprocessing Steps, Tools, and Examples for NLP Tasks

佐手、 submitted on 2021-02-12 19:34:34
Data is the new oil, and text is an oil well we need to drill deeper into. Text data is everywhere, and before we can actually use it we must preprocess it to fit our needs. The same goes for any data: we have to clean and preprocess it to match our purpose. This post covers some simple methods for cleaning and preprocessing text data for text-analysis tasks. We will demonstrate the approach on a Covid-19 Twitter dataset. The approach has three main components: first, we clean the data and filter out all non-English tweets/text, since we want the data to be consistent; second, we create a simplified version of the complex text data; finally, we vectorize the text and save its embeddings for future analysis. Part 1: cleaning and filtering the text. First, to simplify the text, we normalize it to English characters only. This function removes all non-English characters:

    import re
    from nltk.tokenize import word_tokenize

    def clean_non_english(txt):
        txt = re.sub(r'\W+', ' ', txt)   # collapse non-word characters to spaces
        txt = txt.lower()
        word_tokens = word_tokenize(txt)
        # keep only tokens made entirely of ASCII characters
        filtered_word = [w for w in word_tokens if all(ord(c) < 128 for c in w)]
        return " ".join(filtered_word)
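The third step of the pipeline (vectorizing the text) can be sketched with a plain bag-of-words count vectorizer. This is a minimal stand-in, not the embedding model the full post may use; the function name and shape are assumptions:

```python
from collections import Counter

def vectorize(texts):
    # build a shared vocabulary over all cleaned texts
    vocab = sorted({w for t in texts for w in t.split()})
    vectors = []
    for t in texts:
        counts = Counter(t.split())
        # one count per vocabulary word, in vocabulary order
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors
```

In practice a TF-IDF weighting or a pretrained sentence encoder would replace the raw counts, but the "one row per text, one column per term" layout is the same.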

Extract named entities and their corresponding numerical values from a sentence

▼魔方 西西 submitted on 2021-02-11 13:59:26
Question: I want to extract information from sentences. Currently, I am able to do the following using spaCy: Amy's monthly payment is $2000. --> (Amy's monthly payment, $2000) However, I am trying to do the following: The monthly payments for Amy, Bob, and Eva are $2000, $3000 and $3500 respectively. --> ((Amy's monthly payment, $2000), (Bob's monthly payment, $3000), (Eva's monthly payment, $3500)) Is there any way to perform this task with an NLP method through a Python library such as spaCy?

NLTK & Python, plotting ROC curve

时光怂恿深爱的人放手 submitted on 2021-02-10 13:29:13
Question: I am using NLTK with Python and I would like to plot the ROC curve of my classifier (Naive Bayes). Is there any function for plotting it, or do I have to track the true-positive rate and false-positive rate myself? It would be great if someone could point me to some code that already does it... Thanks. Answer 1: PyROC looks simple enough: tutorial, source code. This is how it would work with the NLTK Naive Bayes classifier: # class labels are 0 and 1 labeled_data = [ (1, featureset_1), (0, featureset_2),
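If no plotting helper is at hand, the ROC points themselves are easy to track by sweeping a threshold over the classifier's positive-class scores (with NLTK's Naive Bayes these could come from classifier.prob_classify(fs).prob(1)); a minimal sketch with hypothetical data:

```python
def roc_points(scores_and_labels):
    # scores_and_labels: list of (score, true_label), label 1 = positive
    pos = sum(1 for _, y in scores_and_labels if y == 1)
    neg = len(scores_and_labels) - pos
    pts = []
    # sweep each distinct score as a decision threshold, highest first
    for th in sorted({s for s, _ in scores_and_labels}, reverse=True):
        tp = sum(1 for s, y in scores_and_labels if s >= th and y == 1)
        fp = sum(1 for s, y in scores_and_labels if s >= th and y == 0)
        pts.append((fp / neg, tp / pos))   # (FPR, TPR)
    return pts
```

The resulting (FPR, TPR) pairs can be handed to any plotting library; prepending (0, 0) gives the conventional curve starting point.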

Printing the part of speech along with the synonyms of the word

社会主义新天地 submitted on 2021-02-10 03:50:54
Question: I have the following code for taking a word from the input text file and printing the synonyms, definitions and example sentences for the word using WordNet. It separates the synonyms of the synset by part of speech, i.e., the synonyms that are verbs and the synonyms that are adjectives are printed separately. For example, for the word "flabbergasted" the synonyms are 1) flabbergast, boggle, bowl over, which are verbs, and 2) dumbfounded, dumfounded, flabbergasted, stupefied,
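The POS-based separation the question describes boils down to grouping (lemma name, POS) pairs; with WordNet these would come from synset.pos() and lemma.name(), but the grouping itself is just a dictionary. The sample pairs below are illustrative, not pulled from WordNet at runtime:

```python
from collections import defaultdict

def group_by_pos(lemmas):
    # lemmas: iterable of (name, pos) pairs, e.g. ("flabbergast", "v")
    groups = defaultdict(list)
    for name, pos in lemmas:
        groups[pos].append(name)
    return dict(groups)
```

Printing each POS bucket on its own line then reproduces the "verbs first, adjectives second" output the question wants.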

How to identify character encoding from a website?

ぐ巨炮叔叔 submitted on 2021-02-09 11:14:06
Question: What I'm trying to do: I get a list of URIs from a database and download them, remove the stopwords and count the frequency with which the words appear in the webpage, then try to save the result in MongoDB. The problem: When I try to save the result in the database I get the error bson.errors.InvalidDocument: the document must be a valid utf-8. It appears to be related to codes like '\xc3someotherstrangewords', '\xe2something'. When I'm processing the webpages I try to remove the punctuation,
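Before inserting into MongoDB, the downloaded bytes have to be decoded into real Unicode text. A common pattern is to try UTF-8 first and fall back to a legacy codec; cp1252 here is an assumption, since the page's actual charset should really come from its HTTP headers, its meta tags, or a detector such as chardet:

```python
def to_utf8_text(raw_bytes):
    # try strict decoding with the most likely codecs first
    for enc in ("utf-8", "cp1252"):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # last resort: keep going, marking undecodable bytes with U+FFFD
    return raw_bytes.decode("utf-8", errors="replace")
```

Once every string in the document is a decoded str rather than raw bytes, pymongo can serialize it without the InvalidDocument error.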

NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish 2 sentences without a space at the period (.)

老子叫甜甜 submitted on 2021-02-09 08:21:00
Question: I have 2 sentences in my dataset: w1 = I am Pusheen the cat.I am so cute. # no space after period w2 = I am Pusheen the cat. I am so cute. # with space after period When I use the NLTK tokenizer (both word and sent), NLTK cannot split "cat.I". Here is word tokenize: >>> nltk.word_tokenize(w1, 'english') ['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute'] >>> nltk.word_tokenize(w2, 'english') ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute'] and sent tokenize >>
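A common workaround is to repair the text before tokenizing by reinserting the missing space. This regex sketch assumes that a period glued directly to a letter marks a sentence boundary; it is a heuristic, and it will mis-split abbreviations such as "U.S.A.":

```python
import re

def add_space_after_period(text):
    # insert a space wherever a period is immediately followed by a letter;
    # digits are left alone so numbers like 3.14 survive
    return re.sub(r'\.(?=[A-Za-z])', '. ', text)

print(add_space_after_period("I am Pusheen the cat.I am so cute."))
# -> I am Pusheen the cat. I am so cute.
```

After this repair both nltk.word_tokenize and nltk.sent_tokenize see w1 and w2 identically.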
