nlp

Second-order cooccurrence of terms in texts

半世苍凉 submitted on 2021-02-08 08:35:18
Question: Basically, I want to reimplement this video. Given a corpus of documents, I want to find the terms that are most similar to each other. I was able to generate a cooccurrence matrix using this SO thread and, following the video, an association matrix. Next, I would like to generate a second-order cooccurrence matrix. Problem statement: consider a matrix where each row corresponds to a term and the entries in that row are the top k terms most similar to it. Say, k =
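As a sketch of the second-order idea: if C is the first-order term-term cooccurrence matrix, two terms are second-order similar when their rows of C are similar, so cosine similarity between rows yields a second-order matrix. A minimal sketch with a made-up 3-term C (this is one common construction, not necessarily the video's exact method):

    import numpy as np

    # made-up first-order cooccurrence matrix (terms x terms)
    C = np.array([[0., 2., 1.],
                  [2., 0., 3.],
                  [1., 3., 0.]])

    # cosine-normalize rows; S[i, j] is high when terms i and j co-occur
    # with the same neighbours (second-order cooccurrence)
    C_norm = C / np.linalg.norm(C, axis=1, keepdims=True)
    S = C_norm @ C_norm.T

    k = 2
    # indices of the top-k most similar terms per row, skipping the term itself
    top_k = np.argsort(-S, axis=1)[:, 1:k + 1]
    print(top_k)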

Modify NLTK word_tokenize to prevent tokenization of parenthesis

巧了我就是萌 submitted on 2021-02-08 07:32:48
Question: I have the following main.py:

    #!/usr/bin/env python
    # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
    import nltk
    import string
    import sys
    for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
        #print token
        if len(token) == 1 and not token in string.punctuation or len(token) > 1:
            print token

The output is the following:

    ./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
    EGR1
    -/-
    mouse
    embryonic
    fibroblasts

I want to slightly change the tokenizer so
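The question is truncated, but one way to keep a parenthesized marker such as (-/-) together is to replace word_tokenize with a regexp-based tokenizer. A minimal sketch, assuming that is the desired behavior (the pattern is illustrative, not the asker's solution):

    from nltk.tokenize import RegexpTokenizer

    # treat a parenthesized run of non-space characters, e.g. '(-/-)', as one
    # token; otherwise fall back to word characters and single punctuation
    tokenizer = RegexpTokenizer(r'\(\S+?\)|\w+|[^\w\s]')
    print(tokenizer.tokenize('EGR1(-/-) mouse embryonic fibroblasts'))
    # ['EGR1', '(-/-)', 'mouse', 'embryonic', 'fibroblasts']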

Is there a Python library or tool that analyzes two bodies of text for similarities in order to provide recommendations?

别来无恙 submitted on 2021-02-08 06:38:36
Question: First, apologies for being long-winded. I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback, but my primary question is: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc.) in two bodies of text? If not, how would you
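A common baseline for comparing two bodies of text is TF-IDF vectors plus cosine similarity. A minimal sketch with scikit-learn (the two document strings are placeholders); note that this measures overlap in wording, not sentiment, which would need a separate model:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    doc_a = "first body of text ..."   # placeholder
    doc_b = "second body of text ..."  # placeholder

    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform([doc_a, doc_b])
    # cosine similarity in [0, 1]; higher means more similar wording
    print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])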

Regular expressions in POS tagged NLTK corpus

荒凉一梦 submitted on 2021-02-08 06:29:14
Question: I'm loading a POS-tagged corpus in NLTK, and I would like to find certain patterns involving POS tags. These patterns can be quite complex, involving many different combinations of POS tags. Example input string:

    We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.

In this case the POS pattern is
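One way to apply regular expressions to such data is to match directly against the word/TAG string. A sketch under the assumption that a pattern like "determiner, optional adjectives, noun" is wanted (the asker's actual target pattern is truncated above):

    import re

    tagged = ("We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT "
              "historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP")

    # determiner, zero or more adjectives (JJ/JJR/JJS), then one noun (NN*)
    pattern = r'\S+/DT\s+(?:\S+/JJ[RS]?\s+)*\S+/NN\S{0,2}'
    for match in re.finditer(pattern, tagged):
        print(match.group())
    # some/DT time/NN
    # the/DT historical/JJ importance/NN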

Automatic whois data parsing

不羁岁月 submitted on 2021-02-08 05:21:12
Question: I need to parse raw WHOIS data records into fields. There is no single consistent format for the raw data, and I need to support all the possible formats (there are ~40 unique formats that I know of). For example, here are excerpts from three different WHOIS raw data records:

    Created on: 2007-01-04
    Updated on: 2014-01-29
    Expires on: 2015-01-04
    Registrant Name: 0,75 DI VALENTINO ROSSI
    Contact: 0,75 Di Valentino Rossi
    Registrant Address: Via Garibaldi 22
    Registrant City: Pradalunga
    Registrant Postal
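Given the format variance, a generic key/value pass is a common first step before per-registry templates. A naive sketch of that idea (real records need per-format handling on top):

    import re

    def parse_whois_fields(raw):
        # naive 'Key: value' extraction from a raw WHOIS record
        fields = {}
        for line in raw.splitlines():
            match = re.match(r'\s*([^:]+?)\s*:\s*(.+)', line)
            if match:
                fields[match.group(1)] = match.group(2).strip()
        return fields

    record = "Created on: 2007-01-04\nUpdated on: 2014-01-29\nExpires on: 2015-01-04"
    print(parse_whois_fields(record))
    # {'Created on': '2007-01-04', 'Updated on': '2014-01-29', 'Expires on': '2015-01-04'}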

Logistic regression: X has 667 features per sample; expecting 74869

元气小坏坏 submitted on 2021-02-08 05:16:38
Question: Using an IMDb movie reviews dataset, I have built a logistic regression model to predict the sentiment of a review.

    tfidf = TfidfVectorizer(strip_accents=None, lowercase=False,
                            preprocessor=None, tokenizer=fill, use_idf=True,
                            norm='l2', smooth_idf=True)
    y = df.sentiment.values
    X = tfidf.fit_transform(df.review)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,
                                                        test_size=0.3, shuffle=False)
    clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1, verbose
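The error in the title, "X has 667 features per sample; expecting 74869", typically means the text being predicted on was vectorized with a different, re-fitted vectorizer than the one the model was trained with. A minimal sketch of the usual fix (stand-in data; the point is fit_transform on the training text and transform only afterwards):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["a great film", "a terrible film"] * 50   # stand-in data
    y_train = [1, 0] * 50

    tfidf = TfidfVectorizer()
    X_train = tfidf.fit_transform(train_texts)        # fit the vocabulary once
    clf = LogisticRegression().fit(X_train, y_train)

    X_new = tfidf.transform(["surprisingly good"])    # transform only; no re-fit
    print(clf.predict(X_new))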

tf-idf on a somewhat large (65k) amount of text files

十年热恋 submitted on 2021-02-08 04:45:37
Question: I want to try tf-idf with scikit-learn (or NLTK; I am also open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), a user name, a message ID, and whether it is a child or parent post (in a thread, you have the original post and then replies to this OP, or nested replies; a tree). I figure
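scikit-learn's TfidfVectorizer accepts any iterable of strings, so ~65k posts can be streamed straight out of MongoDB instead of being loaded by hand. A sketch assuming hypothetical connection details, database/collection names, and a 'text' field:

    from pymongo import MongoClient
    from sklearn.feature_extraction.text import TfidfVectorizer

    # hypothetical connection string, database, collection, and field name
    posts = MongoClient("mongodb://localhost:27017")["forum"]["posts"]

    def iter_post_texts():
        for doc in posts.find({}, {"text": 1}):
            yield doc.get("text", "")

    vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
    X = vectorizer.fit_transform(iter_post_texts())  # sparse matrix, one row per post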

Custom sentence boundary detection in SpaCy

六眼飞鱼酱① submitted on 2021-02-08 01:51:24
Question: I'm trying to write a custom sentence segmenter in spaCy that returns the whole document as a single sentence. I wrote a custom pipeline component that does this using the code from here. I can't get it to work, though: instead of changing the sentence boundaries so that the whole document is one sentence, it throws two different errors. If I create a blank language instance and add only my custom component to the pipeline, I get this error: ValueError: Sentence boundary detection
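In a blank pipeline this error usually means no component has set is_sent_start on the tokens. A minimal sketch of a component that marks the whole doc as one sentence, assuming the spaCy v3 component API (which may not match the asker's version):

    import spacy
    from spacy.language import Language

    @Language.component("single_sentence")
    def single_sentence(doc):
        # every token's is_sent_start must be set explicitly (True or False);
        # leaving any of them as None can trigger the ValueError above
        for i, token in enumerate(doc):
            token.is_sent_start = (i == 0)
        return doc

    nlp = spacy.blank("en")
    nlp.add_pipe("single_sentence")
    doc = nlp("First sentence. Second sentence.")
    print(list(doc.sents))  # one span covering the whole document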

Remove accents and keep under dots in Python

↘锁芯ラ submitted on 2021-02-07 19:30:16
Question: I am working on an NLP task that requires using a corpus of the Yoruba language. Yoruba is a language whose alphabet uses diacritics (accents) and under dots. For instance, this is a Yoruba string: "ọmọàbúròẹlẹ́wà", and I need to remove the accents but keep the under dots. I have tried the unidecode library in Python, but it removes both the accents and the under dots.

    import unidecode
    ac_stng = "ọmọàbúròẹlẹ́wà"
    unac_stng = unidecode.unidecode(ac_stng)

I expect the output to be
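One standard-library approach is to NFD-decompose the string, drop all combining marks except COMBINING DOT BELOW (U+0323), and recompose. A sketch of that idea:

    import unicodedata

    def strip_accents_keep_underdots(text):
        # decompose so accents and under dots become separate combining marks
        decomposed = unicodedata.normalize("NFD", text)
        # keep base characters and the under dot (U+0323); drop other marks
        kept = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch) or ch == "\u0323")
        return unicodedata.normalize("NFC", kept)

    print(strip_accents_keep_underdots("ọmọàbúròẹlẹ́wà"))  # ọmọaburoẹlẹwa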