nlp

Second-order cooccurrence of terms in texts

半世苍凉 submitted on 2021-02-08 08:35:18
Question: Basically, I want to reimplement this video. Given a corpus of documents, I want to find the terms that are most similar to each other. I was able to generate a cooccurrence matrix using this SO thread and, following the video, an association matrix. Next, I would like to generate a second-order cooccurrence matrix. Problem statement: consider a matrix where each row corresponds to a term and the entries in that row are the top k terms most similar to it. Say, k =
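As a sketch of the second-order idea: if C is the first-order term-term cooccurrence matrix, two terms are second-order similar when their rows of C are similar, so cosine similarity between rows yields a second-order matrix. A minimal sketch with a made-up 3-term C (this is one common construction, not necessarily the video's exact method):

    import numpy as np

    # made-up first-order cooccurrence matrix (terms x terms)
    C = np.array([[0., 2., 1.],
                  [2., 0., 3.],
                  [1., 3., 0.]])

    # cosine-normalize rows; S[i, j] is high when terms i and j co-occur
    # with the same neighbours (second-order cooccurrence)
    C_norm = C / np.linalg.norm(C, axis=1, keepdims=True)
    S = C_norm @ C_norm.T

    k = 2
    # indices of the top-k most similar terms per row, skipping the term itself
    top_k = np.argsort(-S, axis=1)[:, 1:k + 1]
    print(top_k)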

Modify NLTK word_tokenize to prevent tokenization of parenthesis

巧了我就是萌 submitted on 2021-02-08 07:32:48
Question: I have the following main.py:

    #!/usr/bin/env python
    # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
    import nltk
    import string
    import sys
    for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
        #print token
        if len(token) == 1 and not token in string.punctuation or len(token) > 1:
            print token

The output is the following:

    ./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
    EGR1
    -/-
    mouse
    embryonic
    fibroblasts

I want to slightly change the tokenizer so
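The question is truncated, but one way to keep a parenthesized marker such as (-/-) together is to replace word_tokenize with a regexp-based tokenizer. A minimal sketch, assuming that is the desired behavior (the pattern is illustrative, not the asker's solution):

    from nltk.tokenize import RegexpTokenizer

    # treat a parenthesized run of non-space characters, e.g. '(-/-)', as one
    # token; otherwise fall back to word characters and single punctuation
    tokenizer = RegexpTokenizer(r'\(\S+?\)|\w+|[^\w\s]')
    print(tokenizer.tokenize('EGR1(-/-) mouse embryonic fibroblasts'))
    # ['EGR1', '(-/-)', 'mouse', 'embryonic', 'fibroblasts']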

Is there a Python library or tool that analyzes two bodies of text for similarities in order to provide recommendations?

别来无恙 submitted on 2021-02-08 06:38:36
Question: First, apologies for being long-winded. I'm not a mathematician, so I'm hoping there's a "dumbed down" solution to this. In short, I'm attempting to compare two bodies of text to generate recommendations. What you'll see below is a novice attempt at measuring similarity using NLP. I'm open to all feedback, but my primary question is: does the method described below serve as an accurate means of finding similarities (in wording, sentiment, etc.) in two bodies of text? If not, how would you
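A common baseline for comparing two bodies of text is TF-IDF vectors plus cosine similarity. A minimal sketch with scikit-learn (the two document strings are placeholders); note that this measures overlap in wording, not sentiment, which would need a separate model:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    doc_a = "first body of text ..."   # placeholder
    doc_b = "second body of text ..."  # placeholder

    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform([doc_a, doc_b])
    # cosine similarity in [0, 1]; higher means more similar wording
    print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])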

Regular expressions in POS tagged NLTK corpus

荒凉一梦 submitted on 2021-02-08 06:29:14
Question: I'm loading a POS-tagged corpus in NLTK, and I would like to find certain patterns involving POS tags. These patterns can be quite complex, involving many different combinations of POS tags. Example input string:

    We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.

In this case the POS pattern is
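One way to apply regular expressions to such data is to match directly against the word/TAG string. A sketch under the assumption that a pattern like "determiner, optional adjectives, noun" is wanted (the asker's actual target pattern is truncated above):

    import re

    tagged = ("We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT "
              "historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP")

    # determiner, zero or more adjectives (JJ/JJR/JJS), then one noun (NN*)
    pattern = r'\S+/DT\s+(?:\S+/JJ[RS]?\s+)*\S+/NN\S{0,2}'
    for match in re.finditer(pattern, tagged):
        print(match.group())
    # some/DT time/NN
    # the/DT historical/JJ importance/NN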

Automatic whois data parsing

不羁岁月 submitted on 2021-02-08 05:21:12
Question: I need to parse raw WHOIS data records into fields. There is no single consistent format for the raw data, and I need to support all the possible formats (there are ~40 unique formats that I know of). For example, here are excerpts from three different WHOIS raw data records:

    Created on: 2007-01-04
    Updated on: 2014-01-29
    Expires on: 2015-01-04
    Registrant Name: 0,75 DI VALENTINO ROSSI
    Contact: 0,75 Di Valentino Rossi
    Registrant Address: Via Garibaldi 22
    Registrant City: Pradalunga
    Registrant Postal
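Given the format variance, a generic key/value pass is a common first step before per-registry templates. A naive sketch of that idea (real records need per-format handling on top):

    import re

    def parse_whois_fields(raw):
        # naive 'Key: value' extraction from a raw WHOIS record
        fields = {}
        for line in raw.splitlines():
            match = re.match(r'\s*([^:]+?)\s*:\s*(.+)', line)
            if match:
                fields[match.group(1)] = match.group(2).strip()
        return fields

    record = "Created on: 2007-01-04\nUpdated on: 2014-01-29\nExpires on: 2015-01-04"
    print(parse_whois_fields(record))
    # {'Created on': '2007-01-04', 'Updated on': '2014-01-29', 'Expires on': '2015-01-04'}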

Logistic regression: X has 667 features per sample; expecting 74869

元气小坏坏 submitted on 2021-02-08 05:16:38
Question: Using an IMDb movie reviews dataset, I have built a logistic regression model to predict the sentiment of a review.

    tfidf = TfidfVectorizer(strip_accents=None, lowercase=False,
                            preprocessor=None, tokenizer=fill, use_idf=True,
                            norm='l2', smooth_idf=True)
    y = df.sentiment.values
    X = tfidf.fit_transform(df.review)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,
                                                        test_size=0.3, shuffle=False)
    clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1, verbose
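The error in the title, "X has 667 features per sample; expecting 74869", typically means the text being predicted on was vectorized with a different, re-fitted vectorizer than the one the model was trained with. A minimal sketch of the usual fix (stand-in data; the point is fit_transform on the training text and transform only afterwards):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = ["a great film", "a terrible film"] * 50   # stand-in data
    y_train = [1, 0] * 50

    tfidf = TfidfVectorizer()
    X_train = tfidf.fit_transform(train_texts)        # fit the vocabulary once
    clf = LogisticRegression().fit(X_train, y_train)

    X_new = tfidf.transform(["surprisingly good"])    # transform only; no re-fit
    print(clf.predict(X_new))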

tf-idf on a somewhat large (65k) amount of text files

十年热恋 submitted on 2021-02-08 04:45:37
Question: I want to try tf-idf with scikit-learn (or NLTK; I am also open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), a user name, a message ID, and whether it is a child or parent post (in a thread, you have the original post and then replies to this OP, or nested replies; a tree). I figure
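scikit-learn's TfidfVectorizer accepts any iterable of strings, so ~65k posts can be streamed straight out of MongoDB instead of being loaded by hand. A sketch assuming hypothetical connection details, database/collection names, and a 'text' field:

    from pymongo import MongoClient
    from sklearn.feature_extraction.text import TfidfVectorizer

    # hypothetical connection string, database, collection, and field name
    posts = MongoClient("mongodb://localhost:27017")["forum"]["posts"]

    def iter_post_texts():
        for doc in posts.find({}, {"text": 1}):
            yield doc.get("text", "")

    vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
    X = vectorizer.fit_transform(iter_post_texts())  # sparse matrix, one row per post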

Custom sentence boundary detection in SpaCy

六眼飞鱼酱① submitted on 2021-02-08 01:51:24
Question: I'm trying to write a custom sentence segmenter in spaCy that returns the whole document as a single sentence. I wrote a custom pipeline component that does this using the code from here. I can't get it to work, though: instead of changing the sentence boundaries so that the whole document is one sentence, it throws two different errors. If I create a blank language instance and add only my custom component to the pipeline, I get this error: ValueError: Sentence boundary detection
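In a blank pipeline this error usually means no component has set is_sent_start on the tokens. A minimal sketch of a component that marks the whole doc as one sentence, assuming the spaCy v3 component API (which may not match the asker's version):

    import spacy
    from spacy.language import Language

    @Language.component("single_sentence")
    def single_sentence(doc):
        # every token's is_sent_start must be set explicitly (True or False);
        # leaving any of them as None can trigger the ValueError above
        for i, token in enumerate(doc):
            token.is_sent_start = (i == 0)
        return doc

    nlp = spacy.blank("en")
    nlp.add_pipe("single_sentence")
    doc = nlp("First sentence. Second sentence.")
    print(list(doc.sents))  # one span covering the whole document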

Remove accents and keep under dots in Python

↘锁芯ラ submitted on 2021-02-07 19:30:16
Question: I am working on an NLP task that requires using a corpus of the Yoruba language. Yoruba is a language whose alphabet uses diacritics (accents) and under dots. For instance, this is a Yoruba string: "ọmọàbúròẹlẹ́wà", and I need to remove the accents but keep the under dots. I have tried the unidecode library in Python, but it removes both the accents and the under dots.

    import unidecode
    ac_stng = "ọmọàbúròẹlẹ́wà"
    unac_stng = unidecode.unidecode(ac_stng)

I expect the output to be
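One standard-library approach is to NFD-decompose the string, drop all combining marks except COMBINING DOT BELOW (U+0323), and recompose. A sketch of that idea:

    import unicodedata

    def strip_accents_keep_underdots(text):
        # decompose so accents and under dots become separate combining marks
        decomposed = unicodedata.normalize("NFD", text)
        # keep base characters and the under dot (U+0323); drop other marks
        kept = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch) or ch == "\u0323")
        return unicodedata.normalize("NFC", kept)

    print(strip_accents_keep_underdots("ọmọàbúròẹlẹ́wà"))  # ọmọaburoẹlẹwa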