text-analysis

Categorize Businesses with Text Analytics in Python

Submitted by 混江龙づ霸主 on 2020-01-05 09:04:48
Question: I'm a newbie to AI and want to perform the exercise below. Can you please suggest a way to achieve it using Python? Scenario: I have a list of business areas for some companies, like:

1. AI
2. Artificial Intelligence
3. VR
4. Virtual reality
5. Mobile application
6. Desktop softwares

and want to categorize them as below (see the sketch after this mapping):

Technology ---> Category
1. AI ---> Category Artificial Intelligence
2. Artificial Intelligence ---> Category Artificial Intelligence
3. VR ---> Category Virtual Reality
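
A minimal sketch of one way to do this, assuming the mapping can be driven by a hand-built alias table rather than a trained model (the categories and alias lists below are illustrative placeholders):

CATEGORY_ALIASES = {
    "Artificial Intelligence": {"ai", "artificial intelligence"},
    "Virtual Reality": {"vr", "virtual reality"},
    "Mobile": {"mobile application"},
    "Desktop": {"desktop software", "desktop softwares"},
}

def categorize(term):
    # Normalize, then look the term up in each category's alias set.
    normalized = term.strip().lower()
    for category, aliases in CATEGORY_ALIASES.items():
        if normalized in aliases:
            return category
    return "Unknown"

for tech in ["AI", "Artificial Intelligence", "VR", "Virtual reality"]:
    print(tech, "--->", categorize(tech))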

ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

Submitted by 孤人 on 2019-12-31 02:28:31
Question: Here is my code:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
    subset='all',
    categories=['alt.atheism', 'sci.space']
)
X = newsgroups.data
y = newsgroups.target
TD_IF = TfidfVectorizer()
y_scaled = TD_IF.fit_transform(newsgroups, y)
grid = {'C': np.power(10.0, np.arange(-5, 6))}
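
The shapes in the error (6 vs. 1786) suggest the vectorizer was fed the newsgroups Bunch object itself, whose handful of attributes become the "samples", instead of the 1786 documents. A minimal sketch of the likely fix, passing newsgroups.data to fit_transform:

from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer

newsgroups = datasets.fetch_20newsgroups(
    subset='all', categories=['alt.atheism', 'sci.space'])
# Vectorize the raw documents, not the Bunch object that wraps them.
X = TfidfVectorizer().fit_transform(newsgroups.data)
y = newsgroups.target
print(X.shape[0], len(y))  # both now report the same number of samples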

How to detect duplicates among text documents and return the duplicates' similarity?

Submitted by 荒凉一梦 on 2019-12-28 03:10:09
Question: I'm writing a crawler to get content from some websites, but the content can be duplicated and I want to avoid that. So I need a function that returns the percentage of similarity between two texts, to detect content that may be duplicated. Example:

Text 1: "I'm writing a crawler to"
Text 2: "I'm writing a some text crawler to get"

The compare function should report that text 2 matches text 1 at 5/8 (where 5 is the number of words in text 2 that also occur in text 1, compared in word order, and 8 is the total number of words in text 2). If remove the "some
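
One way to get that word-overlap ratio is difflib from the standard library; a minimal sketch that reproduces the 5/8 from the example (tokenizing by plain split() is an assumption):

import difflib

def similarity(text1, text2):
    # Fraction of text2's words that also appear in text1, in order.
    words1, words2 = text1.split(), text2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(words2) if words2 else 0.0

print(similarity("I'm writing a crawler to",
                 "I'm writing a some text crawler to get"))  # 5/8 = 0.625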

Extract relevant text from a .txt file in R

Submitted by ⅰ亾dé卋堺 on 2019-12-25 19:11:08
Question: I am still at a basic beginner level with R. I am currently working on some natural language processing and I use the ProQuest Newsstand database. Even though the database allows you to download txt files, I don't need everything they provide. The files you can download there look like this:

###############################################################################
____________________________________________________________
Report Information from ProQuest
16 July 2016 09:58
____________________
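
The question asks for R, but purely to illustrate the parsing idea, here is a sketch in Python. It assumes, based on the truncated sample above, that records are separated by long runs of underscores, and it guesses "Full text:" as the label of the article body; both are assumptions to adapt to the real files:

import re

with open("proquest_export.txt", encoding="utf-8") as f:  # hypothetical filename
    raw = f.read()

# Split the export into records on separator lines of 10 or more underscores.
records = [r.strip() for r in re.split(r"_{10,}", raw) if r.strip()]

for record in records:
    # "Full text:" is a guessed field label; replace it with the real one.
    match = re.search(r"Full text:\s*(.+)", record, flags=re.DOTALL)
    if match:
        print(match.group(1)[:200])  # first 200 chars of each article body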

How to extract n-gram word sequences from text in Postgres

Submitted by 大憨熊 on 2019-12-24 04:57:06
Question: I am hoping to use Postgres to extract sequences of words from text. For example, the whole-word trigrams for the following sentence "ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium" would be:

"ed ut perspiciatis"
"ut perspiciatis unde"
"perspiciatis unde omnis"
...

I have been doing this with R, but I am hoping Postgres would be able to handle it more efficiently. I have seen a similar question asked here (n-grams from text in PostgreSQL) but I don't understand how to
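
One possible approach, sketched from Python via psycopg2 (the connection string is a placeholder): number the words with regexp_split_to_table(...) WITH ORDINALITY, then stitch each word to the two that follow it with the lead() window function:

import psycopg2

TRIGRAM_SQL = """
SELECT word || ' ' || w2 || ' ' || w3 AS trigram
FROM (
    SELECT word,
           lead(word, 1) OVER (ORDER BY pos) AS w2,
           lead(word, 2) OVER (ORDER BY pos) AS w3
    FROM regexp_split_to_table(%s, '\\s+') WITH ORDINALITY AS t(word, pos)
) g
WHERE w3 IS NOT NULL;
"""

text = ("ed ut perspiciatis unde omnis iste natus error sit "
        "voluptatem accusantium")

with psycopg2.connect("dbname=mydb") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TRIGRAM_SQL, (text,))
        for (trigram,) in cur.fetchall():
            print(trigram)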

Create a sentence (row) to POS-tag counts (column) matrix from a dataframe

Submitted by 自作多情 on 2019-12-22 12:37:16
Question: I am trying to build a matrix where the first row lists parts of speech and the first column lists sentences; the values in the matrix should show the count of each POS tag in each sentence. So I am creating the POS tags this way:

import nltk
import pandas as pd

data = pd.read_csv(open('myfile.csv'), sep=';')
target = data["label"]
del data["label"]
data.sentence = data.sentence.str.lower()  # all strings in data frame to lowercase
for line in data.sentence:
    Line_new = nltk.pos_tag(nltk.word_tokenize(line))
    print(Line_new)

The output is: [(
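
A minimal sketch of how the per-sentence tag lists can become the counts matrix, assuming NLTK's tokenizer and tagger models are installed (punkt and averaged_perceptron_tagger); the two sentences are toy stand-ins for the CSV column:

from collections import Counter
import nltk
import pandas as pd

sentences = ["the cat sat on the mat", "dogs bark loudly"]

rows = []
for sent in sentences:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sent))]
    rows.append(Counter(tags))

# One row per sentence, one column per POS tag; tags a sentence lacks become 0.
matrix = pd.DataFrame(rows, index=sentences).fillna(0).astype(int)
print(matrix)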

R Text Mining with quanteda

Submitted by 情到浓时终转凉″ on 2019-12-22 00:28:03
Question: I have a data set of Facebook posts (via Netvizz) and I use the quanteda package in R. Here is my R code:

# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read file; Facebook posts can be generated by FB Netvizz:
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(FB_com$comment
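
The snippet above is R/quanteda; purely to illustrate what a dictionary-based analysis does under the hood, here is a sketch in Python. The two categories and their word lists are invented placeholders, not the real LIWC2001 dictionary:

from collections import Counter

dictionary = {
    "posemo": {"good", "great", "happy"},
    "negemo": {"bad", "sad", "angry"},
}

def count_categories(text):
    # Count how many tokens in the text fall in each dictionary category.
    counts = Counter()
    for token in text.lower().split():
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    return counts

print(count_categories("What a great and happy day, not a bad one"))
# Counter({'posemo': 2, 'negemo': 1})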

tm Package error: Error defining Document Term Matrix

Submitted by 心已入冬 on 2019-12-21 21:19:20
Question: I am analyzing the Reuters 21578 corpus (all the Reuters news articles from 1987) using the "tm" package. After importing the XML files into an R data file, I clean the text: convert to plain text, convert to lower case, remove stop words, etc. (as seen below). Then I try to convert the corpus to a document term matrix but receive an error message:

Error in UseMethod("Content", x) : no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work
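
The question concerns R's tm package; for comparison, here is the same clean-then-tabulate step sketched in Python with scikit-learn, whose CountVectorizer performs the lowercasing and stop-word removal in one pass (the two sample documents are invented stand-ins for the Reuters articles):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "U.S. crude oil exports rose sharply in 1987.",
    "Grain shipments from the Gulf slowed this week.",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(dtm.toarray())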

Convert sparse matrix (csc_matrix) to pandas dataframe

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-18 15:02:27
Question: I want to convert this matrix into a pandas dataframe.

[csc_matrix printout]

The first number in the bracket should be the index, the second number being the columns, and the number at the end being the data. I want to do this for feature selection in text analysis: the first number represents the document, the second the word feature, and the last number the TF-IDF score. Getting a dataframe helps me to transform the text analysis problem into a data analysis problem.

Answer 1:

from scipy.sparse import
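
One way to get those (document, feature, score) triplets is to go through COO format, whose row, col, and data arrays line up element by element; a minimal sketch with a toy matrix:

import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

m = csc_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.2, 0.0, 0.9]]))

coo = m.tocoo()  # COO exposes aligned row/col/data arrays
df = pd.DataFrame({"document": coo.row,
                   "feature": coo.col,
                   "tfidf": coo.data})
print(df)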