text-analysis

Categorize Businesses with Text Analytics in Python

Submitted by 混江龙づ霸主 on 2020-01-05 09:04:48
Question: I'm a newbie to AI and want to perform the exercise below. Can you please suggest a way to achieve it using Python? Scenario: I have a list of business areas for some companies, like:

1. AI
2. Artificial Intelligence
3. VR
4. Virtual reality
5. Mobile application
6. Desktop softwares

and want to categorize them as below (see the sketch after this mapping):

Technology ---> Category
1. AI ---> Category Artificial Intelligence
2. Artificial Intelligence ---> Category Artificial Intelligence
3. VR ---> Category Virtual Reality
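
A minimal sketch of one way to do this, assuming the mapping can be driven by a hand-built alias table rather than a trained model (the categories and alias lists below are illustrative placeholders):

CATEGORY_ALIASES = {
    "Artificial Intelligence": {"ai", "artificial intelligence"},
    "Virtual Reality": {"vr", "virtual reality"},
    "Mobile": {"mobile application"},
    "Desktop": {"desktop software", "desktop softwares"},
}

def categorize(term):
    # Normalize, then look the term up in each category's alias set.
    normalized = term.strip().lower()
    for category, aliases in CATEGORY_ALIASES.items():
        if normalized in aliases:
            return category
    return "Unknown"

for tech in ["AI", "Artificial Intelligence", "VR", "Virtual reality"]:
    print(tech, "--->", categorize(tech))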

ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

Submitted by 孤人 on 2019-12-31 02:28:31
Question: Here is my code:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
    subset='all',
    categories=['alt.atheism', 'sci.space']
)
X = newsgroups.data
y = newsgroups.target
TD_IF = TfidfVectorizer()
y_scaled = TD_IF.fit_transform(newsgroups, y)
grid = {'C': np.power(10.0, np.arange(-5, 6))}
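
The shapes in the error (6 vs. 1786) suggest the vectorizer was fed the newsgroups Bunch object itself, whose handful of attributes become the "samples", instead of the 1786 documents. A minimal sketch of the likely fix, passing newsgroups.data to fit_transform:

from sklearn import datasets
from sklearn.feature_extraction.text import TfidfVectorizer

newsgroups = datasets.fetch_20newsgroups(
    subset='all', categories=['alt.atheism', 'sci.space'])
# Vectorize the raw documents, not the Bunch object that wraps them.
X = TfidfVectorizer().fit_transform(newsgroups.data)
y = newsgroups.target
print(X.shape[0], len(y))  # both now report the same number of samples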

How to detect duplicates among text documents and return the duplicates' similarity?

Submitted by 荒凉一梦 on 2019-12-28 03:10:09
Question: I'm writing a crawler to get content from some websites, but the content can be duplicated and I want to avoid that. So I need a function that returns the percentage of similarity between two texts, to detect content that may be duplicated. Example:

Text 1: "I'm writing a crawler to"
Text 2: "I'm writing a some text crawler to get"

The compare function should report that text 2 matches text 1 at 5/8 (where 5 is the number of words in text 2 that also occur in text 1, compared in word order, and 8 is the total number of words in text 2). If remove the "some
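
One way to get that word-overlap ratio is difflib from the standard library; a minimal sketch that reproduces the 5/8 from the example (tokenizing by plain split() is an assumption):

import difflib

def similarity(text1, text2):
    # Fraction of text2's words that also appear in text1, in order.
    words1, words2 = text1.split(), text2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(words2) if words2 else 0.0

print(similarity("I'm writing a crawler to",
                 "I'm writing a some text crawler to get"))  # 5/8 = 0.625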

Extract relevant text from a .txt file in R

Submitted by ⅰ亾dé卋堺 on 2019-12-25 19:11:08
Question: I am still at a basic beginner level with R. I am currently working on some natural language processing and I use the ProQuest Newsstand database. Even though the database allows you to download txt files, I don't need everything they provide. The files you can download there look like this:

###############################################################################
____________________________________________________________
Report Information from ProQuest
16 July 2016 09:58
____________________
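
The question asks for R, but purely to illustrate the parsing idea, here is a sketch in Python. It assumes, based on the truncated sample above, that records are separated by long runs of underscores, and it guesses "Full text:" as the label of the article body; both are assumptions to adapt to the real files:

import re

with open("proquest_export.txt", encoding="utf-8") as f:  # hypothetical filename
    raw = f.read()

# Split the export into records on separator lines of 10 or more underscores.
records = [r.strip() for r in re.split(r"_{10,}", raw) if r.strip()]

for record in records:
    # "Full text:" is a guessed field label; replace it with the real one.
    match = re.search(r"Full text:\s*(.+)", record, flags=re.DOTALL)
    if match:
        print(match.group(1)[:200])  # first 200 chars of each article body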

How to extract n-gram word sequences from text in Postgres

Submitted by 大憨熊 on 2019-12-24 04:57:06
Question: I am hoping to use Postgres to extract sequences of words from text. For example, the whole-word trigrams for the following sentence "ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium" would be:

"ed ut perspiciatis"
"ut perspiciatis unde"
"perspiciatis unde omnis"
...

I have been doing this with R, but I am hoping Postgres would be able to handle it more efficiently. I have seen a similar question asked here (n-grams from text in PostgreSQL) but I don't understand how to
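
One possible approach, sketched from Python via psycopg2 (the connection string is a placeholder): number the words with regexp_split_to_table(...) WITH ORDINALITY, then stitch each word to the two that follow it with the lead() window function:

import psycopg2

TRIGRAM_SQL = """
SELECT word || ' ' || w2 || ' ' || w3 AS trigram
FROM (
    SELECT word,
           lead(word, 1) OVER (ORDER BY pos) AS w2,
           lead(word, 2) OVER (ORDER BY pos) AS w3
    FROM regexp_split_to_table(%s, '\\s+') WITH ORDINALITY AS t(word, pos)
) g
WHERE w3 IS NOT NULL;
"""

text = ("ed ut perspiciatis unde omnis iste natus error sit "
        "voluptatem accusantium")

with psycopg2.connect("dbname=mydb") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TRIGRAM_SQL, (text,))
        for (trigram,) in cur.fetchall():
            print(trigram)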

Create a sentence (row) to POS-tag counts (column) matrix from a dataframe

Submitted by 自作多情 on 2019-12-22 12:37:16
Question: I am trying to build a matrix where the first row lists parts of speech and the first column lists sentences; the values in the matrix should show the count of each POS tag in each sentence. So I am creating the POS tags this way:

import nltk
import pandas as pd

data = pd.read_csv(open('myfile.csv'), sep=';')
target = data["label"]
del data["label"]
data.sentence = data.sentence.str.lower()  # all strings in data frame to lowercase
for line in data.sentence:
    Line_new = nltk.pos_tag(nltk.word_tokenize(line))
    print(Line_new)

The output is: [(
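
A minimal sketch of how the per-sentence tag lists can become the counts matrix, assuming NLTK's tokenizer and tagger models are installed (punkt and averaged_perceptron_tagger); the two sentences are toy stand-ins for the CSV column:

from collections import Counter
import nltk
import pandas as pd

sentences = ["the cat sat on the mat", "dogs bark loudly"]

rows = []
for sent in sentences:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sent))]
    rows.append(Counter(tags))

# One row per sentence, one column per POS tag; tags a sentence lacks become 0.
matrix = pd.DataFrame(rows, index=sentences).fillna(0).astype(int)
print(matrix)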

R Text Mining with quanteda

Submitted by 情到浓时终转凉″ on 2019-12-22 00:28:03
Question: I have a data set of Facebook posts (via Netvizz) and I use the quanteda package in R. Here is my R code:

# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read file; Facebook posts can be generated by FB Netvizz:
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(FB_com$comment
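
The snippet above is R/quanteda; purely to illustrate what a dictionary-based analysis does under the hood, here is a sketch in Python. The two categories and their word lists are invented placeholders, not the real LIWC2001 dictionary:

from collections import Counter

dictionary = {
    "posemo": {"good", "great", "happy"},
    "negemo": {"bad", "sad", "angry"},
}

def count_categories(text):
    # Count how many tokens in the text fall in each dictionary category.
    counts = Counter()
    for token in text.lower().split():
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    return counts

print(count_categories("What a great and happy day, not a bad one"))
# Counter({'posemo': 2, 'negemo': 1})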

tm Package error: Error defining Document Term Matrix

Submitted by 心已入冬 on 2019-12-21 21:19:20
Question: I am analyzing the Reuters 21578 corpus (all the Reuters news articles from 1987) using the "tm" package. After importing the XML files into an R data file, I clean the text: convert to plain text, convert to lower case, remove stop words, etc. (as seen below). Then I try to convert the corpus to a document term matrix but receive an error message:

Error in UseMethod("Content", x) : no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work
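
The question concerns R's tm package; for comparison, here is the same clean-then-tabulate step sketched in Python with scikit-learn, whose CountVectorizer performs the lowercasing and stop-word removal in one pass (the two sample documents are invented stand-ins for the Reuters articles):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "U.S. crude oil exports rose sharply in 1987.",
    "Grain shipments from the Gulf slowed this week.",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names_out())
print(dtm.toarray())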

Convert sparse matrix (csc_matrix) to pandas dataframe

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-18 15:02:27
Question: I want to convert this matrix into a pandas dataframe.

[csc_matrix printout]

The first number in the bracket should be the index, the second number being the columns, and the number at the end being the data. I want to do this for feature selection in text analysis: the first number represents the document, the second the word feature, and the last number the TF-IDF score. Getting a dataframe helps me to transform the text analysis problem into a data analysis problem.

Answer 1:

from scipy.sparse import
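
One way to get those (document, feature, score) triplets is to go through COO format, whose row, col, and data arrays line up element by element; a minimal sketch with a toy matrix:

import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

m = csc_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.2, 0.0, 0.9]]))

coo = m.tocoo()  # COO exposes aligned row/col/data arrays
df = pd.DataFrame({"document": coo.row,
                   "feature": coo.col,
                   "tfidf": coo.data})
print(df)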