text-classification

Concept Behind The Transformed Data Of LDA Model

◇◆丶佛笑我妖孽 submitted on 2020-01-05 03:36:18
Question: My question is related to Latent Dirichlet Allocation. Suppose we apply LDA to our dataset and then call fit_transform on it. The output is a matrix with one row per document (five documents here), and each row gives that document's distribution over three topics. The output is below:

    [[ 0.0922935   0.09218227  0.81552423]
     [ 0.81396651  0.09409428  0.09193921]
     [ 0.05265482  0.05240119  0.89494398]
     [ 0.05278187  0.89455775  0.05266038]
     [ 0.85209554  0.07338382  0.07452064]]

So, this is the matrix that will be sent to a classification
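For context, a minimal sketch of how such a document-topic matrix is produced and handed to a downstream classifier in scikit-learn; the data and names here are illustrative, not the asker's:

    # Illustrative sketch: LDA topic proportions used as classifier features.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    docs = ["apple banana fruit", "dog cat pet", "fruit salad apple",
            "pet food dog", "banana smoothie"]
    labels = [0, 1, 0, 1, 0]  # hypothetical class labels for the five documents

    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    X_topics = lda.fit_transform(counts)  # shape (5, 3): one row per document,
                                          # each row a distribution over 3 topics

    clf = LogisticRegression().fit(X_topics, labels)  # topic proportions as features

Each row sums to 1, which is why one value per row dominates when a document is strongly associated with a single topic.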

Naive Bayes in Quanteda vs caret: wildly different results

眉间皱痕 submitted on 2020-01-01 12:23:31
Question: I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right. Here is some code for reproduction. First, on the quanteda side:

    library(quanteda)
    library(quanteda.corpora)
    library(caret)
    corp <- data_corpus_movies
    set.seed(300)
    id_train <- sample(docnames(corp), size = 1500, replace = FALSE)
    # get
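A frequent cause of gaps like this is that the two packages fit different naive Bayes variants: quanteda's textmodel_nb is multinomial by default, while caret's nb method typically models each feature's distribution separately. As a language-neutral illustration (sketched in Python with hypothetical data), the variant alone can flip a prediction on identical counts:

    # Illustration: the same document-term counts scored by two Naive Bayes
    # variants can disagree, one common reason two packages' results differ.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    train = ["good great fun", "bad awful boring", "great movie", "boring plot"]
    y = ["pos", "neg", "pos", "neg"]
    test = ["fun fun fun boring"]

    vec = CountVectorizer()
    X = vec.fit_transform(train)
    X_test = vec.transform(test)

    print(MultinomialNB().fit(X, y).predict(X_test))  # uses term frequencies
    print(BernoulliNB().fit(X, y).predict(X_test))    # uses presence/absence only

So before debugging the caret call itself, it is worth checking that both sides are fitting the same model family on the same representation.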

How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

我们两清 submitted on 2020-01-01 04:11:45
Question: I am familiar with using BOW (bag-of-words) features for text classification: we first find the size of the vocabulary of the corpus, which becomes the size of our feature vector, and then, for each sentence/document and all its constituent words, we put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential?

Answer 1: Suppose the size of the vectors is N (usually
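No global 0/1 vocabulary is needed for this approach: the usual trick is to collapse each document's word vectors into one fixed-length vector, most simply by averaging. A minimal sketch, assuming `w2v` is a trained gensim Word2Vec model (names here are illustrative):

    import numpy as np

    def doc_vector(tokens, w2v, dim):
        # Average the vectors of in-vocabulary words; zero vector if none match.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # X = np.vstack([doc_vector(doc, w2v, w2v.vector_size) for doc in tokenized_docs])
    # X is (n_documents, N) and can be fed to any standard classifier.

Every document ends up with the same dimensionality N regardless of corpus vocabulary, which is exactly what makes the global vocabulary unnecessary.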

Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?

拈花ヽ惹草 submitted on 2020-01-01 03:33:11
Question: I am doing multi-label classification, trying to predict the correct tags for questions (X = questions, y = list of tags for each question from X). I am wondering which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier. From the docs we can read that decision_function_shape can have two values, 'ovo' and 'ovr':

    decision_function_shape : 'ovo', 'ovr' or None, default=None
    Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n
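For reference, a small sketch of the interaction: OneVsRestClassifier fits one binary SVC per label, so each inner SVC only ever sees a two-class problem and decision_function_shape changes little; 'ovr' keeps the output shape consistent. The data below is hypothetical:

    # Sketch: OneVsRestClassifier trains one binary SVC per label column.
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    y = MultiLabelBinarizer().fit_transform([["a"], ["b"], ["a", "b"], []])

    clf = OneVsRestClassifier(SVC(kernel="linear", decision_function_shape="ovr"))
    clf.fit(X, y)
    print(clf.decision_function(X).shape)  # (n_samples, n_classes)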

inconsistent shape error MultiLabelBinarizer on y_test, sklearn multi-label classification

风格不统一 submitted on 2019-12-31 03:58:28
Question:

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn import
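The usual cause of this shape error is fitting MultiLabelBinarizer separately on y_train and y_test, which can yield different numbers of label columns. A sketch of the standard fix, with hypothetical data: fit the binarizer once on all labels, then split:

    # Fit MultiLabelBinarizer on the full label set so train and test share columns.
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.model_selection import train_test_split

    X = ["doc one", "doc two", "doc three", "doc four"]
    y = [["a"], ["b"], ["a", "c"], ["c"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y)  # one fixed column per label: a, b, c
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.5, random_state=0)
    # Y_train and Y_test now have identical widths, so predictions align with y_test.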

How to assign a new observation to existing K-means clusters based on nearest-cluster-centroid logic in Python?

时光怂恿深爱的人放手 submitted on 2019-12-30 11:17:08
Question: I used the code below to create k-means clusters using scikit-learn:

    kmean = KMeans(n_clusters=nclusters, n_jobs=-1, random_state=2376,
                   max_iter=1000, n_init=1000, algorithm='full', init='k-means++')
    kmean_fit = kmean.fit(clus_data)

I also saved the centroids using kmean_fit.cluster_centers_. I then pickled the k-means object:

    filename = pickle_path + '\\' + '_kmean_fit.sav'
    pickle.dump(kmean_fit, open(filename, 'wb'))

so that I can load the same k-means pickle object and apply it to new data when
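A fitted KMeans object already implements the nearest-centroid rule via predict, so the unpickled model can label new rows directly. A minimal sketch, assuming `filename` and `new_data` match the asker's setup:

    import pickle
    import numpy as np

    with open(filename, 'rb') as f:       # filename as saved above
        kmean_fit = pickle.load(f)

    labels = kmean_fit.predict(new_data)  # assigns each row to the closest
                                          # row of kmean_fit.cluster_centers_

    # Equivalent by hand for a single observation x:
    # np.argmin(np.linalg.norm(kmean_fit.cluster_centers_ - x, axis=1))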

unable to use FeatureUnion in scikit-learn due to different dimensions

放肆的年华 submitted on 2019-12-23 07:04:04
Question: I'm trying to use FeatureUnion to extract different features from a data structure, but it fails due to different dimensions:

    ValueError: blocks[0,:] has incompatible row dimensions

Implementation: My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer is used to get different parts of
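The "incompatible row dimensions" error means the branches returned matrices with different numbers of rows: FeatureUnion horizontally stacks the blocks, so every branch must emit exactly one row per input sample. GetItemTransformer's code isn't shown, so below is an assumed stand-in selector that preserves the row count:

    # Stand-in selector (hypothetical): returns one value per sample so every
    # FeatureUnion branch keeps the same number of rows.
    from sklearn.base import BaseEstimator, TransformerMixin

    class ItemSelector(BaseEstimator, TransformerMixin):
        def __init__(self, key):
            self.key = key

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # e.g. a column from a dict-of-lists or a DataFrame
            return X[self.key]

If a branch filters, reshapes, or drops samples, the row counts diverge and FeatureUnion raises exactly this ValueError.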

text classifier with bag of words and additional sentiment feature in sklearn

做~自己de王妃 submitted on 2019-12-22 10:44:38
Question: I am trying to build a classifier that, in addition to bag of words, uses features like the sentiment or a topic (an LDA result). I have a pandas DataFrame with the text and the label, and would like to add a sentiment value (numerical, between -5 and 5) and the result of the LDA analysis (a string with the topic of the sentence). I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB.

    df = pd.DataFrame.from_records(data
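One common way to combine the CountVectorizer output with an extra numeric feature is to stack them into one sparse matrix. A sketch, where the `sentiment` column is an assumed addition to the asker's DataFrame df; note the shift, since MultinomialNB requires non-negative features:

    # Sketch: append a numeric sentiment column to the bag-of-words counts.
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    counts = CountVectorizer().fit_transform(df["text"])
    sentiment = df[["sentiment"]].to_numpy() + 5  # shift -5..5 to 0..10 so all
                                                  # features stay non-negative

    X = hstack([counts, csr_matrix(sentiment)]).tocsr()
    clf = MultinomialNB().fit(X, df["label"])

The string-valued LDA topic would need its own encoding step (e.g. one-hot) before it can be stacked the same way.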

SMOTE oversampling and cross-validation

泪湿孤枕 submitted on 2019-12-22 06:28:47
Question: I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories, and then performed 10-fold cross-validation on the newly obtained data. I found (overly?) optimistic results, with F1 around 90%. Is this due to oversampling? Is it bad practice to perform cross-validation
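Yes, this is very likely an effect of the oversampling: applying SMOTE before splitting lets synthetic points share information with the fold their source samples land in, so the cross-validation estimate is inflated. The standard remedy is to oversample inside each training fold only. A sketch using imbalanced-learn in Python (X and y assumed), shown here as the scripted counterpart to the Weka workflow:

    # SMOTE inside each CV training fold only; held-out folds stay untouched.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1")
    # imblearn's Pipeline applies SMOTE only at fit time, i.e. only to the
    # training portion of each fold, so the F1 estimate is not leak-inflated.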

Python text processing: AttributeError: 'list' object has no attribute 'lower'

我只是一个虾纸丫 submitted on 2019-12-22 03:57:24
Question: I am new to Python and to Stack Overflow (please be gentle) and am trying to learn how to do a sentiment analysis. I am using a combination of code I found in a tutorial and here: Python - AttributeError: 'list' object has no attribute. However, I keep getting:

    Traceback (most recent call last):
      File "C:/Python27/training", line 111, in <module>
        processedTestTweet = processTweet(row)
      File "C:/Python27/training", line 19, in processTweet
        tweet = tweet.lower()
    AttributeError: 'list' object has no
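The traceback shows that `row` is a list (e.g. a row from csv.reader), so tweet.lower() fails inside processTweet because lists have no lower method. Two common fixes, sketched with a stand-in row:

    row = ["I", "LOVE", "this"]        # stand-in for the asker's csv row

    tweet = " ".join(row).lower()      # 1) treat the whole row as one string
    tokens = [w.lower() for w in row]  # 2) lowercase each element separately

Which fix is right depends on whether processTweet expects a single tweet string or a list of tokens.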