text-classification

Concept Behind The Transformed Data Of LDA Model

◇◆丶佛笑我妖孽 submitted on 2020-01-05 03:36:18
Question: My question is related to Latent Dirichlet Allocation. Suppose we apply LDA to our dataset and then call fit_transform on it. The output is a matrix with one row per document (five documents here), and each row gives that document's distribution over three topics. The output is below:

    [[ 0.0922935   0.09218227  0.81552423]
     [ 0.81396651  0.09409428  0.09193921]
     [ 0.05265482  0.05240119  0.89494398]
     [ 0.05278187  0.89455775  0.05266038]
     [ 0.85209554  0.07338382  0.07452064]]

So, this is the matrix that will be sent to a classification
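For context, a minimal sketch of how such a document-topic matrix is produced and handed to a downstream classifier in scikit-learn; the data and names here are illustrative, not the asker's:

    # Illustrative sketch: LDA topic proportions used as classifier features.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    docs = ["apple banana fruit", "dog cat pet", "fruit salad apple",
            "pet food dog", "banana smoothie"]
    labels = [0, 1, 0, 1, 0]  # hypothetical class labels for the five documents

    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    X_topics = lda.fit_transform(counts)  # shape (5, 3): one row per document,
                                          # each row a distribution over 3 topics

    clf = LogisticRegression().fit(X_topics, labels)  # topic proportions as features

Each row sums to 1, which is why one value per row dominates when a document is strongly associated with a single topic.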

Naive Bayes in Quanteda vs caret: wildly different results

眉间皱痕 submitted on 2020-01-01 12:23:31
Question: I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right. Here is some code for reproduction. First, on the quanteda side:

    library(quanteda)
    library(quanteda.corpora)
    library(caret)
    corp <- data_corpus_movies
    set.seed(300)
    id_train <- sample(docnames(corp), size = 1500, replace = FALSE)
    # get
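A frequent cause of gaps like this is that the two packages fit different naive Bayes variants: quanteda's textmodel_nb is multinomial by default, while caret's nb method typically models each feature's distribution separately. As a language-neutral illustration (sketched in Python with hypothetical data), the variant alone can flip a prediction on identical counts:

    # Illustration: the same document-term counts scored by two Naive Bayes
    # variants can disagree, one common reason two packages' results differ.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    train = ["good great fun", "bad awful boring", "great movie", "boring plot"]
    y = ["pos", "neg", "pos", "neg"]
    test = ["fun fun fun boring"]

    vec = CountVectorizer()
    X = vec.fit_transform(train)
    X_test = vec.transform(test)

    print(MultinomialNB().fit(X, y).predict(X_test))  # uses term frequencies
    print(BernoulliNB().fit(X, y).predict(X_test))    # uses presence/absence only

So before debugging the caret call itself, it is worth checking that both sides are fitting the same model family on the same representation.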

How to use vector representations of words (as obtained from Word2Vec, etc.) as features for a classifier?

我们两清 submitted on 2020-01-01 04:11:45
Question: I am familiar with using BOW (bag-of-words) features for text classification: we first find the size of the vocabulary of the corpus, which becomes the size of our feature vector, and then, for each sentence/document and all its constituent words, we put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential?

Answer 1: Suppose the size of the vectors is N (usually
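No global 0/1 vocabulary is needed for this approach: the usual trick is to collapse each document's word vectors into one fixed-length vector, most simply by averaging. A minimal sketch, assuming `w2v` is a trained gensim Word2Vec model (names here are illustrative):

    import numpy as np

    def doc_vector(tokens, w2v, dim):
        # Average the vectors of in-vocabulary words; zero vector if none match.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    # X = np.vstack([doc_vector(doc, w2v, w2v.vector_size) for doc in tokenized_docs])
    # X is (n_documents, N) and can be fed to any standard classifier.

Every document ends up with the same dimensionality N regardless of corpus vocabulary, which is exactly what makes the global vocabulary unnecessary.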

Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?

拈花ヽ惹草 submitted on 2020-01-01 03:33:11
Question: I am doing multi-label classification, trying to predict the correct tags for questions (X = questions, y = list of tags for each question from X). I am wondering which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier. From the docs we can read that decision_function_shape can have two values, 'ovo' and 'ovr':

    decision_function_shape : 'ovo', 'ovr' or None, default=None
    Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n
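For reference, a small sketch of the interaction: OneVsRestClassifier fits one binary SVC per label, so each inner SVC only ever sees a two-class problem and decision_function_shape changes little; 'ovr' keeps the output shape consistent. The data below is hypothetical:

    # Sketch: OneVsRestClassifier trains one binary SVC per label column.
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    y = MultiLabelBinarizer().fit_transform([["a"], ["b"], ["a", "b"], []])

    clf = OneVsRestClassifier(SVC(kernel="linear", decision_function_shape="ovr"))
    clf.fit(X, y)
    print(clf.decision_function(X).shape)  # (n_samples, n_classes)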

inconsistent shape error MultiLabelBinarizer on y_test, sklearn multi-label classification

风格不统一 submitted on 2019-12-31 03:58:28
Question:

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn import
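The usual cause of this shape error is fitting MultiLabelBinarizer separately on y_train and y_test, which can yield different numbers of label columns. A sketch of the standard fix, with hypothetical data: fit the binarizer once on all labels, then split:

    # Fit MultiLabelBinarizer on the full label set so train and test share columns.
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.model_selection import train_test_split

    X = ["doc one", "doc two", "doc three", "doc four"]
    y = [["a"], ["b"], ["a", "c"], ["c"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(y)  # one fixed column per label: a, b, c
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.5, random_state=0)
    # Y_train and Y_test now have identical widths, so predictions align with y_test.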

How to assign a new observation to existing K-means clusters based on nearest-cluster-centroid logic in Python?

时光怂恿深爱的人放手 submitted on 2019-12-30 11:17:08
Question: I used the code below to create k-means clusters using scikit-learn:

    kmean = KMeans(n_clusters=nclusters, n_jobs=-1, random_state=2376,
                   max_iter=1000, n_init=1000, algorithm='full', init='k-means++')
    kmean_fit = kmean.fit(clus_data)

I also saved the centroids using kmean_fit.cluster_centers_. I then pickled the k-means object:

    filename = pickle_path + '\\' + '_kmean_fit.sav'
    pickle.dump(kmean_fit, open(filename, 'wb'))

so that I can load the same k-means pickle object and apply it to new data when
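A fitted KMeans object already implements the nearest-centroid rule via predict, so the unpickled model can label new rows directly. A minimal sketch, assuming `filename` and `new_data` match the asker's setup:

    import pickle
    import numpy as np

    with open(filename, 'rb') as f:       # filename as saved above
        kmean_fit = pickle.load(f)

    labels = kmean_fit.predict(new_data)  # assigns each row to the closest
                                          # row of kmean_fit.cluster_centers_

    # Equivalent by hand for a single observation x:
    # np.argmin(np.linalg.norm(kmean_fit.cluster_centers_ - x, axis=1))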

unable to use FeatureUnion in scikit-learn due to different dimensions

放肆的年华 submitted on 2019-12-23 07:04:04
Question: I'm trying to use FeatureUnion to extract different features from a data structure, but it fails due to different dimensions:

    ValueError: blocks[0,:] has incompatible row dimensions

Implementation: My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer is used to get different parts of
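The "incompatible row dimensions" error means the branches returned matrices with different numbers of rows: FeatureUnion horizontally stacks the blocks, so every branch must emit exactly one row per input sample. GetItemTransformer's code isn't shown, so below is an assumed stand-in selector that preserves the row count:

    # Stand-in selector (hypothetical): returns one value per sample so every
    # FeatureUnion branch keeps the same number of rows.
    from sklearn.base import BaseEstimator, TransformerMixin

    class ItemSelector(BaseEstimator, TransformerMixin):
        def __init__(self, key):
            self.key = key

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # e.g. a column from a dict-of-lists or a DataFrame
            return X[self.key]

If a branch filters, reshapes, or drops samples, the row counts diverge and FeatureUnion raises exactly this ValueError.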

text classifier with bag of words and additional sentiment feature in sklearn

做~自己de王妃 submitted on 2019-12-22 10:44:38
Question: I am trying to build a classifier that, in addition to bag of words, uses features like the sentiment or a topic (an LDA result). I have a pandas DataFrame with the text and the label, and would like to add a sentiment value (numerical, between -5 and 5) and the result of the LDA analysis (a string with the topic of the sentence). I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB.

    df = pd.DataFrame.from_records(data
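One common way to combine the CountVectorizer output with an extra numeric feature is to stack them into one sparse matrix. A sketch, where the `sentiment` column is an assumed addition to the asker's DataFrame df; note the shift, since MultinomialNB requires non-negative features:

    # Sketch: append a numeric sentiment column to the bag-of-words counts.
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    counts = CountVectorizer().fit_transform(df["text"])
    sentiment = df[["sentiment"]].to_numpy() + 5  # shift -5..5 to 0..10 so all
                                                  # features stay non-negative

    X = hstack([counts, csr_matrix(sentiment)]).tocsr()
    clf = MultinomialNB().fit(X, df["label"])

The string-valued LDA topic would need its own encoding step (e.g. one-hot) before it can be stacked the same way.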

SMOTE oversampling and cross-validation

泪湿孤枕 submitted on 2019-12-22 06:28:47
Question: I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories, and then performed 10-fold cross-validation on the newly obtained data. I found (overly?) optimistic results, with F1 around 90%. Is this due to oversampling? Is it bad practice to perform cross-validation
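Yes, this is very likely an effect of the oversampling: applying SMOTE before splitting lets synthetic points share information with the fold their source samples land in, so the cross-validation estimate is inflated. The standard remedy is to oversample inside each training fold only. A sketch using imbalanced-learn in Python (X and y assumed), shown here as the scripted counterpart to the Weka workflow:

    # SMOTE inside each CV training fold only; held-out folds stay untouched.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1")
    # imblearn's Pipeline applies SMOTE only at fit time, i.e. only to the
    # training portion of each fold, so the F1 estimate is not leak-inflated.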

Python text processing: AttributeError: 'list' object has no attribute 'lower'

我只是一个虾纸丫 submitted on 2019-12-22 03:57:24
Question: I am new to Python and to Stack Overflow (please be gentle) and am trying to learn how to do a sentiment analysis. I am using a combination of code I found in a tutorial and here: Python - AttributeError: 'list' object has no attribute. However, I keep getting:

    Traceback (most recent call last):
      File "C:/Python27/training", line 111, in <module>
        processedTestTweet = processTweet(row)
      File "C:/Python27/training", line 19, in processTweet
        tweet = tweet.lower()
    AttributeError: 'list' object has no
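The traceback shows that `row` is a list (e.g. a row from csv.reader), so tweet.lower() fails inside processTweet because lists have no lower method. Two common fixes, sketched with a stand-in row:

    row = ["I", "LOVE", "this"]        # stand-in for the asker's csv row

    tweet = " ".join(row).lower()      # 1) treat the whole row as one string
    tokens = [w.lower() for w in row]  # 2) lowercase each element separately

Which fix is right depends on whether processTweet expects a single tweet string or a list of tokens.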