text-classification

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

给你一囗甜甜゛ submitted on 2020-06-24 12:21:19
Question: I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology - based on their interest area. I have extracted the last 100 tweets of each user in Python and created a corpus for each user after cleaning the tweets. As mentioned here, Tweet classification into multiple categories on (Unsupervised data/tweets): I am trying to generate dictionaries of common words under each category so
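
A minimal sketch of the dictionary-scoring idea the question is driving at, assuming hand-seeded keyword sets; the category seeds, the sample text, and the `categorize` helper below are all hypothetical:

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# one-time setup: nltk.download("punkt"); nltk.download("stopwords")

# Hypothetical seed dictionaries -- in practice these would be grown from
# example tweets per category, e.g. by taking the top words of an
# nltk.FreqDist over each category's corpus.
category_words = {
    "Sports": {"game", "team", "match", "league", "score"},
    "Politics": {"election", "vote", "senate", "policy", "campaign"},
    "Technology": {"software", "startup", "code", "cloud", "ai"},
    # ... Education, Art, Business, Automobiles seeded the same way
}

stop = set(stopwords.words("english"))

def categorize(user_corpus):
    """Count dictionary hits in a user's cleaned tweets, return the best category."""
    tokens = [t.lower() for t in word_tokenize(user_corpus) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop]
    scores = {cat: sum(t in words for t in tokens)
              for cat, words in category_words.items()}
    return max(scores, key=scores.get)

print(categorize("Great match today, the team really earned that win"))  # -> Sports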

Spacy TextCat Score in MultiLabel Classification

≡放荡痞女 submitted on 2020-06-17 09:39:10
Question: In spaCy's text classification train_textcat example, there are two labels specified, Positive and Negative. Hence the cats score is represented as cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]. I am working with multilabel classification, which means I have more than two labels to tag in one text. I have added my labels as textcat.add_label("CONSTRUCTION") and to specify the cats score I have used cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
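
For the multilabel case, each training example's cats dict should carry one boolean per label instead of the two-way POSITIVE/NEGATIVE split. A sketch against the spaCy v2 API that the train_textcat example uses; the label set and training data are made up:

```python
import spacy

nlp = spacy.blank("en")
# exclusive_classes=False makes the textcat multilabel (spaCy v2 API,
# matching the train_textcat example the question refers to)
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": False})
nlp.add_pipe(textcat)

for label in ("CONSTRUCTION", "FINANCE", "LEGAL"):  # hypothetical label set
    textcat.add_label(label)

# One boolean per label for every text, not POSITIVE/NEGATIVE pairs
train_data = [
    ("New tower permits approved downtown",
     {"cats": {"CONSTRUCTION": True, "FINANCE": False, "LEGAL": True}}),
    ("Quarterly earnings beat estimates",
     {"cats": {"CONSTRUCTION": False, "FINANCE": True, "LEGAL": False}}),
]

optimizer = nlp.begin_training()
for text, annotations in train_data:
    nlp.update([text], [annotations], sgd=optimizer)
```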

How to use Hugging Face Transformers library in Tensorflow for text classification on custom data?

谁都会走 submitted on 2020-05-29 03:28:32
Question: I am trying to do binary text classification on custom data (in CSV format) using the different transformer architectures that the Hugging Face 'Transformers' library offers. I am using this TensorFlow blog post as a reference. I am loading the custom dataset into 'tf.data.Dataset' format using the following code:

def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        na_value="",
        num
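
A condensed sketch of the step after loading, skipping the make_csv_dataset plumbing and assuming a transformers version (3.x or later) in which the tokenizer is directly callable; the toy texts and labels stand in for the CSV rows:

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(texts, labels):
    # Tokenize to input_ids / attention_mask tensors and pair with labels
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="tf")
    return tf.data.Dataset.from_tensor_slices((dict(enc), labels))

texts = ["great product", "terrible service"]  # stand-in for the CSV rows
labels = [1, 0]
ds = encode(texts, labels).batch(2)

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(ds, epochs=1)
```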

Feature hashing in R for Text classification

佐手、 submitted on 2020-02-24 09:14:49
Question: I'm trying to implement feature hashing in R to help me with a text classification problem, but I'm not sure if I'm doing it the way it should be done. Part of my code is based on this post: Hashing function for mapping integers to a given range?. My code:

random.data = function(n = 200, wlen = 40, ncol = 10){
  random.word = function(n){
    paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
  }
  matrix(replicate(n, random.word(wlen)), ncol = ncol)
}
feature_hash = function(doc, N){
  doc = as.matrix
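
The question is in R, but the hashing trick itself is language-agnostic; here is a minimal Python reference implementation to sanity-check a hand-rolled version against:

```python
def feature_hash(tokens, N):
    """The hashing trick: map each token into one of N buckets and count hits.
    Note: Python's hash() of strings is salted per process, so set
    PYTHONHASHSEED (or use hashlib) for results that are stable across runs."""
    vec = [0] * N
    for tok in tokens:
        vec[hash(tok) % N] += 1
    return vec

print(feature_hash("the quick brown fox jumps over the lazy dog".split(), N=8))
```

scikit-learn's HashingVectorizer implements the same idea with a hash that is stable across runs, and on the R side the FeatureHashing and text2vec packages offer equivalents.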

Text Classification (Spacy) in place of Gensim

佐手、 submitted on 2020-01-25 06:47:09
Question: Hello, I am using the gensim library for semantic text similarity classification, but I fail to load the gensim data file, and the program takes too much time to execute when running cells in a Jupyter notebook. So my question is: can we use the spaCy library to overcome this type of error, and can we find the similarity between two document files? I have seen TF-IDF used for semantic similarity. Here is the error: MemoryError: Unable to allocate 3.35 GiB for an array with shape (3000000, 300)
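
spaCy can compute document similarity without loading a 3-million-word word2vec matrix into memory; a sketch assuming the medium English model (the file names are hypothetical):

```python
import spacy

# The md/lg English models ship with word vectors (the sm model does not);
# one-time setup: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

text1 = open("doc1.txt").read()   # hypothetical document files
text2 = open("doc2.txt").read()

doc1, doc2 = nlp(text1), nlp(text2)
# Doc.similarity returns the cosine similarity of the averaged word vectors
print(doc1.similarity(doc2))
```

If staying with gensim, KeyedVectors.load_word2vec_format(path, limit=500000) loads only the first 500k vectors, which avoids the 3.35 GiB allocation at the cost of a smaller vocabulary.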

R: problems applying LIME to quanteda text model

时光毁灭记忆、已成空白 submitted on 2020-01-14 02:43:40
Question: It's a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off the Trump & Clinton tweets data. I run it following an example given by Thomas Pedersen in his Understanding LIME and a useful SO answer provided by @Weihuang Wong:

library(dplyr)
library(stringr)
library(quanteda)
library(lime)

# data prep
tweet_csv <- read_csv("tweets.csv")

# creating corpus and dfm for train and test sets
get_matrix <- function(df){
  corpus <- quanteda::corpus(df)
  dfm
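
The fix here is specific to the R lime package, but the explainer-plus-probability-function contract is the same in every port; a hypothetical Python sketch of that workflow, with stand-in tweets and an sklearn pipeline in place of the quanteda model:

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the Trump & Clinton tweet data
texts = ["make america great again", "stronger together",
         "build the wall", "love trumps hate"]
labels = [0, 1, 0, 1]  # 0 = Trump, 1 = Clinton

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# LIME only needs a raw-text -> class-probability function; the R lime
# package expects the same contract via its model_type/predict_model methods
explainer = LimeTextExplainer(class_names=["Trump", "Clinton"])
exp = explainer.explain_instance("make america stronger together",
                                 model.predict_proba, num_features=3)
print(exp.as_list())
```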

How to do text classification using TensorFlow?

喜你入骨 submitted on 2020-01-11 11:57:09
Question: I am new to TensorFlow and machine learning. I am facing issues writing TensorFlow code that does text classification similar to what I tried using scikit-learn. My main problems are vectorising the dataset and providing the input to the TensorFlow layers. I do remember being successful in one-hot encoding the labels, but the TensorFlow layer ahead did not accept the created array. Please note, I have read the majority of answered text-classification questions on Stack Overflow
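
One common fix for "the layer did not accept the array" is to feed integer token sequences into an Embedding layer rather than a one-hot matrix; a minimal Keras sketch with made-up toy data:

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy data standing in for the real dataset
texts = ["cheap deals click now", "meeting moved to friday",
         "win a free prize", "lunch at noon?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Vectorise: words -> integer ids, padded to a fixed length
tok = Tokenizer(num_words=1000, oov_token="<unk>")
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts), maxlen=10)
y = tf.convert_to_tensor(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16, input_length=10),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(pad_sequences(tok.texts_to_sequences(["free prize now"]), maxlen=10)))
```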

Save progress between multiple instances of partial_fit in Python SGDClassifier

强颜欢笑 submitted on 2020-01-07 02:54:36
Question: I've successfully followed this example for my own text classification script. The problem is that I'm not looking to process pieces of a huge but existing data set in a loop of partial_fit calls, like they do in the example. I want to be able to add data as it becomes available, even if I shut down my Python script in the meantime. Ideally I'd like to do something like this:

sometime in 2015:
model2015 = partial_fit(dataset2015)
save_to_file(model2015)
shut down my Python script
sometime in 2016:
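
This pattern works out of the box with scikit-learn: a fitted estimator can be serialized between sessions and partial_fit resumed later. A sketch with hypothetical batches; the file name and the choice of HashingVectorizer are illustrative:

```python
import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so the same transform works in every session
vec = HashingVectorizer(n_features=2**18)

# --- 2015 session (hypothetical batch) ---
clf = SGDClassifier()
X, y = vec.transform(["buy cheap pills now"]), [1]
# classes= must be given on the first partial_fit call and list every label
clf.partial_fit(X, y, classes=[0, 1])
joblib.dump(clf, "model.joblib")  # persist the fitted weights to disk

# --- 2016 session, possibly a new process ---
clf = joblib.load("model.joblib")
X, y = vec.transform(["lunch at noon?"]), [0]
clf.partial_fit(X, y)  # resumes exactly where the 2015 run left off
joblib.dump(clf, "model.joblib")
```

The stateless vectorizer matters: a fitted CountVectorizer or TfidfVectorizer would also have to be saved and reloaded, and could not grow its vocabulary across sessions.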