tidytext

Converting data frame to tibble with word count

坚强是说给别人听的谎言 submitted on 2019-12-11 05:52:18
Question: I'm attempting to perform sentiment analysis based on http://tidytextmining.com/sentiment.html#the-sentiments-dataset. Before performing sentiment analysis, I need to convert my dataset into a tidy format. My dataset has this form:

    x <- c("test1", "test2")
    y <- c("this is test text1", "this is test text2")
    res <- data.frame("url" = x, "text" = y)
    res
        url               text
    1 test1 this is test text1
    2 test2 this is test text2

In order to convert to one observation per row require to process text
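A minimal sketch of one way to reach the one-token-per-row layout the question describes, assuming the dplyr and tidytext packages are installed. The names `tidy_res` and `word_counts` are illustrative and do not come from the original post.

```r
library(dplyr)
library(tidytext)

# Rebuild the example data from the question; keep text as character, not factor
res <- data.frame(
  url = c("test1", "test2"),
  text = c("this is test text1", "this is test text2"),
  stringsAsFactors = FALSE
)

# One row per (url, word) pair
tidy_res <- res %>%
  as_tibble() %>%
  unnest_tokens(word, text)

# Word counts per url
word_counts <- tidy_res %>%
  count(url, word, sort = TRUE)
```

`unnest_tokens()` lowercases the tokens and drops punctuation by default, so the counts are over normalized words rather than the raw strings.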

Does tidytext::unnest_tokens work with Spanish characters?

瘦欲@ submitted on 2019-12-08 00:20:48
Question: I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but breaks the special characters with bigrams. The code works fine on Linux. I added some info on the locale.

    library(tidytext)
    library(dplyr)

    df <- data_frame(text = "César Moreira Nuñez")

    # works ok:
    df %>% unnest_tokens(word, text)
    # # A tibble: 3 x 1
    #   word
    #   <chr>
    # 1 césar
    # 2 moreira
    # 3 nuñez

    # breaks é and ñ
    df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
    # # A tibble: 2 x 1
    #   bigram
    #

Opposite of unnest_tokens

☆樱花仙子☆ submitted on 2019-12-07 01:26:42
Question: This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search. I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format. What's the opposite / inverse command of unnest_tokens? Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge
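One common way to approximate the inverse is to group by a document identifier and paste the remaining words back together. The sketch below assumes dplyr and tidytext are installed; the data, the `id` column, and the two-word stop list `my_stops` are hypothetical stand-ins, since the question's own data is not shown here.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data; `id` identifies the original document
docs <- tibble(
  id = c(1, 2),
  text = c("the quick brown fox", "a lazy dog sleeps")
)

# Hypothetical two-word stop list, used instead of tidytext::stop_words
# so the result is easy to follow
my_stops <- tibble(word = c("the", "a"))

# Tidy the text and drop the stop words
tidy_docs <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(my_stops, by = "word")

# Collapse the remaining words back into one string per document
untidied <- tidy_docs %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
```

This is not a true inverse: the original casing and punctuation are lost during tokenization, so only the word sequence is recovered.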

Does tidytext::unnest_tokens work with Spanish characters?

断了今生、忘了曾经 submitted on 2019-12-06 13:06:24
I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but breaks the special characters with bigrams. The code works fine on Linux. I added some info on the locale.

    library(tidytext)
    library(dplyr)

    df <- data_frame(text = "César Moreira Nuñez")

    # works ok:
    df %>% unnest_tokens(word, text)
    # # A tibble: 3 x 1
    #   word
    #   <chr>
    # 1 césar
    # 2 moreira
    # 3 nuñez

    # breaks é and ñ
    df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
    # # A tibble: 2 x 1
    #   bigram
    #   <chr>
    # 1 cã©sar moreira
    # 2 moreira nuã±ez

    > Sys.getlocale()
    [1] "LC_COLLATE=English_United States.1252

Using tidytext and broom but not finding tidier for LDA_VEM

梦想的初衷 submitted on 2019-12-04 04:52:24
Question: The tidytext book has examples with a tidier for topicmodels:

    library(tidyverse)
    library(tidytext)
    library(topicmodels)
    library(broom)

    year_word_counts <- tibble(year = c("2007", "2008", "2009"),
                               word = c("dog", "cat", "chicken"),
                               n = c(1753L, 1157L, 1057L))

    animal_dtm <- cast_dtm(data = year_word_counts, document = year, term = word, value = n)
    animal_lda <- LDA(animal_dtm, k = 5, control = list(seed = 1234))
    animal_lda <- tidy(animal_lda, matrix = "beta")

    # Console output
    Error in as

Graph with ordered bars and using facets

北城余情 submitted on 2019-12-02 05:34:46
Question: I am trying to make a graph with bars ordered by frequency, and also using a variable to separate two variables using facets. Words have to be ordered by the value given in the 'n' variable. So, my graph should look like this one, which appears in the tidytext book: In my graph below, the words are not ordered by value. What is my mistake? My data looks like the one in the example:

    > d
    # A tibble: 20 x 3
       word    u_c           n
       <chr>   <chr>     <dbl>
     1 apples  candidate 0.567
     2 apples  user      0.274
     3 melon   user      0.191
     4

Using tidytext and broom but not finding tidier for LDA_VEM

空扰寡人 submitted on 2019-12-02 02:17:38
The tidytext book has examples with a tidier for topicmodels:

    library(tidyverse)
    library(tidytext)
    library(topicmodels)
    library(broom)

    year_word_counts <- tibble(year = c("2007", "2008", "2009"),
                               word = c("dog", "cat", "chicken"),
                               n = c(1753L, 1157L, 1057L))

    animal_dtm <- cast_dtm(data = year_word_counts, document = year, term = word, value = n)
    animal_lda <- LDA(animal_dtm, k = 5, control = list(seed = 1234))
    animal_lda <- tidy(animal_lda, matrix = "beta")

    # Console output
    Error in as.data.frame.default(x) : cannot coerce class "structure("LDA_VEM", package = "topicmodels")" to a data

Graph with ordered bars and using facets

情到浓时终转凉″ submitted on 2019-12-02 00:49:21
I am trying to make a graph with bars ordered by frequency, and also using a variable to separate two variables using facets. Words have to be ordered by the value given in the 'n' variable. So, my graph should look like this one, which appears in the tidytext book: In my graph below, the words are not ordered by value. What is my mistake? My data looks like the one in the example:

    > d
    # A tibble: 20 x 3
       word    u_c           n
       <chr>   <chr>     <dbl>
     1 apples  candidate 0.567
     2 apples  user      0.274
     3 melon   user      0.191
     4 curcuma candidate 0.105
     5 banana  user      0.0914
     6 kiwi    candidate 0.0565
    ...

Following the code provided in
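For ordering bars independently inside each facet, one option is the `reorder_within()` and `scale_x_reordered()` helpers that ship with reasonably recent versions of tidytext. The sketch below is an assumption about what the asker wants, not the code from the original post; it uses only the six rows of `d` visible in the question, and the names `ordered_d` and `p` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# First six rows shown in the question
d <- tibble(
  word = c("apples", "apples", "melon", "curcuma", "banana", "kiwi"),
  u_c  = c("candidate", "user", "user", "candidate", "user", "candidate"),
  n    = c(0.567, 0.274, 0.191, 0.105, 0.0914, 0.0565)
)

# Reorder words by n separately within each u_c group
ordered_d <- d %>%
  mutate(word = reorder_within(word, n, u_c))

p <- ggplot(ordered_d, aes(word, n)) +
  geom_col() +
  scale_x_reordered() +                   # strips the group suffix from axis labels
  facet_wrap(~ u_c, scales = "free_y") +  # each facet gets its own word axis
  coord_flip()
```

A plain `reorder(word, n)` orders by one global summary per word, which is why a word that appears in both facets (such as "apples") can end up out of place in one of them.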

Web scraping pdf files from HTML

◇◆丶佛笑我妖孽 submitted on 2019-12-01 11:29:20
Question: How can I scrape the PDF documents from HTML? I am using R and can only extract the text from HTML. An example of the website that I am going to scrape is as follows. https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx Regards

Answer 1: When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

    library(XML)
    library(RCurl)
    url <- "https://www.bot.or