tidytext | 易学教程

Converting data frame to tibble with word count

Question: I'm attempting to perform sentiment analysis based on http://tidytextmining.com/sentiment.html#the-sentiments-dataset . Prior to performing sentiment analysis I need to convert my dataset into a tidy format. My dataset is of the form:

    x <- c("test1", "test2")
    y <- c("this is test text1", "this is test text2")
    res <- data.frame("url" = x, "text" = y)
    res
        url               text
    1 test1 this is test text1
    2 test2 this is test text2

In order to convert to one observation per row require to process text
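
One possible way to get from this data frame to a one-word-per-row tibble with word counts is sketched below. It assumes the dplyr and tidytext packages are installed; `word_counts` is a name introduced here for illustration and does not come from the question.

```r
library(dplyr)
library(tidytext)

# Data from the question (strings kept as character, not factor)
x <- c("test1", "test2")
y <- c("this is test text1", "this is test text2")
res <- data.frame(url = x, text = y, stringsAsFactors = FALSE)

# One row per (url, word) pair, with the number of occurrences in column n
word_counts <- res %>%
  as_tibble() %>%
  unnest_tokens(word, text) %>%
  count(url, word, sort = TRUE)

word_counts
```

`unnest_tokens()` lowercases and splits the `text` column into words, and `count()` then tallies each word per `url`.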

Opposite of unnest_tokens

Question: This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search. I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format. What's the opposite / inverse command of unnest_tokens? Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge
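
As far as I know there is no single inverse function, but a common pattern is to group by the document identifier and paste the words back together. The sketch below assumes dplyr (1.0 or later) and tidytext; the data, the tiny `custom_stop_words` vector, and the names `tidy_words` and `untidied` are hypothetical and only serve the illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one document per row
docs <- tibble(
  url  = c("test1", "test2"),
  text = c("this is test text1", "this is test text2")
)

# Hypothetical stop word list, kept tiny so the result is easy to follow
custom_stop_words <- c("this", "is")

# Tidy format: one word per row, stop words removed
tidy_words <- docs %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% custom_stop_words)

# "Untidy": collapse the remaining words back into one string per document
untidied <- tidy_words %>%
  group_by(url) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")

untidied
```

Note that this does not restore the original text exactly: case, punctuation, and the removed stop words are gone after tokenizing.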

Does tidytext::unnest_tokens works with spanish characters?

Question: I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but breaks the special characters with bigrams. The code works fine on Linux. I added some info on the locale.

    library(tidytext)
    library(dplyr)

    df <- data_frame(text = "César Moreira Nuñez")

    # works ok:
    df %>% unnest_tokens(word, text)
    # # A tibble: 3 x 1
    #   word
    #   <chr>
    # 1 césar
    # 2 moreira
    # 3 nuñez

    # breaks é and ñ
    df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
    # # A tibble: 2 x 1
    #   bigram
    #   <chr>
    # 1 cã©sar moreira
    # 2 moreira nuã±ez

    > Sys.getlocale()
    [1] "LC_COLLATE=English_United States.1252
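
The garbled output looks like UTF-8 bytes being interpreted as Windows-1252, so one thing worth checking is whether the input is explicitly marked as UTF-8 before tokenizing. The sketch below is only a diagnostic under that assumption; it may not help if the problem lies in the tokenizer itself on a given platform or package version. The names `text_utf8` and `bigrams` are introduced here for illustration.

```r
library(dplyr)
library(tidytext)

# \u escapes keep this example independent of the source file's encoding
text <- "C\u00e9sar Moreira Nu\u00f1ez"

# Convert to (and mark as) UTF-8, then confirm the bytes are valid UTF-8
text_utf8 <- enc2utf8(text)
Encoding(text_utf8)
validUTF8(text_utf8)

# Tokenize into bigrams, as in the question
bigrams <- tibble(text = text_utf8) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams
```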

Using tidytext and broom but not finding tidier for LDA_VEM

Question: The tidytext book has examples with a tidier for topicmodels:

    library(tidyverse)
    library(tidytext)
    library(topicmodels)
    library(broom)

    year_word_counts <- tibble(year = c("2007", "2008", "2009"),
                               word = c("dog", "cat", "chicken"),
                               n = c(1753L, 1157L, 1057L))

    animal_dtm <- cast_dtm(data = year_word_counts, document = year, term = word, value = n)

    animal_lda <- LDA(animal_dtm, k = 5, control = list(seed = 1234))

    animal_lda <- tidy(animal_lda, matrix = "beta")

    # Console output
    Error in as.data.frame.default(x) : cannot coerce class "structure("LDA_VEM", package = "topicmodels")" to a data
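
The tidier for LDA models is provided by tidytext rather than by broom, so one thing to try is a fresh R session with tidytext attached and without broom. The sketch below assumes tidytext, tibble, and topicmodels are installed; it uses `k = 2` instead of the question's `k = 5` to keep the output small, and `animal_beta` is a name introduced here so the fitted model is not overwritten. It is not a confirmed fix for the error above.

```r
library(tibble)
library(tidytext)
library(topicmodels)

# Data from the question
year_word_counts <- tibble(
  year = c("2007", "2008", "2009"),
  word = c("dog", "cat", "chicken"),
  n    = c(1753L, 1157L, 1057L)
)

# Cast the tidy counts into a document-term matrix
animal_dtm <- cast_dtm(year_word_counts, document = year, term = word, value = n)

# Fit the topic model (k = 2 is an assumption made for this sketch)
animal_lda <- LDA(animal_dtm, k = 2, control = list(seed = 1234))

# tidy() is re-exported by tidytext, which also supplies the LDA method
animal_beta <- tidy(animal_lda, matrix = "beta")

animal_beta
```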

Graph with ordered bars and using facets

Question: I am trying to make a graph with bars ordered by frequency, and also using a variable to separate two variables using facets. Words have to be ordered by the value given in the 'n' variable. So, my graph should look like this one, which appears in the tidytext book: [figure not preserved]

In my graph below, words are not ordered by value. What is my mistake? [figure not preserved]

My data looks like the one in the example:

    > d
    # A tibble: 20 x 3
      word    u_c            n
      <chr>   <chr>      <dbl>
    1 apples  candidate 0.567
    2 apples  user      0.274
    3 melon   user      0.191
    4 curcuma candidate 0.105
    5 banana  user      0.0914
    6 kiwi    candidate 0.0565
    ...

Following the code provided in
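
One way to order bars within each facet is tidytext's `reorder_within()` together with `scale_y_reordered()`. The sketch below assumes dplyr, ggplot2, and a reasonably recent tidytext; it uses only the six rows visible in the question, and `d_ordered` and `p` are names introduced here for illustration.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# The six rows shown in the question
d <- tibble(
  word = c("apples", "apples", "melon", "curcuma", "banana", "kiwi"),
  u_c  = c("candidate", "user", "user", "candidate", "user", "candidate"),
  n    = c(0.567, 0.274, 0.191, 0.105, 0.0914, 0.0565)
)

# Reorder words by n separately within each level of u_c
d_ordered <- d %>%
  mutate(word = reorder_within(word, n, u_c))

# Free y scales let each facet show only its own words;
# scale_y_reordered() strips the suffix that reorder_within() appends
p <- ggplot(d_ordered, aes(x = n, y = word, fill = u_c)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ u_c, scales = "free_y") +
  scale_y_reordered()

p
```

A plain `reorder(word, n)` orders words once across the whole data set, which is why the order can look wrong inside individual facets.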

Web scraping pdf files from HTML

Question: How can I scrape the PDF documents from HTML? I am using R and I can only extract the text from HTML. An example of the website that I am going to scrape is as follows.

https://www.bot.or.th/English/MonetaryPolicy/Northern/EconomicReport/Pages/Releass_Economic_north.aspx

Regards

Answer 1: When you say you want to scrape the PDF files from HTML pages, I think the first problem you face is to actually identify the location of those PDF files.

    library(XML)
    library(RCurl)
    url <- "https://www.bot.or
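
The step of locating the PDF links can be sketched as follows. This version uses the xml2 package instead of the XML and RCurl packages named in the answer, and it parses a small hypothetical HTML snippet so that it runs offline; `page_html`, `base_url`, and the link paths are made up for illustration and are not taken from the site in the question.

```r
library(xml2)

# Hypothetical page content standing in for the real site
page_html <- '<html><body>
  <a href="/reports/2017_q1.pdf">Q1 report</a>
  <a href="/about.html">About</a>
  <a href="files/2017_Q2.PDF">Q2 report</a>
</body></html>'

# Hypothetical address the page was fetched from
base_url <- "https://example.org/pages/index.aspx"

# Collect the href attribute of every link on the page
doc <- read_html(page_html)
hrefs <- xml_attr(xml_find_all(doc, "//a[@href]"), "href")

# Keep only links ending in .pdf (case-insensitive) and make them absolute
pdf_hrefs <- hrefs[grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
pdf_urls <- url_absolute(pdf_hrefs, base_url)

pdf_urls

# Each file could then be fetched with base R, for example:
# download.file(pdf_urls[1], destfile = basename(pdf_urls[1]), mode = "wb")
```

For a live page, `read_html()` can be given the page URL directly, provided the site serves the links in static HTML rather than building them with JavaScript.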