Question
Hi, I'm working with the tidy text format and I'm trying to replace the strings "emails" and "emailing" with "email".
library(dplyr)
library(tidytext)
library(ggplot2)
set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)
# One token (word) per row
tidy_df <- df %>%
  unnest_tokens(word, sentence)
# Bar chart of the words that occur more than 20 times
tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
This works fine, but when I use:
tidy_df <- gsub("emailing", "email", tidy_df)
to substitute words and run the bar chart again, I get the following error message:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
Does anyone know how to easily substitute words within the tidy text format without changing the structure/class of tidy_df?
Answer 1:
Removing the ends of words like that is called stemming, and there are a couple of packages in R that will do that for you, if you'd like. One is the hunspell package from rOpenSci, and another option is the SnowballC package, which implements the Porter stemming algorithm. You would implement that like so:
library(dplyr)
library(tidytext)
library(SnowballC)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
set.seed(123)
# tibble() replaces data_frame(), which is deprecated in current tibble/dplyr
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 i
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 i
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
Notice that it is stemming all your text and that some of the words don't look like real words anymore; you may or may not care about that.
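To see what the stemmer does to individual tokens before applying it to a whole column, it can help to call wordStem() on a small vector first. A minimal sketch, assuming SnowballC is installed; the sample words are taken from the example sentences above, and sample_words is just an illustrative name:

```r
library(SnowballC)

# Words taken from the example sentences; wordStem() defaults to the Porter stemmer.
sample_words <- c("emails", "emailing", "computer", "freaks", "is")

# Pair each original word with its stem to spot stems that are not real words.
stem_check <- data.frame(word = sample_words, stem = wordStem(sample_words))
stem_check
```

Scanning the stem column this way is a quick check on whether full stemming is acceptable for your plot labels.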
If you don't want to stem all your text using a stemmer like SnowballC or hunspell, you can use dplyr's if_else() within mutate() to replace just specific words.
set.seed(123)
# tibble() replaces data_frame(), which is deprecated in current tibble/dplyr
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
Or it might make more sense for you to use str_replace() from the stringr package.
library(stringr)
set.seed(123)
# tibble() replaces data_frame(), which is deprecated in current tibble/dplyr
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
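As for the error in the question: gsub() was applied to the whole data frame, which base R coerces to a plain character vector, so the later count() call no longer had a data frame to work on. A minimal base R sketch of applying gsub() to the word column instead; the small tidy_df here is a made-up stand-in for the real token table, and the ^...$ anchors are an added assumption so that only whole tokens are rewritten:

```r
# Toy token table standing in for tidy_df (one word per row).
tidy_df <- data.frame(
  word = c("emails", "are", "nice", "emailing", "is", "fun"),
  stringsAsFactors = FALSE
)

# gsub() on the whole data frame returns a character vector, not a data frame.
whole_df_result <- gsub("emailing", "email", tidy_df)
class(whole_df_result)

# Applying gsub() to the column keeps the data frame structure intact.
tidy_df$word <- gsub("^email(s|ing)$", "email", tidy_df$word)
class(tidy_df)
tidy_df$word
```

The same anchored pattern should also work with str_replace() inside mutate() if you prefer to stay in a pipeline.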
Source: https://stackoverflow.com/questions/43344108/word-substitution-within-tidy-text-format