Question
I have a column of firm names in an R dataframe that goes something like this:
"ABC Industries"
"ABC Enterprises"
"123 and 456 Corporation"
"XYZ Company"
And so on. I'm trying to generate frequency tables of every word that appears in this column, so for example, something like this:
Industries 10
Corporation 31
Enterprise 40
ABC 30
XYZ 40
I'm relatively new to R, so I was wondering what a good way to approach this would be. Should I be splitting the strings and placing every distinct word into a new column? Is there a way to split up a multi-word row into multiple rows with one word each?
Answer 1:
If you wanted to, you could do it in a one-liner:
R> text <- c("ABC Industries", "ABC Enterprises",
+           "123 and 456 Corporation", "XYZ Company")
R> table(do.call(c, lapply(text, function(x) unlist(strsplit(x, " ")))))
        123         456         ABC         and     Company
          1           1           2           1           1
Corporation Enterprises  Industries         XYZ
          1           1           1           1
R>
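Here I use strsplit() to break each entry into its components; for a single entry this returns a list, which unlist() flattens into a character vector, so lapply() yields a list of character vectors. I then use do.call() to simply concatenate all of those vectors into one, which table() summarises.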
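The same steps can be spelled out one at a time. This is only a sketch of an equivalent approach, not the answer's exact code: it relies on strsplit() being vectorised over its input, so the lapply() and do.call() wrappers are not strictly needed, and the sort() call is an addition to put the most frequent words first. The intermediate names pieces, words and freq are illustrative.

```r
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")

# strsplit() is vectorised: it returns a list with one character vector per input string
pieces <- strsplit(text, " ")

# Flatten the list into a single character vector of words
words <- unlist(pieces)

# Tabulate, then sort so the most frequent words come first (an added step)
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Keeping the intermediate objects around makes it easier to inspect what each step produces before tabulating.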
Answer 2:
Here is another one-liner. It uses paste() to combine all of the column entries into a single long text string, which it then splits apart and tabulates:
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")
table(strsplit(paste(text, collapse=" "), " "))
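Real firm-name columns often contain stray or repeated spaces, which would show up as empty "words" when splitting on a single space. The following is a hedged variant of the paste-and-split idea, assuming whitespace is the only separator: it trims the combined string, splits on runs of whitespace, and returns the counts as a data frame. The extra spaces in the sample data and the names combined, words and freq_df are made up for illustration.

```r
# Sample data with deliberately messy spacing (illustrative, not from the question)
text <- c(" ABC  Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")

# Collapse the column into one string and trim leading/trailing whitespace
combined <- trimws(paste(text, collapse = " "))

# Split on runs of whitespace rather than on a single space
words <- strsplit(combined, "\\s+")[[1]]

# Turn the table into a data frame with one row per word, most frequent first
freq_df <- as.data.frame(table(word = words), stringsAsFactors = FALSE)
freq_df <- freq_df[order(-freq_df$Freq), ]
print(freq_df)
```

A data frame is usually more convenient than a table object if the counts need to be filtered, merged or written to a file afterwards.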
Answer 3:
You can use the packages tidytext and dplyr:
set.seed(42)
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")
data <- data.frame(category = sample(text, 100, replace = TRUE),
                   stringsAsFactors = FALSE)
library(tidytext)
library(dplyr)
data %>%
  unnest_tokens(word, category) %>%
  group_by(word) %>%
  count()
#> # A tibble: 9 x 2
#> # Groups:   word [9]
#>   word            n
#>   <chr>       <int>
#> 1 123            29
#> 2 456            29
#> 3 abc            45
#> 4 and            29
#> 5 company        26
#> 6 corporation    29
#> 7 enterprises    21
#> 8 industries     24
#> 9 xyz            26
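Note that unnest_tokens() lowercases tokens by default, which is why the output shows abc rather than ABC. It also reshapes the data to one word per row, which is what the question asked about. For readers who want that long format without extra packages, here is a rough base R approximation. It is a sketch under simplifying assumptions: it splits on whitespace only, so it will differ from tidytext's tokenizer when names contain punctuation, and the id column is a hypothetical row identifier added for illustration.

```r
# Small example data frame; the id column is hypothetical
data <- data.frame(id = 1:4,
                   category = c("ABC Industries", "ABC Enterprises",
                                "123 and 456 Corporation", "XYZ Company"),
                   stringsAsFactors = FALSE)

# Lowercase each name and split it on runs of whitespace
tokens <- strsplit(tolower(data$category), "\\s+")

# Repeat each row id once per word so that every word gets its own row
long <- data.frame(id = rep(data$id, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)

# Count how often each word occurs in the long table
counts <- as.data.frame(table(word = long$word), stringsAsFactors = FALSE)
print(long)
print(counts)
```

Keeping the id column in the long table makes it possible to trace each word back to the firm name it came from.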
Source: https://stackoverflow.com/questions/8676158/splitting-strings-and-generating-frequency-tables-in-r