I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame:
data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
(Please note the different delimiters across columns.)
The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative).
I need two data frames like these:
tok1.occurrences:
+----+---+---+---+---+---+
| id | a | b | c | d | e |
+----+---+---+---+---+---+
|  1 | 1 | 1 | 1 | 0 | 0 |
|  2 | 2 | 0 | 0 | 1 | 0 |
|  3 | 0 | 1 | 0 | 1 | 1 |
+----+---+---+---+---+---+
tok2.occurrences:
+----+-------+-------+---------+-------+-------+
| id | alpha | bravo | charlie | delta | tango |
+----+-------+-------+---------+-------+-------+
|  1 |     1 |     1 |       0 |     0 |     0 |
|  2 |     1 |     0 |       1 |     0 |     0 |
|  3 |     0 |     0 |       0 |     1 |     2 |
+----+-------+-------+---------+-------+-------+
I tried using this syntax:
tok1.f = factor(data$tok1)
dummies <- model.matrix(~tok1.f)
This ended up being an incomplete solution: it creates my dummy variables correctly, but (obviously) it does not split on the delimiter.
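For example, each unsplit string just becomes its own factor level, so the dummies describe whole strings rather than individual tokens:
levels(factor(data$tok1))
# [1] "a, a, d" "a, b, c" "b, d, e"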
I know I can use the 'tm' package to build a document-term matrix, but it seems like overkill for such simple tokenization. Is there a more straightforward way?
The easiest thing that I can think of is to use my cSplit function in conjunction with dcast.data.table, like this:
library(splitstackshape)
dcast.data.table(cSplit(data, "tok1", ", ", "long"),
id ~ tok1, value.var = "tok1",
fun.aggregate = length)
#    id a b c d e
# 1:  1 1 1 1 0 0
# 2:  2 2 0 0 1 0
# 3:  3 0 1 0 1 1
dcast.data.table(cSplit(data, "tok2", "|", "long"),
id ~ tok2, value.var = "tok2",
fun.aggregate = length)
#    id alpha bravo charlie delta tango
# 1:  1     1     1       0     0     0
# 2:  2     1     0       1     0     0
# 3:  3     0     0       0     1     2
Edit: Updated with library(splitstackshape), since cSplit is now part of that package.
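Since the number of token columns may not be known in advance, one possible extension is to loop the same call over the columns (a sketch only; delims and res are illustrative names, and it assumes the delimiter for each column is known or has been detected):
library(splitstackshape)
delims <- c(tok1 = ", ", tok2 = "|")   # assumed per-column delimiters
res <- lapply(names(delims), function(col) {
  dcast.data.table(cSplit(data, col, delims[[col]], "long"),
                   as.formula(paste("id ~", col)),
                   value.var = col, fun.aggregate = length)
})
names(res) <- names(delims)
res$tok1   # same table as above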
If you don't mind using data.table (temporarily), this might work for you:
library(data.table)
data <- data.frame(id=c(1,2,3),
                   tok1=c("a, b, c", "a, a, d", "b, d, e"),
                   tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))
splitCols <- function(col_name, data) {
  # strsplit needs strings
  data[, col_name] <- as.character(data[, col_name])
  # make a list of single-row data frames from the tabulation
  # of the items in each split string
  tokens <- lapply(strsplit(data[, col_name], "[^[:alnum:]]+"), function(x) {
    tab <- table(x)
    setNames(rbind.data.frame(as.numeric(tab)), names(tab))
  })
  # use data.table's rbindlist, filling in missing values
  rbl <- rbindlist(tokens, fill=TRUE)
  # zero out the NAs
  rbl[is.na(rbl)] <- 0
  # add the "id" column
  cbind(id=data$id, rbl)
}
lapply(names(data)[-1], splitCols, data)
## [[1]]
##    id a b c d e
## 1:  1 1 1 1 0 0
## 2:  2 2 0 0 1 0
## 3:  3 0 1 0 1 1
##
## [[2]]
##    id alpha bravo charlie delta tango
## 1:  1     1     1       0     0     0
## 2:  2     1     0       1     0     0
## 3:  3     0     0       0     1     2
You end up with a list of data frames that you can then process as you see fit.
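If you want to keep track of which result belongs to which original column, one small variation is to name the list elements after the columns (illustrative only; res is just a placeholder name):
res <- setNames(lapply(names(data)[-1], splitCols, data), names(data)[-1])
res$tok2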
You could use the stringr package as follows:
require(stringr)
test_data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
#Convert to character class and standardise the delimiter as ","
test_data$tok1<-as.character(test_data$tok1)
test_data$tok1<-gsub(" ","",test_data$tok1)
test_data$tok2=gsub("\\|",",",as.character(test_data$tok2))
#Unique list of elements for each column
tok1.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok1),","))))
tok2.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok2),","))))
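#For reference, the unique token vectors produced above look like this (illustrative check):
# > tok1.uniq
# [1] "a" "b" "c" "d" "e"
# > tok2.uniq
# [1] "alpha"   "bravo"   "charlie" "delta"   "tango"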
#Token count for each column
#In each row, count the occurrences of each unique token using str_count from the stringr package
Column one:
tok1.occurances=do.call(cbind,lapply(tok1.uniq,function(x) {
  DF=data.frame(do.call(rbind,lapply(test_data$tok1,function(y,z=x) str_count(y,z))))
  colnames(DF) = x
  return(DF)
}))
#Add ID number as column
tok1.occurances=data.frame(id=as.numeric(row.names(tok1.occurances)),tok1.occurances,stringsAsFactors=FALSE)
# > tok1.occurances
#   id a b c d e
# 1  1 1 1 1 0 0
# 2  2 2 0 0 1 0
# 3  3 0 1 0 1 1
Column two:
tok2.occurances=do.call(cbind,lapply(tok2.uniq,function(x) {
  DF=data.frame(do.call(rbind,lapply(test_data$tok2,function(y,z=x) str_count(y,z))))
  colnames(DF) = x
  return(DF)
}))
tok2.occurances=data.frame(id=as.numeric(row.names(tok2.occurances)),tok2.occurances,stringsAsFactors=FALSE)
# > tok2.occurances
#   id alpha bravo charlie delta tango
# 1  1     1     1       0     0     0
# 2  2     1     0       1     0     0
# 3  3     0     0       0     1     2
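One caveat with the str_count approach: str_count(y, z) treats z as a regular expression and counts substring matches, so a token that is contained in another token (say "alph" inside "alpha") would be over-counted. Wrapping each token in word boundaries is one way to guard against that (a minimal sketch):
str_count("alpha,alphabet", "alph")                        # 2 - substring matches
str_count("alpha,alphabet", paste0("\\b", "alph", "\\b"))  # 0 - whole tokens only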
Source: https://stackoverflow.com/questions/26012169/split-delimited-strings-into-distinct-columns-in-r-dataframe