I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or
Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.
This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)
as.mydata <- function(x)
{
UseMethod("as.mydata")
}
as.mydata.character <- function(x)
{
convert <- function(x)
{
md <- list()
md$phrase = x
spl <- strsplit(x, " ")[[1]]
md$num_words <- length(spl)
md$token_lengths <- nchar(spl)
class(md) <- "mydata"
md
}
lapply(x, convert)
}
Now your whole dataset looks like
mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))
mydataset
[[1]]
$phrase
[1] "hello world"
$num_words
[1] 2
$token_lengths
[1] 5 5
attr(,"class")
[1] "mydata"
[[2]]
$phrase
[1] "greetings"
$num_words
[1] 1
$token_lengths
[1] 9
attr(,"class")
[1] "mydata"
[[3]]
$phrase
[1] "take me to your leader"
$num_words
[1] 5
$token_lengths
[1] 4 2 2 4 6
attr(,"class")
[1] "mydata"
You can define a print method to make this look prettier.
print.mydata <- function(x)
{
cat(x$phrase, "consists of", x$num_words, "words, with", paste(x$token_lengths, collapse=", "), "letters.")
}
mydataset
[[1]]
hello world consists of 2 words, with 5, 5 letters.
[[2]]
greetings consists of 1 words, with 9 letters.
[[3]]
take me to your leader consists of 5 words, with 4, 2, 2, 4, 6 letters.
The sample operations you wanted to do are fairly straightforward with data in this format.
sapply(mydataset, function(x) nchar(x$phrase) > 10)
[1] TRUE FALSE TRUE