Best way to store variable-length data in an R data.frame?

日久生厌 2021-02-06 02:59

I have some mixed-type data that I would like to store in an R data structure of some sort. Each data point has a set of fixed attributes which may be 1-d numeric, factors, or

5 answers
  • 2021-02-06 03:38

    I would just use the data in the "long" format.

    E.g.

    > d1 <- data.frame(id=1:3, num_words=c(2,1,5), phrase=c("hello world", "greetings", "take me to your leader"), stringsAsFactors=FALSE)
    > d2 <- data.frame(id=c(rep(1,2), rep(2,1), rep(3,5)), token_length=c(5,5,9,4,2,2,4,6))
    > d2$tokenid <- with(d2, ave(token_length, id, FUN=seq_along))
    > d <- merge(d1,d2)
    > subset(d, nchar(phrase) > 10)
      id num_words                 phrase token_length tokenid
    1  1         2            hello world            5       1
    2  1         2            hello world            5       2
    4  3         5 take me to your leader            4       1
    5  3         5 take me to your leader            2       2
    6  3         5 take me to your leader            2       3
    7  3         5 take me to your leader            4       4
    8  3         5 take me to your leader            6       5
    > with(d, tapply(token_length, id, mean))
      1   2   3 
    5.0 9.0 3.6 
    

    Once the data is in the long format, you can use sqldf or plyr to extract what you want from it.
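sqldf and plyr are add-on packages; as a rough base R sketch of the same kind of extraction, `aggregate()` can stand in for them. The table names follow the example above, and the choice of `aggregate()` is an assumption made here for illustration, not part of the original answer.

```r
# Phrase-level table: one row per phrase
d1 <- data.frame(
  id = 1:3,
  phrase = c("hello world", "greetings", "take me to your leader"),
  stringsAsFactors = FALSE
)

# Token-level table: one row per token, keyed by phrase id
d2 <- data.frame(
  id = c(1, 1, 2, 3, 3, 3, 3, 3),
  token_length = c(5, 5, 9, 4, 2, 2, 4, 6)
)

# Long format: every token row carries its phrase attributes
d <- merge(d1, d2, by = "id")

# Filter on a phrase-level attribute, then summarise the token-level one
long_phrases <- d[nchar(d$phrase) > 10, ]
token_means <- aggregate(token_length ~ id, data = long_phrases, FUN = mean)
print(token_means)
```

The filter and the summary both operate on ordinary atomic columns, which is the main attraction of the long format.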

  • 2021-02-06 03:39

    Another option would be to convert your data frame into a matrix of mode list, so that each element of the matrix is a list. Standard array operations (slicing with [, apply(), etc.) would then be applicable.

    > d <- data.frame(id=c(1,2,3), num_tokens=c(2,1,5), token_lengths=as.array(list(c(5,5), 9, c(4,2,2,4,6))))
    > m <- as.matrix(d)
    > mode(m)
    [1] "list"
    > m[,"token_lengths"]
    [[1]]
    [1] 5 5
    
    [[2]]
    [1] 9
    
    [[3]]
    [1] 4 2 2 4 6
    
    > m[3,]
    $id
    [1] 3
    
    $num_tokens
    [1] 5
    
    $token_lengths
    [1] 4 2 2 4 6
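As a minimal sketch of what "standard array operations" might look like on such a matrix, the snippet below builds the list column by plain assignment (an assumption made here for brevity, in place of the `as.array(list(...))` construction above) and then summarises one column with `sapply()`.

```r
# Data frame with a list column: each cell holds a numeric vector
d <- data.frame(id = c(1, 2, 3), num_tokens = c(2, 1, 5))
d$token_lengths <- list(c(5, 5), 9, c(4, 2, 2, 4, 6))

# Because one column is non-atomic, as.matrix() returns a matrix of mode "list"
m <- as.matrix(d)
stopifnot(mode(m) == "list")

# Column slice is a plain list, so sapply() works element by element
mean_len <- sapply(m[, "token_lengths"], mean)

# Row slice returns one record as a named list
row3 <- m[3, ]
print(mean_len)
```

Each cell of `m` is a list element, so numeric columns such as `id` have to be unlisted before arithmetic on them.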
    
  • 2021-02-06 03:45

    Trying to shoehorn the data into a data frame seems hackish to me. Far better to consider each row as an individual object, then think of the dataset as an array of these objects.

    This function converts your data strings to an appropriate format. (This is S3 style code; you may prefer to use one of the 'proper' object oriented systems.)

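    # Generic: dispatches on the class of x
    as.mydata <- function(x)
    {
       UseMethod("as.mydata")
    }
    
    # Method for character vectors: returns a list with one "mydata" object per phrase
    as.mydata.character <- function(x)
    {
       convert <- function(x)
       {
          md <- list()
          md$phrase <- x
          spl <- strsplit(x, " ")[[1]]    # split the phrase into words
          md$num_words <- length(spl)
          md$token_lengths <- nchar(spl)  # number of letters in each word
          class(md) <- "mydata"
          md
       }
       lapply(x, convert)
    }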
    

    Now your whole dataset looks like

    mydataset <- as.mydata(c("hello world", "greetings", "take me to your leader"))
    
    mydataset
    [[1]]
    $phrase
    [1] "hello world"
    
    $num_words
    [1] 2
    
    $token_lengths
    [1] 5 5
    
    attr(,"class")
    [1] "mydata"
    
    [[2]]
    $phrase
    [1] "greetings"
    
    $num_words
    [1] 1
    
    $token_lengths
    [1] 9
    
    attr(,"class")
    [1] "mydata"
    
    [[3]]
    $phrase
    [1] "take me to your leader"
    
    $num_words
    [1] 5
    
    $token_lengths
    [1] 4 2 2 4 6
    
    attr(,"class")
    [1] "mydata"
    

    You can define a print method to make this look prettier.

    print.mydata <- function(x, ...)
    {
       cat(x$phrase, "consists of", x$num_words, "word(s), with", paste(x$token_lengths, collapse=", "), "letters.\n")
       invisible(x)
    }
    mydataset
    [[1]]
    hello world consists of 2 word(s), with 5, 5 letters.
    
    [[2]]
    greetings consists of 1 word(s), with 9 letters.
    
    [[3]]
    take me to your leader consists of 5 word(s), with 4, 2, 2, 4, 6 letters.
    

    The sample operations you wanted to do are fairly straightforward with data in this format.

    sapply(mydataset, function(x) nchar(x$phrase) > 10)
    [1]  TRUE FALSE  TRUE
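Other per-record operations follow the same pattern. The sketch below is self-contained, so it uses a hypothetical compact constructor, `make_record()`, as a stand-in for `as.mydata()`; the field names match the objects shown above.

```r
# Hypothetical constructor (illustrative name) building one "mydata" record
make_record <- function(phrase) {
  tokens <- strsplit(phrase, " ")[[1]]
  structure(
    list(
      phrase = phrase,
      num_words = length(tokens),
      token_lengths = nchar(tokens)
    ),
    class = "mydata"
  )
}

mydataset <- lapply(
  c("hello world", "greetings", "take me to your leader"),
  make_record
)

# Mean token length per record; vapply() enforces one numeric value each
mean_lengths <- vapply(mydataset, function(x) mean(x$token_lengths), numeric(1))

# Keep only the records whose phrase is longer than 10 characters
long_records <- Filter(function(x) nchar(x$phrase) > 10, mydataset)
```

`Filter()` returns the matching records themselves, whereas the `sapply()` call above returns only the logical mask.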
    
  • 2021-02-06 03:54

    I would also use strings for the variable-length data, but store each value as R code, e.g. "c(5,5)" for the first phrase. The string then has to be evaluated with eval(parse(text=...)) before any computation can be carried out.

    For example, the mean can be computed as follows:

    sapply(data$token_lengths,function(str) mean(eval(parse(text=str))))
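A self-contained version of that call might look like the sketch below. The data frame `d` and its contents are assumed here for illustration, since the answer does not show how `data` was built.

```r
# Assumed example data: each cell stores an R expression as text
d <- data.frame(
  id = 1:3,
  token_lengths = c("c(5,5)", "c(9)", "c(4,2,2,4,6)"),
  stringsAsFactors = FALSE
)

# Parse and evaluate each string, then take the mean of the resulting vector
token_means <- sapply(
  d$token_lengths,
  function(s) mean(eval(parse(text = s))),
  USE.NAMES = FALSE
)
print(token_means)
```

`USE.NAMES = FALSE` keeps `sapply()` from labelling the result with the source strings.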

  • 2021-02-06 03:59

    Since the R data frame structure is based loosely on the SQL table, having each element of the data frame be anything other than an atomic data type is uncommon. However, it can be done, as you've shown, and this linked post describes such an application implemented on a larger scale.

    An alternative is to store your data as a string and have a function to retrieve it, or create a separate function to which the data is attached and extract it using indices stored in your data frame.

    > ## alternative 1
    > tokens <- function(x,i=TRUE) Map(as.numeric,strsplit(as.character(x[i]),","))
    > d <- data.frame(id=c(1,2,3), token_lengths=c("5,5", "9", "4,2,2,4,6"), stringsAsFactors=FALSE)
    > 
    > tokens(d$token_lengths)
    [[1]]
    [1] 5 5
    
    [[2]]
    [1] 9
    
    [[3]]
    [1] 4 2 2 4 6
    
    > tokens(d$token_lengths,2:3)
    [[1]]
    [1] 9
    
    [[2]]
    [1] 4 2 2 4 6
    
    > 
    > ## alternative 2
    > retrieve <- local({
    +   token_lengths <- list(c(5,5), 9, c(4,2,2,4,6))
    +   function(i) token_lengths[i]
    + })
    > 
    > d <- data.frame(id=c(1,2,3), token_lengths=1:3)
    > retrieve(d$token_lengths[2:3])
    [[1]]
    [1] 9
    
    [[2]]
    [1] 4 2 2 4 6
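For the string-based alternative, the encoding step is not shown above. One possible round trip is sketched here; `encode_lengths()` and `decode_lengths()` are hypothetical helper names introduced only for this example.

```r
# Hypothetical helpers (illustrative names, not from the answer above)
# Collapse each numeric vector into one comma-separated string
encode_lengths <- function(lst) {
  vapply(lst, paste, character(1), collapse = ",")
}

# Split each string on commas and convert the pieces back to numbers
decode_lengths <- function(x) {
  lapply(strsplit(as.character(x), ",", fixed = TRUE), as.numeric)
}

token_lengths <- list(c(5, 5), 9, c(4, 2, 2, 4, 6))

d <- data.frame(
  id = 1:3,
  token_lengths = encode_lengths(token_lengths),
  stringsAsFactors = FALSE
)

# Decoding the stored column should give back the original list
restored <- decode_lengths(d$token_lengths)
```

The stored column is an ordinary character vector, so the data frame stays fully atomic and can be written to CSV or a database unchanged.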
    