Dataframe within dataframe?

孤街浪徒 2021-01-04 02:38

Consider this example:

df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])

fun.split <- function(x) tolower(as.character(x))
df$new.letter         


        
3 Answers
  • 2021-01-04 02:58

    The reason is that you assigned the two-column matrix returned by apply to a single new column, so the result is a matrix stored in one column. You can convert it back to a normal data.frame with

     do.call(data.frame, df)
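
A minimal sketch of the situation described above. The question's code is cut off, so the `apply()` assignment here is an assumption inferred from this answer and from the `str()` output shown further down; `flat` is a name introduced only for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(id = 1:10, var1 = LETTERS[1:10], var2 = LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))

# Assumed assignment: apply() returns a 10 x 2 character matrix,
# and that whole matrix is stored as ONE column of df
df$new.letters <- apply(df[, 2:3], 2, fun.split)

ncol(df)                    # the matrix counts as a single column
is.matrix(df$new.letters)   # TRUE
dim(df$new.letters)         # 10 rows, 2 columns

# Flatten: each column of the matrix becomes an ordinary data.frame column
flat <- do.call(data.frame, df)
names(flat)
```

After flattening, the two matrix columns appear under the names `new.letters.var1` and `new.letters.var2`.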
    

    A more straightforward method is to assign two columns directly. I use lapply instead of apply because the columns may be of different classes: apply returns a matrix, so with mixed classes every column is coerced to 'character', whereas lapply returns a list and preserves each column's class.

    df[paste0('new.letters', names(df)[2:3])] <- lapply(df[2:3], fun.split)
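
To illustrate the class difference mentioned above, here is a small sketch on a made-up two-column frame (`mixed`, `via_apply` and `via_lapply` are hypothetical names, not part of the question).

```r
# One integer column and one character column
mixed <- data.frame(n = 1:3, s = c("a", "b", "c"), stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix first,
# so the integer column is coerced to character
via_apply <- apply(mixed, 2, function(x) x)
class(via_apply[, "n"])

# lapply() visits each column as it is, so each keeps its own class
via_lapply <- lapply(mixed, function(x) x)
vapply(via_lapply, function(col) class(col)[1], character(1))
```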
    
  • 2021-01-04 02:59

    In this case R doesn't behave as one would expect, but digging deeper helps to explain it. What is a data frame? As Norman Matloff says in his book (chapter 5):

    a data frame is a list, with the components of that list being equal-length vectors

    The following code may help to understand this.

    class(df$new.letters)
    [1] "matrix"
    
    
    str(df)
    'data.frame':   10 obs. of  4 variables:
     $ id         : int  1 2 3 4 5 6 7 8 9 10
     $ var1       : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
     $ var2       : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
     $ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : NULL
      .. ..$ : chr  "var1" "var2"
    

    Maybe the reason why it looks strange is in the print methods. Consider this:

    colnames(df$new.letters)
    [1] "var1" "var2"
    

    So there must be something in the print methods that combines the sub-names of objects and displays them all.

    For example, the vectors that make up df are:

    names(df)
    [1] "id"          "var1"        "var2"        "new.letters"
    

    But in this case the component new.letters also has a dim attribute (it is in fact a matrix) whose columns are named var1 and var2. See this code:

    attributes(df$new.letters)
    $dim
    [1] 10  2
    
    $dimnames
    $dimnames[[1]]
    NULL
    
    $dimnames[[2]]
    [1] "var1" "var2"
    

    Yet when we print df, we see them all as if they were separate vectors (and so separate columns of the data.frame!).
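
The gap between what is stored and what is displayed can be seen on a small made-up frame. This is only a sketch: `df_small` and its column `m` are hypothetical, and `as.matrix()` is used here as a stand-in for the conversion that happens when a data frame is prepared for printing.

```r
# A three-row frame with one ordinary column and one matrix column
df_small <- data.frame(id = 1:3)
df_small$m <- matrix(letters[1:6], ncol = 2,
                     dimnames = list(NULL, c("var1", "var2")))

is.list(df_small)               # a data frame is a list of components
names(df_small)                 # two components: "id" and "m"

# Converting to a matrix expands the matrix column into one column
# per matrix column, combining the component name and the column names
colnames(as.matrix(df_small))
```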

    Edit: Print methods

    Out of curiosity, and to improve this answer, I looked at the methods of the print generic:

    methods(print)
    

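    The previous code produces a very long list of methods for the generic function print. That list does include print.data.frame, which formats the data frame and converts it to a character matrix with as.matrix before printing; in that process a matrix column is expanded into one printed column per matrix column. For comparison, the method that handles plain lists is listof: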

    getS3method("print", "listof")
    function (x, ...) 
    {
        nn <- names(x)
        ll <- length(x)
        if (length(nn) != ll) 
            nn <- paste("Component", seq.int(ll))
        for (i in seq_len(ll)) {
            cat(nn[i], ":\n")
            print(x[[i]], ...)
            cat("\n")
        }
        invisible(x)
    }
    <bytecode: 0x101afe1c8>
    <environment: namespace:base>
    

    I may be wrong, but this code seems to hold useful information about why that happens, specifically the if (length(nn) != ll) check.

  • 2021-01-04 03:05

    @akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing the depth to which this was happening.

    In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:

        unnest_dataframes <- function(x) {
    
            y <- do.call(data.frame, x)
    
            # Recurse while any column is still a data.frame, and keep the recursive result
            if (any(vapply(y, is.data.frame, logical(1)))) y <- unnest_dataframes(y)
    
            y
    
        }
    
        new_data <- unnest_dataframes(df)
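
As a usage sketch for the idea above, the following builds a frame nested two levels deep and flattens it. `flatten_nested`, `inner`, `middle` and `outer` are hypothetical names introduced for illustration, and the helper uses a loop rather than recursion; it is one possible variant, not the answer's exact function.

```r
# Hypothetical helper: repeat the do.call() step until no column is a data.frame
flatten_nested <- function(x) {
  while (any(vapply(x, is.data.frame, logical(1)))) {
    x <- do.call(data.frame, x)   # each pass unpacks one level of nesting
  }
  x
}

# Build a data frame nested two levels deep
inner  <- data.frame(p = 1:2, r = 5:6)
middle <- data.frame(q = 3:4)
middle$inner <- inner             # data.frame stored as a column
outer  <- data.frame(id = c("a", "b"))
outer$middle <- middle            # nested one level further

flat <- flatten_nested(outer)
ncol(flat)                                       # id, q, p and r as plain columns
any(vapply(flat, is.data.frame, logical(1)))     # FALSE: nothing left to unpack
```

The loop makes the termination condition explicit: it stops as soon as no column is a data frame, however deep the nesting was.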
    

    This can still run into problems, though, so it can help to separate all columns of class "data.frame" from the original data set, unnest them, and then cbind() the result back together, like so:

      # Find all columns that are data.frame
      # Assuming your data frame is stored in variable 'y'
      # is.data.frame() always yields a single TRUE/FALSE per column
      data.frame.cols <- unname(vapply(y, is.data.frame, logical(1)))
      z <- y[, !data.frame.cols, drop = FALSE]
    
      # All columns of class "data.frame"
      dfs <- y[, data.frame.cols, drop = FALSE]
    
      # Recursively unnest each of these columns
      unnest_dataframes <- function(x) {
        y <- do.call(data.frame, x)
        if (any(vapply(y, is.data.frame, logical(1)))) {
            # Keep the recursive result instead of discarding it
            y <- unnest_dataframes(y)
        } else {
            cat('Nested data.frames successfully unpacked\n')
        }
        y
      }
    
      df2 <- unnest_dataframes(dfs)
    
      # Combine with original data
      all_columns <- cbind(z, df2)
    