Creating new data frames from a larger data frame using a list

深忆病人 2021-01-24 16:23

I have a data frame that contains multiple data points for a large number of samples. Here is a shortened example with 3 samples each with 3 data points:

Assay           


        
1 answer
  • 2021-01-24 16:48

    The split command is the easiest way to turn this into a list of data.frame objects split on sample.

    myList <- split(mydf, mydf$Sample)
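
    As a rough sketch of what split returns, the example below builds a small stand-in for mydf. The column names Assay, Genotype and Sample and the name sam1 come from this thread, but every value in the data frame is made up for illustration, since the original table is not shown here.

```r
# Hypothetical stand-in for the data in the question: 3 samples, 3 rows each.
# All values are invented; only the column names come from the thread.
mydf <- data.frame(
  Assay    = rep(c("a1", "a2", "a3"), times = 3),
  Genotype = c("AA", "AB", "BB", "AB", "AB", "AA", "BB", "AA", "AB"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# One data.frame per unique value of Sample, named after that value
myList <- split(mydf, mydf$Sample)

names(myList)      # the unique Sample values
myList$sam1        # the rows that belong to sam1
myList[[1]]        # the same rows, reached by position
```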
    

    The items in the list can be accessed by numeric index (e.g. myList[[1]]) or by the name of the corresponding unique value of Sample (e.g. myList$sam1).

    Numeric indexing is obviously handy when you're going through a sequence, but you can use the names for that as well.

     # get the unique values of Sample
     nam <- unique(mydf$Sample)
     # as a test, look at the first few rows of each data.frame
     for (i in nam) print(head(myList[[i]]))
     # with() is another way to access the columns of each data.frame
     for (i in nam) with(myList[[i]], print(Assay[1:2]))
    

    That's not necessarily the most efficient R syntax but hopefully it gets you farther along in actually using your list of data.frame objects.
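
    One possibly more idiomatic alternative to the for loops is to apply a function over the list with lapply or sapply. This is only a sketch: the data frame is a made-up stand-in using the thread's column names, and rows_per_sample and first_assays are hypothetical names chosen for the example.

```r
# Hypothetical stand-in for mydf; all values are invented for illustration.
mydf <- data.frame(
  Assay    = rep(c("a1", "a2", "a3"), times = 3),
  Genotype = c("AA", "AB", "BB", "AB", "AB", "AA", "BB", "AA", "AB"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)
myList <- split(mydf, mydf$Sample)

# sapply simplifies the result to a named vector: one row count per sample
rows_per_sample <- sapply(myList, nrow)

# lapply keeps a list: the first two Assay values of each sample
first_assays <- lapply(myList, function(d) head(d$Assay, 2))

print(rows_per_sample)
print(first_assays)
```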

    Now, that gives you what you asked for, but here's some advice on what you asked for: don't do it. Just learn to access your data.frame object properly. You could just as easily skip making the list and go through all of the unique values of Sample in your code, including saving them out as separate files. The advantage is that you can run lots of nifty vectorized commands on your intact data.frame across Sample that are much harder to do on the list. Just stick with your nice big data.frame.
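
    For instance, saving one file per sample does not need the list at all. The sketch below assumes a made-up stand-in for mydf and writes to a temporary directory; the file naming scheme (the Sample value plus ".csv") is an assumption for the example, not something from the question.

```r
# Hypothetical stand-in for mydf; all values are invented for illustration.
mydf <- data.frame(
  Assay    = rep(c("a1", "a2", "a3"), times = 3),
  Genotype = c("AA", "AB", "BB", "AB", "AB", "AA", "BB", "AA", "AB"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# Write to a temporary directory so the example leaves no files behind
out_dir <- tempdir()

for (s in unique(mydf$Sample)) {
  # Assumed naming scheme: <Sample value>.csv
  out_file <- file.path(out_dir, paste0(s, ".csv"))
  # Subset the intact data.frame on the fly; no list is ever built
  write.csv(mydf[mydf$Sample == s, ], out_file, row.names = FALSE)
}
```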

    Here are a couple of simple examples. Look at what I did above for just getting the first few lines of each of the separate data.frame objects in the list. Here's something similar just run on the big data.frame.

    invisible(lapply(unique(mydf$Sample), function(x) print(head(mydf[mydf$Sample == x, ]))))
    

    How about something more meaningful? Let's say I want a count of each individual Genotype separated by Sample.

    table( mydf$Genotype, mydf$Sample)
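
    To make the shape of that result concrete, here is a small sketch on a made-up stand-in for mydf (the Genotype values are invented). The table has one row per Genotype and one column per Sample, and single cells can be read by name.

```r
# Hypothetical stand-in for mydf; all values are invented for illustration.
mydf <- data.frame(
  Assay    = rep(c("a1", "a2", "a3"), times = 3),
  Genotype = c("AA", "AB", "BB", "AB", "AB", "AA", "BB", "AA", "AB"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# Rows are Genotype values, columns are Sample values
counts <- table(mydf$Genotype, mydf$Sample)
print(counts)

# Look up one cell: how often genotype AB occurs in sam2
counts["AB", "sam2"]
```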
    

    That's much easier than what you'd have to do with the big list. There are lots of functions like that which you'll want to use on your intact data.frame, such as tapply and aggregate. Even if you wanted to do something that seems like it might be easier with the data.frame broken up, like sorting within each Sample level, it's easier with the data.frame.

    mydf[ order(mydf$Sample, mydf$Assay), ]
    

    That will order by Sample and then by Assay nested within Sample.
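
    The tapply and aggregate functions mentioned earlier work on the intact data.frame in the same spirit. The following is a minimal sketch on a made-up stand-in for mydf; the summaries chosen (distinct genotypes per sample, rows per sample) and the names n_genotypes and n_rows are only illustrations.

```r
# Hypothetical stand-in for mydf; all values are invented for illustration.
mydf <- data.frame(
  Assay    = rep(c("a1", "a2", "a3"), times = 3),
  Genotype = c("AA", "AB", "BB", "AB", "AB", "AA", "BB", "AA", "AB"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# tapply: apply a function to Genotype within each level of Sample.
# Here: the number of distinct genotypes seen in each sample.
n_genotypes <- tapply(mydf$Genotype, mydf$Sample,
                      function(g) length(unique(g)))

# aggregate: same grouping idea, but the result is a data.frame.
# Here: the number of Assay rows per sample.
n_rows <- aggregate(mydf["Assay"], by = list(Sample = mydf$Sample),
                    FUN = length)

print(n_genotypes)
print(n_rows)
```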

    When I started with R, I thought that splitting up data.frame objects was the way to go and did it a lot. Since I've learned R better, I never do that. I don't have a single bit of R code, written after my first few weeks with R, that splits up a data.frame into a list. I'm not saying you should never do it. I'm just saying that it's relatively rare that you need it or that it's the best idea. You might want to post a query on here about your end goal and get some advice on that.
