Intradataframe Analysis--creating a derivative data frame from another data frame

Submitted by 蹲街弑〆低调 on 2019-12-13 19:17:35

Question


The question title may be a little obtuse since I'm still getting up to speed with R. I'm doing some data frame manipulation to compute percentages within classification groups: one column, a factor, defines the groups, and a second column holds the values I want percentages of. I'll use the built-in mtcars data set to demonstrate what I'm trying to achieve, where gear plays the role of the classification variable and cyl is the data I'm trying to get percentages from.

Just some background details to smooth the question:

The gear column takes 3 distinct values: 3, 4, and 5. The cyl column takes 3 distinct values as well: 4, 6, and 8.

The first element of my list gives, for each gear type, the fraction of models with at most 4 cylinders. For 3-gear models there is only one, the Toyota Corona, out of a total of 15 3-gear models, and thus the fraction should be 1/15 = 0.0667. For 4-gear models there are eight out of a total of 12 4-gear models, which yields 8/12 = 0.667.
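As a quick check of that arithmetic, here is a minimal sketch for the single cutoff of 4 cylinders. The names share_le4 and counts are only illustrative, and mean() of a logical vector is used as a shorthand for the proportion of TRUE values.

```r
# Share of models in each gear group with at most 4 cylinders.
# mean() of a logical vector is the proportion of TRUE values.
share_le4 <- tapply(mtcars$cyl, mtcars$gear, function(x) mean(x <= 4))

# Raw counts behind those shares: rows are gear, columns are cyl.
counts <- table(gear = mtcars$gear, cyl = mtcars$cyl)
```

Reading counts["3", "4"] against the row total for gear 3 should reproduce the 1/15 figure.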

Here's the method I wrote to do this computation. However, the structure of the output is not what I want. I'd like to merge it all into a data frame whose first column holds the distinct cyl values and whose other columns correspond to the gear types 3, 4, and 5, with each row holding the percentages for one cyl value. I'm very close but need some help reshaping the list I currently get, or perhaps using an alternative apply function that produces the table of percentages directly, or any other magic someone can cook up.

> lapply( unique( sort( mtcars$cyl ) ) , function(c) { tapply( mtcars$cyl , mtcars$gear , function(x) sum( x <= c ) / length(x) ) } )
[[1]]
         3          4          5 
0.06666667 0.66666667 0.40000000 

[[2]]
  3   4   5 
0.2 1.0 0.6 

[[3]]
3 4 5 
1 1 1 

This is what I'd expect the desired data frame to look like:

  cyl         X3        X4  X5
1   4 0.06666667 0.6666667 0.4
2   6 0.20000000 1.0000000 0.6
3   8 1.00000000 1.0000000 1.0

Answer 1:


I came up with a solution after googling "convert list of arrays into data.frame", which immediately led me to a related SO post.

p <- lapply( unique( sort( mtcars$cyl ) ) , function(c) { tapply( mtcars$cyl , mtcars$gear , function(x) sum( x <= c ) / length(x) ) } )

> df <- data.frame( matrix( unlist(p) , nrow = length(p) , byrow = TRUE ) )
> df
          X1        X2  X3
1 0.06666667 0.6666667 0.4
2 0.20000000 1.0000000 0.6
3 1.00000000 1.0000000 1.0

The solution works, apart from dropping the classification names as column headers, but a follow-up assignment recovers them:

> colnames(df) <- names(p[[1]])
> rownames(df) <- unique( sort( mtcars$cyl ) )
> df
           3         4   5
4 0.06666667 0.6666667 0.4
6 0.20000000 1.0000000 0.6
8 1.00000000 1.0000000 1.0

Other answers to the linked question handle the column headers nicely, but the row-label problem remains, since those values are lost inside my anonymous function calls.
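One way to keep the cyl values attached is to carry them as an ordinary column rather than as row names. The following is a minimal sketch under that assumption, not a definitive solution: the names cyl_levels, m, and pct_df are illustrative, sapply() stands in for lapply() so the result comes back as a matrix, and mean(x <= cutoff) replaces sum(...) / length(...).

```r
# Distinct cylinder counts, used both as cutoffs and as the first column.
cyl_levels <- sort(unique(mtcars$cyl))

# sapply() simplifies to a matrix: one row per gear group, one column per cutoff.
m <- sapply(cyl_levels, function(cutoff) {
  tapply(mtcars$cyl, mtcars$gear, function(x) mean(x <= cutoff))
})

# Transpose so each cutoff is a row, then store the cutoff in its own column.
# data.frame() turns the gear labels "3", "4", "5" into syntactic column names.
pct_df <- data.frame(cyl = cyl_levels, t(m))
pct_df
```

Because the gear labels pass through data.frame() with its default name checking, the columns should come out as cyl, X3, X4, and X5, matching the layout asked for in the question.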



Source: https://stackoverflow.com/questions/26534438/intradataframe-analysis-creating-a-derivative-data-frame-from-another-data-fram
