How to expand a large dataframe in R

落爺英雄遲暮 提交于 2020-01-19 15:05:48

问题


I have a dataframe

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4), 
  date = c("1985-06-19", "1985-06-19", "1985-06-19", "1985-08-01", 
           "1985-08-01", "1990-06-19", "1990-06-19", "1990-06-19", 
           "1990-06-19", "2000-05-12"), 
  spp = c("a", "b", "c", "c", "d", "b", "c", "d", "a", "b"),
  y = rpois(10, 5))

   id       date spp y
1   1 1985-06-19   a 6
2   1 1985-06-19   b 3
3   1 1985-06-19   c 7
4   2 1985-08-01   c 7
5   2 1985-08-01   d 6
6   3 1990-06-19   b 5
7   3 1990-06-19   c 4
8   3 1990-06-19   d 4
9   3 1990-06-19   a 6
10  4 2000-05-12   b 6

I want to expand it so that there is every combination of id and spp and have y = 0 for every combination that is not currently in the dataframe. The dataframe is currently about 100,000 rows and 15 columns. When expanded it would be about 300,000 columns (there are 17 unique values of spp in my actual dataset).

For every value of id the date is the same (e.g. when id = 2, date always = 1985-08-01). In my real dataset all the columns except spp and y can be specified by the id.

I want to end up with something like:

   id       date spp y
   1 1985-06-19   a 6
   1 1985-06-19   b 3
   1 1985-06-19   c 7
   1 1985-06-19   d 0*
   2 1985-08-01   a 0*
   2 1985-08-01   b 0*
   2 1985-08-01   c 7
   2 1985-08-01   d 6
   3 1990-06-19   b 5
   3 1990-06-19   c 4
   3 1990-06-19   d 4
   3 1990-06-19   a 6
   4 2000-05-12   a 0*
   4 2000-05-12   b 6
   4 2000-05-12   c 0*
   4 2000-05-12   d 0*
  • Indicate added rows

I will likely have to do this in the future with potentially much larger data frames, so a quick, efficient (time and memory) way to do this would be appreciated but any solution would satisfy me. I figure there should be ways to use the dplyr, data.table, or reshape packages but I'm not very familiar with any of them. I'm not sure if it would be easiest to expand just rows id, spp, and y, then do a left_join() or merge() to recombine date (and all the other variables in my real dataframe) based on id?


回答1:


expand.grid is a useful function here,

mergedData <- merge(
    expand.grid(id = unique(df$id), spp = unique(df$spp)),
    df, by = c("id", "spp"), all =T)

mergedData[is.na(mergedData$y), ]$y <- 0

mergedData$date <- rep(levels(df$date),
                       each = length(levels(df$spp)))

Since you're not actually doing anything to subsets of the data I don't think plyr will help, maybe more efficient ways with data.table.




回答2:


I would go the second way, hope this helps

x<-unique(df$id)
y<-unique(df$spp)
newdf<-data.frame(x=rep(x,each=length(y)),y=rep(y, length(x)))
merged<-merge(newdf, df, by.x=c(x,y), by.y=c("id","spp"), all=T)



回答3:


There is a new function complete in the development version of tidyr that does this. Of course complete uses expand.grid internally.

# get new version of tidyr
devtools::install_github("hadley/tidyr")
# load package
require(tidyr)
# calculations
complete(df, c(id, date), spp, fill = list(y = 0))
##    id       date spp y
## 1   1 1985-06-19   a 5
## 2   1 1985-06-19   b 3
## 3   1 1985-06-19   c 5
## 4   1 1985-06-19   d 0
## 5   2 1985-08-01   a 0
## 6   2 1985-08-01   b 0
## 7   2 1985-08-01   c 4
## 8   2 1985-08-01   d 9
## 9   3 1990-06-19   a 8
## 10  3 1990-06-19   b 3
## 11  3 1990-06-19   c 5
## 12  3 1990-06-19   d 6
## 13  4 2000-05-12   a 0
## 14  4 2000-05-12   b 3
## 15  4 2000-05-12   c 0
## 16  4 2000-05-12   d 0


来源:https://stackoverflow.com/questions/22059308/how-to-expand-a-large-dataframe-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!