dcast fails to cast character column when the data size is large

假如想象 提交于 2019-12-06 10:31:43

The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. By picking smaller subsets from the full dataset there is a chance that no duplicates are included.

help("dcast", "data.table") says:

When variable combinations in formula doesn't identify a unique value in a cell, fun.aggregate will have to be specified, which defaults to length if unspecified.

How to find duplicates in the full dataset

All occurrences of duplicates can be identified by

cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
  order(id)]
   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America

Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.

Why unique() doesn't work

NB: This is the explanation why simply taking unique(df) will not work:

unique(df)
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America

does not remove any rows. Consequently,

dcast(unique(df), id ~ var, value.var = "val")
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1

Whereas

unique(df, by = cols)
   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe

has removed the duplicate var for id == 1L. Consequently,

dcast(unique(df, by = cols), id ~ var, value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe

How to find the row numbers of duplicated rows

The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.

The row indices of the duplicate entries can be found by

which(duplicated(df, by = cols))

which returns 5 for the sample dataset. For OP's full dataset, I suspect that

min(which(duplicated(df, by = cols))) > 1e5

is true, i.e., there are no duplicates within the first 1e5 rows.

How to create character columns even in case of duplicate entries

OP's own approach using fun.aggregate = function(x) paste(x[1L]) as well as applying unique() on df just aim at removing any disturbing duplicates. The duplicates will be silently dropped.

Alternatively, toString() can be used as aggregation function which shows the duplicate entries:

dcast(df, id ~ var, toString, value.var = "val")
   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe

Data

library(data.table)
df <- data.table(
  id  = c(1L, 1L, 2L, 2L, 1L),
  var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
  val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)

df
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America

This is likely caused by duplicate entries.

So it is possible in your subset you inadvertently remove duplicate row entries.

Output using a DF with duplicate entries

df = data.table(id  = c(1,2,3,4,2), 
                var = c('continent','subcontinent','continent','continent','subcontinent'), 
                val = c('America','Caribbean','Africa','Europe','Caribbean'))

dcast(df, id ~ var, value.var ='val')

Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            0
2:  2         0            2
3:  3         1            0
4:  4         1            0

Try unique()

If the duplicates aren't of importance might be worth:

dcast(unique(df), id ~ var, value.var ='val')

   id continent subcontinent
1:  1   America           NA
2:  2        NA    Caribbean
3:  3    Africa           NA
4:  4    Europe           NA 

I found a quick fix to this problem by including the aggregate function. It works now but I'm still not sure why dcast behave different for different size of data.

dcast(df, id ~ var, value.var ='val', fun.aggregate =function(x) paste(x[1])

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!