I'm using the dcast function from the reshape2 package to cast a simple table of three columns

df = data.table(id = 1:1e6,
                var = c('continent', 'subcontinent', ...),
                val = c('America', 'Caribbean', ...))

by

dcast(df, id ~ var, value.var = 'val')
and it automatically converts the value to the count, i.e.
id continent subcontinent
 1         1            1
 2         1            1
However, if I reduce the size to 10000 rows, it correctly outputs
id continent subcontinent
 1   America    Caribbean
 2    Europe  West Europe
Is this a bug, or do I need to change the code somehow? Please help. Thanks!
The problem is not the size of the dataset itself but the occurrence of duplicate entries in the full dataset. When picking smaller subsets of the full dataset, there is a chance that no duplicates are included.
help("dcast", "data.table")
says:
When variable combinations in
formula
doesn't identify a unique value in a cell,fun.aggregate
will have to be specified, which defaults tolength
if unspecified.
How to find duplicates in the full dataset
All occurrences of duplicates can be identified by
cols <- c("id", "var")
df[duplicated(df, by = cols) | duplicated(df, by = cols, fromLast = TRUE)][
  order(id)]

   id          var           val
1:  1 subcontinent     Caribbean
2:  1 subcontinent South America
Note that we are looking for duplicates in id and var as these two form the cells, i.e., rows and columns, of the reshaped result.
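An equivalent check is to count how many values fall into each id/var cell of the future result and keep only the cells with more than one value; a minimal sketch, assuming the same df as in the Data section below:

df[, .N, by = .(id, var)][N > 1L]

   id          var N
1:  1 subcontinent 2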
Why unique() doesn't work
NB: This is why simply taking unique(df) will not work:
unique(df)
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America
does not remove any rows. Consequently,
dcast(unique(df), id ~ var, value.var = "val")
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            2
2:  2         1            1
Whereas
unique(df, by = cols)
   id          var         val
1:  1    continent     America
2:  1 subcontinent   Caribbean
3:  2    continent      Europe
4:  2 subcontinent West Europe
has removed the duplicate var for id == 1L. Consequently,
dcast(unique(df, by = cols), id ~ var, value.var = "val")
   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe
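Note that unique(df, by = cols) keeps the first occurrence of each id/var combination in row order. If the last occurrence were wanted instead, a sketch (again using the sample data) would be:

dcast(unique(df, by = cols, fromLast = TRUE), id ~ var, value.var = "val")

   id continent  subcontinent
1:  1   America South America
2:  2    Europe   West Europe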
How to find the row numbers of duplicated rows
The OP has reported that the issue appears only with the full dataset but not with a subset of the first 1e5 rows.
The row indices of the duplicate entries can be found by
which(duplicated(df, by = cols))
which returns 5 for the sample dataset. For OP's full dataset, I suspect that
min(which(duplicated(df, by = cols))) > 1e5
is true, i.e., there are no duplicates within the first 1e5 rows.
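This can also be checked more directly with anyDuplicated(), which returns 0 when a subset contains no duplicated id/var combinations and the index of the first duplicate otherwise; a sketch on the sample dataset:

anyDuplicated(df[1:4], by = cols)
[1] 0
anyDuplicated(df, by = cols)
[1] 5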
How to create character columns even in case of duplicate entries
OP's own approach of using fun.aggregate = function(x) paste(x[1L]), as well as applying unique() on df, just aims at removing the disturbing duplicates, which are silently dropped.
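For completeness, data.table's first() behaves the same as paste(x[1L]) for this character column: it keeps only the first value per cell and silently drops the rest. A sketch, assuming the sample data:

dcast(df, id ~ var, value.var = "val", fun.aggregate = first)

   id continent subcontinent
1:  1   America    Caribbean
2:  2    Europe  West Europe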
Alternatively, toString() can be used as an aggregation function, which shows the duplicate entries:
dcast(df, id ~ var, toString, value.var = "val")
   id continent             subcontinent
1:  1   America Caribbean, South America
2:  2    Europe              West Europe
Data
library(data.table)
df <- data.table(
id = c(1L, 1L, 2L, 2L, 1L),
var = c("continent", "subcontinent", "continent", "subcontinent", "subcontinent"),
val = c("America", "Caribbean", "Europe", "West Europe", "South America")
)
df
   id          var           val
1:  1    continent       America
2:  1 subcontinent     Caribbean
3:  2    continent        Europe
4:  2 subcontinent   West Europe
5:  1 subcontinent South America
This is likely caused by duplicate entries, so it is possible that your subset inadvertently excludes the duplicated rows.
Output using a DF with duplicate entries
df = data.table(id = c(1,2,3,4,2),
var = c('continent','subcontinent','continent','continent','subcontinent'),
val = c('America','Caribbean','Africa','Europe','Caribbean'))
dcast(df, id ~ var, value.var ='val')
Aggregate function missing, defaulting to 'length'
   id continent subcontinent
1:  1         1            0
2:  2         0            2
3:  3         1            0
4:  4         1            0
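To see which rows are exact duplicates before dropping them, a quick check (a sketch, assuming the df defined just above) is:

df[duplicated(df)]

   id          var       val
1:  2 subcontinent Caribbean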
Try unique()
If the duplicates aren't important, it might be worth trying:
dcast(unique(df), id ~ var, value.var ='val')
   id continent subcontinent
1:  1   America           NA
2:  2        NA    Caribbean
3:  3    Africa           NA
4:  4    Europe           NA
I found a quick fix to this problem by including the aggregate function. It works now, but I'm still not sure why dcast behaves differently for different sizes of data.
dcast(df, id ~ var, value.var = 'val', fun.aggregate = function(x) paste(x[1]))
Source: https://stackoverflow.com/questions/52355601/dcast-fails-to-cast-character-column-when-the-data-size-is-large