dcast changes content of dataframe

Submitted by 江枫思渺然 on 2019-12-11 13:57:58

Question


I tried to reshape a data frame with the reshape package, but the reshaped result contains numbers that differ from the original values, which should not happen.

The data frame contains several variables that were measured repeatedly: each person has 6 rows, one per measurement occasion. I want to reshape it so that there is only one row per person, with every variable appearing 6 times (once per measurement). This should be easy with the following code:

melteddata <- melt(daten, id=(c("IDParticipant", "looporder")))

datenrestrukturiert <- dcast(melteddata, IDParticipant~looporder+variable)

Here "daten" is the original data frame and "looporder" is the variable that records the time of measurement (1-6). Here is an example (unfortunately I could not figure out how to post tables):

https://www.dropbox.com/s/8c9dm4rttedbzw1/daten.jpg?dl=0

or maybe this is fine:

structure(list(IDParticipant = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L), looporder = c(1L, 2L, 3L, 5L, 6L, 2L, 3L, 
5L, 6L, 1L, 2L, 3L), pc_mean_1 = c(NA, 3.22222222222222, NA, 
3.22222222222222, 3.22222222222222, 3.66666666666667, 3.66666666666667, 
3.66666666666667, 3.66666666666667, 3.25, NA, 3.25), bd_mean_1 = c(NA, 
2.88888888888889, NA, 2.88888888888889, 2.88888888888889, 2.75, 
2.75, 2.75, 2.75, 4.08333333333333, NA, 4.08333333333333), sm = c(999, 
4, 999, 3.66666666666667, 1, 4, 4, 5, 5, 5, 999, 5), cm = c(999, 
1.33333333333333, 999, 2.33333333333333, 1, 2, 2, 2.33333333333333, 
1, 3, 999, 1.66666666666667)), .Names = c("IDParticipant", "looporder", 
"pc_mean_1", "bd_mean_1", "sm", "cm"), row.names = c(NA, 12L), class = "data.frame")

datenrestrukturiert looks like this:

https://www.dropbox.com/s/al93lnj76y1j266/datenrestrukturiert.jpg?dl=0

I do not want to aggregate anything, which is why I tried adding fun.aggregate = NULL, but that changed nothing. I also always get the following message:

"Aggregation function missing: defaulting to length"

Apart from that everything works, but there is one problem: when I use dcast (or cast), the values of some variables are changed, mostly to "0" or "1", where there should be numbers like "3.44" or "4.77".

Does anybody have a hint as to why this happens?

Some more information that might help: when I import the dataset via read.csv2, the first variable always gets a strange name, with extra symbols in front of the name that Excel does not show: "ï..IDParticipant". I rename it to "IDParticipant". Could that have anything to do with it?

Another side note: with the sample data frame I provided, everything works fine. The original data frame has 1404 rows and 353 variables; could it be too big for R?


Answer 1:


If you have duplicated combinations of your LHS and RHS variables, then you either need to (1) create a secondary level of IDs, or (2) perform some form of aggregation.

You can test for duplicates by using any(duplicated(...)).
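
To see why duplicates turn real values into 0s and 1s, here is a minimal sketch. The data frame "toy" and its values are made up for illustration and are not taken from the original data. When an id combination occurs more than once and no fun.aggregate is supplied, dcast falls back to length, so every cell holds a row count instead of a measurement.

```r
library(reshape2)

# Hypothetical toy data: participant 1 has two rows for looporder 1
# and no row at all for looporder 2.
toy <- data.frame(
  IDParticipant = c(1L, 1L, 2L, 2L),
  looporder     = c(1L, 1L, 1L, 2L),
  sm            = c(3.44, 4.77, 2.50, 3.10)
)

molten <- melt(toy, id.vars = c("IDParticipant", "looporder"))

# The id combination (1, 1) is not unique, so dcast prints
# "Aggregation function missing: defaulting to length"
# and fills each cell with the number of rows behind it.
counts <- dcast(molten, IDParticipant ~ looporder + variable)
counts
# Participant 1: 1_sm is 2 (two rows counted), 2_sm is 0 (no rows).
# Participant 2: 1_sm is 1 and 2_sm is 1.
```

A single duplicated combination anywhere in the data is enough to switch the whole table to counts, which would explain why a small duplicate-free sample works while the full data frame does not.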

Here's an example, using your existing sample of "daten" (which does not contain duplicates):

library(reshape2)

idvars <- c("IDParticipant", "looporder")
any(duplicated(daten[idvars]))
# [1] FALSE

melteddata <- melt(daten, id=idvars)
datenrestrukturiert <- dcast(melteddata, IDParticipant ~ looporder + variable)
datenrestrukturiert
#   IDParticipant 1_pc_mean_1 1_bd_mean_1 1_sm 1_cm 2_pc_mean_1 2_bd_mean_1 2_sm       2_cm 3_pc_mean_1
# 1             1          NA          NA  999  999    3.222222    2.888889    4   1.333333          NA
# 2             2          NA          NA   NA   NA    3.666667    2.750000    4   2.000000    3.666667
# 3             3        3.25    4.083333    5    3          NA          NA  999 999.000000    3.250000
#   3_bd_mean_1 3_sm       3_cm 5_pc_mean_1 5_bd_mean_1     5_sm     5_cm 6_pc_mean_1 6_bd_mean_1 6_sm
# 1          NA  999 999.000000    3.222222    2.888889 3.666667 2.333333    3.222222    2.888889    1
# 2    2.750000    4   2.000000    3.666667    2.750000 5.000000 2.333333    3.666667    2.750000    5
# 3    4.083333    5   1.666667          NA          NA       NA       NA          NA          NA   NA
#   6_cm
# 1    1
# 2    1
# 3   NA

However, since any(duplicated(...)) is giving you TRUE, you are likely to have something more similar to:

daten2 <- rbind(daten, daten[c(1, 4, 6), ])
any(duplicated(daten2[idvars]))
# [1] TRUE

In this case, you can consider using getanID from my "splitstackshape" package to conveniently add a secondary "id" to your dataset.

library(splitstackshape)

melteddata2 <- melt(getanID(daten2, idvars), c(".id", idvars))

datenrestrukturiert2 <- dcast.data.table(
  melteddata2, .id + IDParticipant ~ looporder + variable)

datenrestrukturiert2
#    .id IDParticipant 1_pc_mean_1 1_bd_mean_1 1_sm 1_cm 2_pc_mean_1 2_bd_mean_1 2_sm
# 1:   1             1          NA          NA  999  999    3.222222    2.888889    4
# 2:   1             2          NA          NA   NA   NA    3.666667    2.750000    4
# 3:   1             3        3.25    4.083333    5    3          NA          NA  999
# 4:   2             1          NA          NA  999  999          NA          NA   NA
# 5:   2             2          NA          NA   NA   NA    3.666667    2.750000    4
#          2_cm 3_pc_mean_1 3_bd_mean_1 3_sm       3_cm 5_pc_mean_1 5_bd_mean_1     5_sm
# 1:   1.333333          NA          NA  999 999.000000    3.222222    2.888889 3.666667
# 2:   2.000000    3.666667    2.750000    4   2.000000    3.666667    2.750000 5.000000
# 3: 999.000000    3.250000    4.083333    5   1.666667          NA          NA       NA
# 4:         NA          NA          NA   NA         NA    3.222222    2.888889 3.666667
# 5:   2.000000          NA          NA   NA         NA          NA          NA       NA
#        5_cm 6_pc_mean_1 6_bd_mean_1 6_sm 6_cm
# 1: 2.333333    3.222222    2.888889    1    1
# 2: 2.333333    3.666667    2.750000    5    1
# 3:       NA          NA          NA   NA   NA
# 4: 2.333333          NA          NA   NA   NA
# 5:       NA          NA          NA   NA   NA
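
Before choosing between the two options, it can help to look at the offending rows. The sketch below uses a made-up data frame "toy" (an assumption for illustration, not the original data) to list every row involved in a duplicate and then shows option (2), collapsing the duplicates with an explicit aggregation function. The mean is only one possible choice.

```r
library(reshape2)

# Hypothetical toy data: participant 1 was recorded twice at looporder 1.
toy <- data.frame(
  IDParticipant = c(1L, 1L, 2L),
  looporder     = c(1L, 1L, 1L),
  sm            = c(3.44, 4.77, 2.50)
)
idvars <- c("IDParticipant", "looporder")

# Flag every row whose id combination occurs more than once
# (scanning from both ends marks the first occurrence as well).
dup <- duplicated(toy[idvars]) | duplicated(toy[idvars], fromLast = TRUE)
dup_rows <- toy[dup, ]
dup_rows

# Option (2): collapse the duplicates explicitly, here with the mean.
molten <- melt(toy, id.vars = idvars)
averaged <- dcast(molten, IDParticipant ~ looporder + variable,
                  fun.aggregate = mean)
averaged
```

Aggregating discards the individual values, so this only fits when the duplicated rows really are repeated measurements of the same thing. If the data contain NA, additional arguments such as na.rm = TRUE can be passed through dcast to the aggregation function.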



Answer 2:


Here is my solution, based on Ananda's suggestions (thank you very much for that).

The data frame is "daten", which contains many variables, e.g. "IDParticipant", "looporder" and "sm".

First, we create an object holding the id variables for later use in the melt and cast functions:

idvars <- c("IDParticipant", "looporder")

As it turns out, the data frame contained duplicates, i.e. rows with the same values in both "IDParticipant" and "looporder". We therefore need to add another id variable to the data frame when melting it, which can be done with "getanID" from the splitstackshape package:

melteddata <- melt(getanID(daten, idvars), c(".id", idvars))

With the extra id variable in place, we can finally cast the data frame, using the extra id variable together with the other variables:

datenrestrukturiert <- dcast(melteddata, .id + IDParticipant ~ variable + looporder)
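
If installing splitstackshape is not an option, a similar secondary id can be built in base R. This is only a sketch of the idea, using a made-up data frame "toy"; it is not necessarily how getanID is implemented. It numbers the rows within each combination of the id variables, so the first occurrence gets 1, the second gets 2, and so on.

```r
# Hypothetical toy data with one repeated (IDParticipant, looporder) pair.
toy <- data.frame(
  IDParticipant = c(1L, 1L, 1L, 2L),
  looporder     = c(1L, 2L, 1L, 1L),
  sm            = c(3.44, 4.00, 4.77, 2.50)
)

# Running counter within each (IDParticipant, looporder) group.
toy$.id <- ave(seq_len(nrow(toy)), toy$IDParticipant, toy$looporder,
               FUN = seq_along)
toy$.id
# [1] 1 1 2 1

# With .id included, every id combination is unique again.
any(duplicated(toy[c(".id", "IDParticipant", "looporder")]))
# [1] FALSE
```

The resulting data frame can then be melted with ".id", "IDParticipant" and "looporder" as id variables and cast with ".id + IDParticipant" on the left-hand side, as in the code above.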



Source: https://stackoverflow.com/questions/32244915/dcast-changes-content-of-dataframe
