Reshape R data with user entries in rows, collapsing for each user

筅森魡賤 submitted on 2020-01-19 14:12:08

Question


Pardon my new-ness to the R world, thank you kindly in advance for your help.

I would like to analyze the data from an experiment.

The data comes in long format, and it needs to be reshaped into wide format, but I cannot figure out exactly how to do it. Most of the examples for melt/cast and reshape deal with much simpler data frames.

Each time the subject answers a question on the experiment, his userid, location, age, and gender are recorded in a single row, then his response to that item is entered next to those variables. Here's the thing: subjects may answer any number of questions on the experiment, and they may answer different items (it is quite complicated, but it must be this way).

The raw data looks something like this:

User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1 
1, CA, 22, M, B, -1 
1, CA, 22, M, C, -1 
1, CA, 22, M, D, 1 
1, CA, 22, M, E,-1
2, MD, 27, F, A, -1 
2, MD, 27, F, B, 1 
2, MD, 27, F, C, 1 
2, MD, 27, F, E, 1 
2, MD, 27, F, G, -1 
2, MD, 27, F, H, -1 

I would like to reshape this data to have each user be on a single row, to look like this:

User_id, location, age, gender, A, B, C, D, E, F, G, H
1, CA, 22, M, 1, -1, -1, 1, -1, 0, 0, 0
2, MD, 27, F, -1, 1, 1, 0, 1, 0, -1, -1

I think this is just a matter of finding the right reshape call, but I've been at it for a couple of hours and I can't quite get it to look the way I want, since most of the examples do not have the repeated demographic data and thus can be rotated more simply. Very sorry if I have overlooked something simple.


Answer 1:


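Using data.table you can do the following, where df is the data frame listed under "Data" in Answer 3 (add gender to the left-hand side of the formula to keep that column as well):

> library(data.table)
> dt <- as.data.table(df)
> dcast(dt, User_id + location + age ~ Item, value.var = "Resp", fill = 0L)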
   User_id location age  A  B  C  D  E  G  H
1:       1       CA  22  1 -1 -1  1 -1  0  0
2:       2       MD  27 -1  1  1  0  1 -1 -1
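
As a self-contained sketch of the same idea that also keeps gender, the call below adds it to the left-hand side of the formula. The toy data is typed in by hand from the question (without leading spaces), and the names dt and wide are only illustrative.

```r
library(data.table)

# Hand-typed copy of the question's data (an assumption: no leading spaces)
dt <- data.table(
  User_id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22L, 27L), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1)
)

# Every column on the left of ~ identifies a row; Item values become columns.
# Items a user did not answer are filled with 0 instead of NA.
wide <- dcast(dt, User_id + location + age + gender ~ Item,
              value.var = "Resp", fill = 0)
print(wide)
```

Columns that appear on neither side of the formula and are not named in value.var are dropped from the result, which is why gender is missing from the output above.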



Answer 2:


There’s a package called tidyr that makes melting and reshaping data frames much easier. In your case, you can use tidyr::spread straightforwardly:

library(tidyr)
result = spread(df, Item, Resp)

This will however fill missing entries with NA:

  User_id location age gender  A  B  C  D  E  G  H
1       1       CA  22      M  1 -1 -1  1 -1 NA NA
2       2       MD  27      F -1  1  1 NA  1 -1 -1

You can fix this by replacing them:

result[is.na(result)] = 0
result
#   User_id location age gender  A  B  C  D  E  G  H
# 1       1       CA  22      M  1 -1 -1  1 -1  0  0
# 2       2       MD  27      F -1  1  1  0  1 -1 -1

… or by using the fill argument:

result = spread(df, Item, Resp, fill = 0)

For completeness’ sake, the other way round (i.e. going back to the long layout of the original data.frame, with the filled-in zeros appearing as extra rows) works via gather (this is usually known as “melting”):

gather(result, Item, Resp, A : H)

The last argument here tells gather which columns to gather (it supports the concise range syntax).
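
In tidyr 1.0.0 and later, spread and gather are superseded by pivot_wider and pivot_longer. A minimal sketch of the equivalent calls follows, assuming the data has been read without leading spaces; the toy data frame and the names wide and long are illustrative and not part of the original answer.

```r
library(tidyr)

# Hand-typed copy of the question's data (an assumption: no leading spaces)
df <- data.frame(
  User_id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22L, 27L), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# Long -> wide: one column per Item, unanswered items filled with 0
wide <- pivot_wider(df, names_from = Item, values_from = Resp,
                    values_fill = list(Resp = 0))

# Wide -> long: gather columns A through H back into Item/Resp pairs
long <- pivot_longer(wide, cols = A:H, names_to = "Item", values_to = "Resp")
```

Note that long has one row per user and item, including the zero-filled ones, so it is longer than the original df.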




Answer 3:


Here's the always elegant stats::reshape version:

(newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
#   User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
# 1       1       CA  22      M       1      -1      -1       1      -1      NA      NA
# 6       2       MD  27      F      -1       1       1      NA       1      -1      -1

Missing values get filled with NA in reshape(), and the names are not what we want. So we'll need to do a bit more work. Here we can change the names and replace the NAs with zero in the same line to arrive at your desired result.

replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
#   User_id location age gender  A  B  C D  E  G  H
# 1       1       CA  22      M  1 -1 -1 1 -1  0  0
# 6       2       MD  27      F -1  1  1 0  1 -1 -1

Of course, the code would definitely be more legible if we broke this up into two separate lines. Also, note that there is no F in Item in your original data, hence the difference in output from yours.
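
A sketch of that more legible, step-by-step variant is shown here. It assumes the data was read without the leading spaces, so the reshaped names look like "Resp.A" rather than "Resp. A". The all_items vector and the final loop are additions for illustration: they append a zero column for any item nobody answered (such as F), to match the layout requested in the question.

```r
# Hand-typed copy of the question's data (an assumption: no leading spaces)
df <- data.frame(
  User_id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  location = rep(c("CA", "MD"), times = c(5, 6)),
  age      = rep(c(22L, 27L), times = c(5, 6)),
  gender   = rep(c("M", "F"), times = c(5, 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

id_cols <- c("User_id", "location", "age", "gender")
newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = id_cols)

# Step 1: strip the "Resp." prefix that reshape() puts on the new columns
names(newdf) <- sub("^Resp\\.", "", names(newdf))

# Step 2: replace the NAs with zero
newdf[is.na(newdf)] <- 0

# Optional step: add a zero column for every expected item that never occurs
all_items <- LETTERS[1:8]  # hypothetical full item list, A through H
for (item in setdiff(all_items, names(newdf))) {
  newdf[[item]] <- 0
}
newdf <- newdf[c(id_cols, all_items)]
newdf
```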

Data:

df <- structure(list(User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c(" CA", " MD"), class = "factor"), age = c(22L, 
22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L), gender = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" F", " M"
), class = "factor"), Item = structure(c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 5L, 6L, 7L), .Label = c(" A", " B", " C", " D", " E", 
" G", " H"), class = "factor"), Resp = c(1, -1, -1, 1, -1, -1, 
1, 1, 1, -1, -1)), .Names = c("User_id", "location", "age", "gender", 
"Item", "Resp"), class = "data.frame", row.names = c(NA, -11L
))
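
The factor labels in the dput output above carry leading spaces (" CA", " A"), which is where the "Resp. A" style names come from. A minimal sketch of reading the raw text so that those spaces are stripped is given below; the variable name raw is illustrative, and the text is simply the sample from the question.

```r
# Raw sample from the question, pasted as a string
raw <- "User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1
1, CA, 22, M, B, -1
1, CA, 22, M, C, -1
1, CA, 22, M, D, 1
1, CA, 22, M, E,-1
2, MD, 27, F, A, -1
2, MD, 27, F, B, 1
2, MD, 27, F, C, 1
2, MD, 27, F, E, 1
2, MD, 27, F, G, -1
2, MD, 27, F, H, -1"

# strip.white = TRUE removes the blanks around unquoted character fields
df <- read.csv(text = raw, strip.white = TRUE, stringsAsFactors = FALSE)
str(df)
```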


Source: https://stackoverflow.com/questions/32061021/reshape-r-data-with-user-entries-in-rows-collapsing-for-each-user
