Question
Pardon my newness to the R world, and thank you kindly in advance for your help.
I would like to analyze the data from an experiment.
The data comes in long format and needs to be reshaped into wide format, but I cannot figure out exactly how to do it. Most of the examples for melt/cast and reshape deal with much simpler data frames.
Each time a subject answers a question in the experiment, their user ID, location, age, and gender are recorded in a single row, and their response to that question is entered next to those variables. The catch is that subjects may answer any number of questions, and they may answer different items (it is quite complicated, but it must be this way).
The raw data looks something like this:
User_id, location, age, gender, Item, Resp
1, CA, 22, M, A, 1
1, CA, 22, M, B, -1
1, CA, 22, M, C, -1
1, CA, 22, M, D, 1
1, CA, 22, M, E,-1
2, MD, 27, F, A, -1
2, MD, 27, F, B, 1
2, MD, 27, F, C, 1
2, MD, 27, F, E, 1
2, MD, 27, F, G, -1
2, MD, 27, F, H, -1
I would like to reshape this data to have each user be on a single row, to look like this:
User_id, location, age, gender, A, B, C, D, E, F, G, H
1, CA, 22, M, 1, -1, -1, 1, -1, 0, 0, 0
2, MD, 27, F, -1, 1, 1, 1, 0, 1, -1, -1
I think this is just a matter of finding the right reshape call, but I've been at it for a couple of hours and I can't quite get it to look the way I want, since most of the examples do not have the repeated demographic data and thus can be rotated more simply. Very sorry if I have overlooked something simple.
Answer 1:
Using data.table, with the data stored in a data.table called dt, you can do:
library(data.table)
dcast(dt, User_id + location + age ~ Item, value.var = "Resp", fill = 0L)
   User_id location age  A  B  C D  E  G  H
1:       1       CA  22  1 -1 -1 1 -1  0  0
2:       2       MD  27 -1  1  1 0  1 -1 -1
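The call above leaves gender out of the formula, so it does not appear in the result. A minimal self-contained sketch that keeps it might look like the following. The inline dt is a stand-in for the question's data, typed in by hand without leading spaces and with integer responses (both are assumptions, not something the answer states).

```r
library(data.table)

# Stand-in for the question's long-format data (assumption: clean strings,
# integer responses), built inline so the example runs on its own.
dt <- data.table(
  User_id  = c(rep(1L, 5), rep(2L, 6)),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1L, -1L, -1L, 1L, -1L, -1L, 1L, 1L, 1L, -1L, -1L)
)

# Everything left of ~ identifies a row; each distinct Item becomes a column.
# Cells with no matching (user, item) pair are filled with 0.
wide <- dcast(dt, User_id + location + age + gender ~ Item,
              value.var = "Resp", fill = 0L)
print(wide)
```

Listing all four demographic columns on the left-hand side is what keeps them in the output. Since they are constant within a user, they do not create extra rows.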
Answer 2:
There’s a package called tidyr that makes melting and reshaping data frames much easier. In your case, you can use tidyr::spread directly:
library(tidyr)
result = spread(df, Item, Resp)
This will, however, fill missing entries with NA:
  User_id location age gender  A  B  C  D  E  G  H
1       1       CA  22      M  1 -1 -1  1 -1 NA NA
2       2       MD  27      F -1  1  1 NA  1 -1 -1
You can fix this by replacing them:
result[is.na(result)] = 0
result
#   User_id location age gender  A  B  C D  E  G  H
# 1       1       CA  22      M  1 -1 -1 1 -1  0  0
# 2       2       MD  27      F -1  1  1 0  1 -1 -1
… or by using the fill argument:
result = spread(df, Item, Resp, fill = 0)
For completeness’ sake, the other way round (i.e. reproducing the original data.frame) works via gather (this is usually known as “melting”):
gather(result, Item, Resp, A : H)
The last argument here tells gather which columns to gather (and it supports the concise range syntax).
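In tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(). A rough equivalent of the calls above is sketched here. The inline df_long is a hand-typed stand-in for the question's data (an assumption, with clean strings rather than factors), and values_fill is written as a named list so that it also works on tidyr 1.0.0.

```r
library(tidyr)

# Stand-in for the question's long-format data (assumption: plain strings).
df_long <- data.frame(
  User_id  = c(rep(1L, 5), rep(2L, 6)),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# Long -> wide: one column per Item, missing combinations filled with 0.
wide <- pivot_wider(df_long, names_from = Item, values_from = Resp,
                    values_fill = list(Resp = 0))
print(wide)

# Wide -> long again: the filled-in zeros come back as ordinary rows.
long_again <- pivot_longer(wide, cols = A:H,
                           names_to = "Item", values_to = "Resp")
print(nrow(long_again))
```

The round trip does not reproduce the original row count, because every user now has a row for every item, including the ones that were filled with 0.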
Answer 3:
Here's the always elegant stats::reshape version:
(newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
#   User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
# 1       1       CA  22      M       1      -1      -1       1      -1      NA      NA
# 6       2       MD  27      F      -1       1       1      NA       1      -1      -1
Missing values get filled with NA in reshape(), and the names are not what we want, so we need to do a bit more work. Here we can change the names and replace the NAs with zero in the same line to arrive at your desired result.
replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
#   User_id location age gender  A  B  C D  E  G  H
# 1       1       CA  22      M  1 -1 -1 1 -1  0  0
# 6       2       MD  27      F -1  1  1 0  1 -1 -1
Of course, the code would be more legible if we broke this up into two separate lines. Also, note that there is no F in Item in your original data, hence the difference in output from yours.
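As a sketch of what that two-step version might look like, here is the same cleanup split into a renaming step and an NA-replacement step. The inline df_long is a hand-typed stand-in for the data (an assumption, using plain strings without the leading spaces found in the structure() dump below), and the regular expression is written to strip the "Resp." prefix whether or not a space follows it.

```r
# Stand-in for the question's long-format data (assumption: plain strings).
df_long <- data.frame(
  User_id  = c(rep(1L, 5), rep(2L, 6)),
  location = c(rep("CA", 5), rep("MD", 6)),
  age      = c(rep(22L, 5), rep(27L, 6)),
  gender   = c(rep("M", 5), rep("F", 6)),
  Item     = c("A", "B", "C", "D", "E", "A", "B", "C", "E", "G", "H"),
  Resp     = c(1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1),
  stringsAsFactors = FALSE
)

# Long -> wide; new columns are named Resp.<Item>, missing cells are NA.
newdf <- reshape(df_long, direction = "wide", timevar = "Item",
                 idvar = c("User_id", "location", "age", "gender"))

# Step 1: drop the "Resp." prefix (and any whitespace right after it).
names(newdf) <- sub("^Resp\\.\\s*", "", names(newdf))

# Step 2: replace the NAs left by reshape() with 0.
newdf[is.na(newdf)] <- 0

print(newdf)
```

Anchoring the pattern on the literal "Resp." prefix is a design choice made here so that the demographic column names cannot be altered by accident.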
Data:
df <- structure(list(User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c(" CA", " MD"), class = "factor"), age = c(22L,
22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L), gender = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" F", " M"
), class = "factor"), Item = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 5L, 6L, 7L), .Label = c(" A", " B", " C", " D", " E",
" G", " H"), class = "factor"), Resp = c(1, -1, -1, 1, -1, -1,
1, 1, 1, -1, -1)), .Names = c("User_id", "location", "age", "gender",
"Item", "Resp"), class = "data.frame", row.names = c(NA, -11L
))
Source: https://stackoverflow.com/questions/32061021/reshape-r-data-with-user-entries-in-rows-collapsing-for-each-user