Question
I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that several columns share the same name. When I use melt, it drops the values from the repeated column. I have read similar questions where the advice was to use the reshape function, but I was not able to get it to work.
To reproduce my starting data frame:
conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"
The data frame looks like the following:
  conversion.id interaction.num interaction.num conversion
1             1               1               2          1
2             2               1               2          1
3             3               1               2          1
When I run the following melt function:
melt.df<-melt(df,id="conversion.id")
It drops the interaction.num == 2 column and looks something like this:
The data frame I want is the following:
I saw the post "How to reshape a dataframe with "reoccurring" columns?", but I'm not very familiar with the reshape function and wasn't able to get it to work.
To add a layer of complexity, I'm looking for a method that is efficient. I need to run this on a data frame of around 1M rows with many columns that share the same name.
Any advice would be greatly appreciated!
Answer 1:
Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by appending a number to each name), and then we can gather (the equivalent of melt) these specific variables. The unique names end up in a temporary key column called "prob_var_name", which I remove at the end.
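library(tidyr)
library(dplyr)
library(magrittr)  # provides equals(); dplyr only re-exports %>%
var_name <- "interaction.num"
# Positions of the columns that share the duplicated name
problem_var <- df %>%
  names %>%
  equals(var_name) %>%
  which
# Give each duplicate a unique name: interaction.num1, interaction.num2, ...
replaced_names <- mapply(paste0, names(df)[problem_var], seq_along(problem_var))
names(df)[problem_var] <- replaced_names
# Gather the renamed columns, then drop the temporary key column
df %>%
  gather_("prob_var_name", var_name, replaced_names) %>%
  select(-prob_var_name)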
conversion.id conversion interaction.num
1 1 1 1
2 2 1 1
3 3 1 1
4 1 1 2
5 2 1 2
6 3 1 2
Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
Answer 2:
Here's a solution using data.table. You just have to provide the column indices instead of the names.
require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3,
value.name="interaction.num")[, variable := NULL]
# conversion.id conversion interaction.num
# 1: 1 1 1
# 2: 2 1 1
# 3: 3 1 1
# 4: 1 1 2
# 5: 2 1 2
# 6: 3 1 2
You can get the indices 2:3 by doing grep("interaction.num", names(df)).
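One caveat worth sketching: grep treats its pattern as a regular expression, so the "." matches any character and the pattern also matches longer names that merely contain it. Assuming only exact name matches are wanted, an anchored pattern or a plain == comparison is stricter. The column "interaction.number" below is invented purely to show the difference.

```r
# Column names as in the question, plus one hypothetical look-alike column
nms <- c("conversion.id", "interaction.num", "interaction.num",
         "interaction.number", "conversion")

loose    <- grep("interaction.num", nms)        # partial regex match: 2 3 4
anchored <- grep("^interaction\\.num$", nms)    # anchored, dot escaped: 2 3
exact    <- which(nms == "interaction.num")     # exact comparison: 2 3
```

Either of the last two results can be passed as the measure argument in the melt call above.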
Answer 3:
Here's an approach in base R that should work for you:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")
## Reshape
reshape(df, direction = "long",
idvar=c("conversion.id", "conversion"),
varying = x, sep = "_")
# conversion.id conversion time interaction.num
# 1.1.1 1 1 1 1
# 2.1.1 2 1 1 1
# 3.1.1 3 1 1 1
# 1.1.2 1 1 2 2
# 2.1.2 2 1 2 2
# 3.1.2 3 1 2 2
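If the result should have the same three columns as the other answers, the time column and the compound row names can be dropped afterwards. A minimal sketch, assuming the sample data from the question rebuilt with character columns (the construction of df here is a shortened stand-in for the question's code):

```r
# Sample data mirroring the question: columns 2 and 3 share a name
df <- data.frame(conversion.id = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num2 = c("2", "2", "2"),
                 conversion = "1",
                 stringsAsFactors = FALSE)
names(df)[3] <- "interaction.num"

x <- which(names(df) == "interaction.num")
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")

long <- reshape(df, direction = "long",
                idvar = c("conversion.id", "conversion"),
                varying = x, sep = "_")

long$time <- NULL        # drop the index of the originating column
rownames(long) <- NULL   # replace "1.1.1"-style row names with 1..n
```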
Another possibility is stack instead of reshape:
x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))
The lapply(df[x], as.character) may not be necessary, depending on whether your values are actually numeric or not. The way you created this sample data, they were factors.
Source: https://stackoverflow.com/questions/23883160/reshape-data-frame-from-wide-to-long-with-re-occuring-column-names-in-r
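To make the type point concrete: stack only works on columns for which is.vector is TRUE, and factors do not qualify, which is why the conversion matters for factor data. The sketch below assumes R 4.0 or later, where data.frame keeps strings as character by default, so as.character is a harmless no-op here. stack names its output columns values and ind; only values is kept, under the original column name.

```r
# Sample data mirroring the question: columns 2 and 3 share a name
df <- data.frame(conversion.id = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num2 = c("2", "2", "2"),
                 conversion = "1",
                 stringsAsFactors = FALSE)
names(df)[3] <- "interaction.num"

x <- which(names(df) == "interaction.num")

# Stack the duplicated columns into one long vector (column "values")
stacked <- stack(lapply(df[x], as.character))

# The remaining columns are recycled to match the stacked length
long <- cbind(df[-x], interaction.num = stacked$values)
```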