Reshape data frame from wide to long with re-occurring column names in R

Submitted by 。_饼干妹妹 on 2019-12-22 09:31:50

Question


I'm trying to convert a data frame from wide to long format using the melt function. The challenge is that several of my columns share the same name. When I use melt, it drops the values from the repeated column. I have read similar questions where the reshape function was recommended, but I was not able to get it to work.

To reproduce my starting data frame:

conversion.id<-c("1", "2", "3")
interaction.num<-c("1","1","1")
interaction.num2<-c("2","2","2")
conversion.id<-as.data.frame(conversion.id)
interaction.num<-as.data.frame(interaction.num)
interaction.num2<-as.data.frame(interaction.num2)
conversion<-c(rep("1",3))
conversion<-as.data.frame(conversion)
df<-cbind(conversion.id,interaction.num, interaction.num2, conversion)
names(df)[3]<-"interaction.num"

The data frame looks like the following:

When I run the following melt function:

library(reshape2)
melt.df<-melt(df,id="conversion.id")

It drops the second interaction.num column (the one holding the value 2) and looks something like this:

The data frame I want is the following:

I saw the following post, but I'm not too familiar with the reshape function and wasn't able to get it to work.

How to reshape a dataframe with "reoccurring" columns?

And to add a layer of complexity, I'm looking for a method that is efficient. I need to perform this on a data frame of around 1M rows with many columns that share the same name.

Any advice would be greatly appreciated!


Answer 1:


Here is a solution using tidyr instead of reshape2. One of the advantages is the gather_ function, which takes character vectors as inputs. So, first we can replace all the "problematic" variable names with unique names (by adding numbers to the end of each name) and then we can gather (the equivalent of melt) these specific variables. The unique names of the variables are stored in a temporary variable called "prob_var_name", which I removed at the end.

library(tidyr)
library(dplyr)
library(magrittr)  # provides equals(), which dplyr does not export

var_name <- "interaction.num"

problem_var <- df %>% 
  names %>% 
  equals(var_name) %>%
  which

replaced_names <- mapply(paste0,names(df)[problem_var],seq_along(problem_var))

names(df)[problem_var]  <- replaced_names

df %>%
  gather_("prob_var_name",var_name,replaced_names) %>%
  select(-prob_var_name)

  conversion.id conversion interaction.num
1             1          1               1
2             2          1               1
3             3          1               1
4             1          1               2
5             2          1               2
6             3          1               2

Thanks to the quoting ability of gather_, you could wrap all this into a function and set var_name to a variable. Then maybe you could use it on all of your duplicated variables?
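
One way to wrap the steps above into a function is sketched here. This is not the answerer's code: the helper name gather_duplicated is hypothetical, and the sketch swaps gather_ (deprecated in recent tidyr releases) for pivot_longer, which assumes tidyr 1.0.0 or later. The sample data frame is rebuilt with check.names = FALSE so the duplicated name survives.

```r
library(tidyr)

# Hypothetical helper: melts every column named `var_name` into one column.
gather_duplicated <- function(df, var_name) {
  idx <- which(names(df) == var_name)            # positions of the duplicates
  new_names <- paste0(var_name, seq_along(idx))  # e.g. interaction.num1, interaction.num2
  names(df)[idx] <- new_names                    # make the names unique first
  long <- pivot_longer(df,
                       cols = all_of(new_names),
                       names_to = "prob_var_name",  # temporary key column
                       values_to = var_name)
  long$prob_var_name <- NULL                     # drop the temporary key
  long
}

# Sample data with a duplicated column name
df <- data.frame(conversion.id   = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num = c("2", "2", "2"),
                 conversion      = c("1", "1", "1"),
                 check.names = FALSE)

result <- gather_duplicated(df, "interaction.num")
print(result)
```

Note that pivot_longer keeps the values of each original row together, so the row order differs from the gather_ output shown above; the contents are the same. To cover several duplicated names, the helper could be called once per name.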




Answer 2:


Here's a solution using data.table. You just have to provide the index instead of names.

require(data.table)
require(reshape2)
ans <- melt(setDT(df), measure=2:3, 
           value.name="interaction.num")[, variable := NULL]

#    conversion.id conversion interaction.num
# 1:             1          1               1
# 2:             2          1               1
# 3:             3          1               1
# 4:             1          1               2
# 5:             2          1               2
# 6:             3          1               2

You can get the indices 2:3 by doing grep("interaction.num", names(df)).
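
A small caveat on the grep call: the pattern is a regular expression, so the "." matches any character and the pattern may match anywhere inside a name. If the real data has look-alike columns (an assumption; the sample data does not), an anchored pattern or an exact comparison is safer. A minimal sketch in base R:

```r
# Sample data with a duplicated column name
df <- data.frame(conversion.id   = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num = c("2", "2", "2"),
                 conversion      = c("1", "1", "1"),
                 check.names = FALSE)

# Anchored regex: the dot is escaped and the whole name must match
idx_regex <- grep("^interaction\\.num$", names(df))

# Exact comparison, no regex involved
idx_exact <- which(names(df) == "interaction.num")

print(idx_regex)
print(idx_exact)
```

Either vector can then be passed as the measure columns in the melt call.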




Answer 3:


Here's an approach in base R that should work for you:

x <- grep("interaction.num", names(df)) ## as suggested by Arun

## Make more friendly names for reshape
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")

## Reshape
reshape(df, direction = "long", 
        idvar=c("conversion.id", "conversion"), 
        varying = x, sep = "_")
#       conversion.id conversion time interaction.num
# 1.1.1             1          1    1               1
# 2.1.1             2          1    1               1
# 3.1.1             3          1    1               1
# 1.1.2             1          1    2               2
# 2.1.2             2          1    2               2
# 3.1.2             3          1    2               2
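
If the extra time column and the compound row names are not wanted, they can be dropped afterwards. The following is a sketch of that cleanup on the sample data; the object name long is just an illustrative choice.

```r
# Sample data with a duplicated column name
df <- data.frame(conversion.id   = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num = c("2", "2", "2"),
                 conversion      = c("1", "1", "1"),
                 check.names = FALSE)

x <- which(names(df) == "interaction.num")

# Give the duplicated columns unique, reshape-friendly names
names(df)[x] <- paste(names(df)[x], seq_along(x), sep = "_")

long <- reshape(df, direction = "long",
                idvar = c("conversion.id", "conversion"),
                varying = x, sep = "_")

long$time <- NULL       # drop the index column added by reshape
rownames(long) <- NULL  # replace compound row names with 1..n
print(long)
```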

Another possibility is stack instead of reshape:

x <- grep("interaction.num", names(df)) ## as suggested by Arun
cbind(df[-x], stack(lapply(df[x], as.character)))

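The lapply(df[x], as.character) may not be necessary, depending on whether your values are actually numeric. With the way you created this sample data, they were factors (that applies to R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; from 4.0.0 on they are character vectors).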
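
Note that stack names its output columns values and ind, so one more step is needed to arrive at the requested layout. A minimal sketch on the sample data, keeping only the values column and renaming it (the object names stacked and long are illustrative):

```r
# Sample data with a duplicated column name
df <- data.frame(conversion.id   = c("1", "2", "3"),
                 interaction.num = c("1", "1", "1"),
                 interaction.num = c("2", "2", "2"),
                 conversion      = c("1", "1", "1"),
                 check.names = FALSE, stringsAsFactors = FALSE)

x <- which(names(df) == "interaction.num")

# Stack the duplicated columns into one long column called "values"
stacked <- stack(lapply(df[x], as.character))

# cbind recycles the 3 id rows to match the 6 stacked rows
long <- cbind(df[-x], stacked["values"])
names(long)[names(long) == "values"] <- "interaction.num"
print(long)
```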



Source: https://stackoverflow.com/questions/23883160/reshape-data-frame-from-wide-to-long-with-re-occuring-column-names-in-r
