This is indeed a duplicate of the question r-split-string-using-tidyrseparate, but I cannot use the MWE from there for my purpose, because I do not know how to adjust the regular expression. I basically want the same thing, but I want to split the variable at the last underscore.
Reason: I have data where some columns show up several times for the same factor/type. I figured I can melt the data, separate the variable name just before the type string, and spread it out again to a wide format with fewer columns. My problem is that my variable names contain varying numbers of underscores, and I would like to learn how to separate at the last underscore, which I added beforehand.
MWE
library(tidyr)
library(data.table)
dt<-data.table(Name=c("A","B","C"),Var_1_EVU=c(2,NA,NA),Var_1_BdS=c(NA,3,4),Var_2_BdS=c(NA,3,4))
dt.long<-melt(dt, id.vars=c("Name"))
dt.long<-separate(dt.long,variable, c("test","type"), sep='/[^_]*$/')
dt.wide<-spread(dt.long,key=Name,value=value)
I would like something like
Name type Var1 Var2
1: A BdS NA NA
2: A EVU 2 NA
3: B BdS 3 3
4: B EVU NA NA
5: C BdS 4 4
6: C EVU NA NA
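A note on the attempt above: R regular expressions are plain strings, so the slashes in '/[^_]*$/' are treated as literal characters rather than as delimiters, and [^_]*$ on its own matches the trailing piece rather than the underscore in front of it. A minimal base-R sketch of both points (the vector x is made up for illustration and only mirrors the column names in the MWE):

```r
# Hypothetical variable names, mirroring the columns in the MWE
x <- c("Var_1_EVU", "Var_1_BdS", "Var_2_BdS")

# The slashes are literal characters in R, so this pattern never matches
grepl("/[^_]*$/", x)
# FALSE FALSE FALSE

# Without the slashes the pattern matches the suffix itself, not the separator,
# so using it as a separator would consume the type string
regmatches(x, regexpr("[^_]*$", x))
# "EVU" "BdS" "BdS"
```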
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
Var_1_EVU = c(2,NA,NA),
Var_1_BdS = c(NA,3,4),
Var_2_BdS = c(NA,3,4))
df %>%
gather("type", "value", -Name) %>%
separate(type, into = c("type", "type_num", "var")) %>%
unite(type, type, type_num, sep = "") %>%
spread(type, value)
# Name var Var1 Var2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
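Note that this relies on the default separator of separate(), which the tidyr documentation gives as the regular expression "[^[:alnum:]]+". The name is therefore split at every underscore, and into has to list exactly as many pieces as the names contain. A base-R sketch of what that default pattern does to names with differing numbers of underscores (the names are made up for illustration):

```r
# Hypothetical names with two and four underscores
x <- c("Var_1_EVU", "Var_x_y_2_BdS")

# Split on every run of non-alphanumeric characters, like separate()'s default
pieces <- strsplit(x, "[^[:alnum:]]+")

# Number of pieces per name: a fixed 'into' of length 3 only fits the first
lengths(pieces)
# 3 5
```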
An example using tidyr::extract to deal with variable names that have an arbitrary number of underscores:
library(dplyr)
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
Var_x_1_EVU = c(2,NA,NA),
Var_x_1_BdS = c(NA,3,4),
Var_x_y_2_BdS = c(NA,3,4))
df %>%
gather("col_name", "value", -Name) %>%
extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)
# Name type Var_x_1 Var_x_y_2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
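The pattern works because the first (.*) is greedy: it takes as much as it can, which leaves only the text after the last underscore for the second group. The same split can be sketched in base R with sub(), reusing the pattern from above (the vector x is made up for illustration):

```r
# Hypothetical column names with varying numbers of underscores
x <- c("Var_x_1_EVU", "Var_x_1_BdS", "Var_x_y_2_BdS")

# Group 1: everything before the last underscore
var <- sub("(.*)_(.*)", "\\1", x)
# Group 2: everything after the last underscore
type <- sub("(.*)_(.*)", "\\2", x)

data.frame(var, type)
#         var type
# 1   Var_x_1  EVU
# 2   Var_x_1  BdS
# 3 Var_x_y_2  BdS
```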
You can avoid a potential problem with duplicate observations by first adding a row number column with mutate(n = row_number()) to make each observation unique, and you can avoid tidyr::extract being masked by magrittr by calling it explicitly as tidyr::extract:
library(dplyr)
library(tidyr)
library(data.table)
library(magrittr)
dt <- data.table(Name = c("A", "A", "B", "C"),
Var_1_EVU = c(1, 2, NA, NA),
Var_1_BdS = c(1, NA, 3, 4),
Var_x_2_BdS = c(1, NA, 3, 4))
dt %>%
mutate(n = row_number()) %>%
gather("col_name", "value", -n, -Name) %>%
tidyr::extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)
# Name n type Var_1 Var_x_2
# 1 A 1 BdS 1 1
# 2 A 1 EVU 1 NA
# 3 A 2 BdS NA NA
# 4 A 2 EVU 2 NA
# 5 B 3 BdS 3 3
# 6 B 3 EVU NA NA
# 7 C 4 BdS 4 4
# 8 C 4 EVU NA NA
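To see why the row number helps: when spreading, each combination of the remaining columns and the key has to identify a single value. A base-R sketch of that check on the long-format keys, with and without n (the small data frame is typed out by hand and only mirrors the two "A" rows of Var_1_EVU in the example above):

```r
# Hand-written long-format rows for the two "A" observations of Var_1_EVU
long <- data.frame(n = c(1, 2), Name = c("A", "A"),
                   var = c("Var_1", "Var_1"), type = c("EVU", "EVU"),
                   value = c(1, 2))

# Without n the key (Name, type, var) repeats, so the rows are ambiguous
anyDuplicated(long[, c("Name", "type", "var")]) > 0
# TRUE

# With n included every key is unique
anyDuplicated(long[, c("n", "Name", "type", "var")]) > 0
# FALSE
```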
Here's an alternative data.table solution using tstrsplit/melt/dcast. I would personally stick with data.table in this case because spread doesn't have a fun argument; hence, if you have dupes when spreading again, you will get an error.
library(magrittr) # people like pipes these days
dt %>%
# convert to long format like you did
melt(., id = "Name") %>%
# split by the last underscore
.[, c("variable", "grp") := tstrsplit(variable, "_(?!.*_)", perl = TRUE)] %>%
# convert back to wide format
dcast(., Name + grp ~ variable)
# Name grp Var_1 Var_2
# 1: A BdS NA NA
# 2: A EVU 2 NA
# 3: B BdS 3 3
# 4: B EVU NA NA
# 5: C BdS 4 4
# 6: C EVU NA NA
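The pattern "_(?!.*_)" matches an underscore that is not followed by another underscore anywhere later in the string, i.e. the last one. The negative lookahead is why perl = TRUE is needed. A base-R sketch of the same split with strsplit() (the names are made up for illustration):

```r
# Hypothetical names with varying numbers of underscores
x <- c("Var_1_EVU", "Var_x_y_2_BdS")

# Split only at the last underscore; the lookahead requires perl = TRUE
parts <- strsplit(x, "_(?!.*_)", perl = TRUE)

# One row per name: the part before and the part after the last underscore
do.call(rbind, parts)
#      [,1]        [,2]
# [1,] "Var_1"     "EVU"
# [2,] "Var_x_y_2" "BdS"
```

For the dupes mentioned above, dcast in data.table accepts a fun.aggregate argument that decides how repeated cells are combined.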
Source: https://stackoverflow.com/questions/49900323/separate-string-after-last-underscore