问题
Imagine I have a dataframe with column names such as Mary1, Mary2, Mary3, Bob1, Bob2, Bob3, Pam1, Pam2, Pam3, and so on, but with many more columns.
Let's put a simpler reproducible example.
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3))
that gives:
id A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3 E1 E2 E3 F1 F2 F3 G1 G2 G3 H1 H2 H3 I1 I2 I3 J1 J2 J3 ...
1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340 ...
2 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407 ...
3 0.913 0.294 0.459 0.332 0.651 0.258 0.479 0.766 0.084 0.875 0.339 0.839 0.347 0.334 0.476 0.892 0.864 0.390 0.777 0.961 0.435 0.713 0.400 0.325 0.757 0.203 0.711 0.122 0.245 0.143 ...
I want to get a long table format, like this:
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,1,round(runif(10),3)))
mydata <- rbind(mydata,c(1,2,round(runif(10),3)))
mydata <- rbind(mydata,c(1,3,round(runif(10),3)))
mydata <- rbind(mydata,c(2,1,round(runif(10),3)))
mydata <- rbind(mydata,c(2,2,round(runif(10),3)))
mydata <- rbind(mydata,c(2,3,round(runif(10),3)))
colnames(mydata) <- c("id","N", LETTERS[1:10])
that's:
id N A B C D E F G H I J
1 1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062
1 2 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
1 3 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
2 1 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340
2 2 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693
2 3 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407
How can I get it with data.table or dplyr/tidyr? or any other simple option.
If I try
melt(mydata, id=1)
the result is a single column.
I've been checking the official help and the vignettes but I can only find much simpler examples, with a small number of columns, the user specifies every one by hand, and a single example of pattern() but I can't adapt it to my example.
Other threads use gsub, but it's confusing for me.
What I really want to do is a little bit more complicated, but I think this is the first step (I'll later cast it again). Imagine my columns are Mary1, Mary2, Bob1, Bob2, Pam1, Pam2... I want to create new columns with the differences of every two aforementioned: Mary1-Mary2, Bob1-Bob2, Pam1-Pam2...
Summarizing: I don't want to write the name of all columns by hand but select them automatically removing the last digit.
PD: OK, I upgrade my question. It must work not only for names such as A1, A2... but also for longer names, such as
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), rep(LETTERS[1:10], each=3), 1:3))
I don't mind the speed, I'm lookking for something simple, not cryptic.
回答1:
Using data.table::melt
:
require(data.table)
n = unique(gsub("[0-9]+$", "", names(mydata)[-1L]))
p = paste0("^", n)
melt(setDT(mydata), measure=patterns(p), value.name=n, variable.name="N")
# id N A B C D E F G H I J
# 1: 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2: 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3: 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 4: 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 5: 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 6: 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 7: 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 8: 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 9: 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
回答2:
set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3))
reshape(mydata, dir = 'long', varying = names(mydata)[-1], sep = '', timevar = 'N')
# id N A B C D E F G H I J
# 1.1 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2.1 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3.1 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 1.2 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 2.2 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 3.2 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 1.3 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 2.3 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 3.3 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
回答3:
Here is one solution with tidyr
:
library(tidyr)
mydata %>%
gather(key, value, -id) %>%
separate(key, into = c('key1', 'key2'),
sep = '(?<=[a-zA-Z])(?=[0-9])') %>%
spread(key1, value)
Resulting output:
id key2 A B C D E F G H I J
1 1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
2 1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
3 1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
4 2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
5 2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
6 2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
7 3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
8 3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
9 3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
来源:https://stackoverflow.com/questions/37011952/r-data-table-or-dplyr-long-format-splitting-colnames