R, data.table or dplyr, long format splitting colnames

Question

Imagine I have a dataframe with column names such as Mary1, Mary2, Mary3, Bob1, Bob2, Bob3, Pam1, Pam2, Pam3, and so on, but with many more columns.

Here is a simpler reproducible example.

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3))

that gives:

id    A1    A2    A3    B1    B2    B3    C1    C2    C3    D1    D2    D3    E1    E2    E3    F1    F2    F3    G1    G2    G3    H1    H2    H3    I1    I2    I3    J1    J2    J3  ...
1  0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718  0.992 0.380 0.777 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340  ...
2  0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411 0.821 0.647 0.783 0.553 0.530 0.789 0.023  0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407  ...
3  0.913 0.294 0.459 0.332 0.651 0.258 0.479 0.766 0.084 0.875 0.339 0.839 0.347 0.334 0.476 0.892 0.864  0.390 0.777 0.961 0.435 0.713 0.400 0.325 0.757 0.203 0.711 0.122 0.245 0.143  ...

I want to get a long table format with a layout like this (the values below only illustrate the shape):

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,1,round(runif(10),3)))
mydata <- rbind(mydata,c(1,2,round(runif(10),3)))
mydata <- rbind(mydata,c(1,3,round(runif(10),3)))
mydata <- rbind(mydata,c(2,1,round(runif(10),3)))
mydata <- rbind(mydata,c(2,2,round(runif(10),3)))
mydata <- rbind(mydata,c(2,3,round(runif(10),3)))
colnames(mydata) <- c("id","N", LETTERS[1:10])

that's:

 id  N     A     B     C     D     E     F     G     H     I     J
  1  1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062
  1  2 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
  1  3 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
  2  1 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340
  2  2 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693
  2  3 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407

How can I get this with data.table or dplyr/tidyr, or with any other simple option?

If I try

melt(mydata, id=1)

the result has all the values in a single column.

I've been checking the official help and the vignettes, but I can only find much simpler examples with a small number of columns, where the user specifies each one by hand. There is a single example using patterns(), but I can't adapt it to my case.

Other threads use gsub, but I find that confusing.

What I really want to do is slightly more complicated, but I think this is the first step (I'll cast it again later). Suppose my columns are Mary1, Mary2, Bob1, Bob2, Pam1, Pam2... I want to create new columns holding the difference within each pair: Mary1-Mary2, Bob1-Bob2, Pam1-Pam2...

In short: I don't want to type every column name by hand; I want to select the columns automatically by stripping the trailing digit.

PS: I'm updating my question. It must work not only for names such as A1, A2... but also for longer names, such as

colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), rep(LETTERS[1:10], each=3), 1:3))

I don't mind about speed; I'm looking for something simple, not cryptic.

Answer 1:

Using data.table::melt:

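library(data.table)
# Strip the trailing digits to get the distinct prefixes: "A", "B", ..., "J"
n = unique(gsub("[0-9]+$", "", names(mydata)[-1L]))
# One regular expression per prefix, anchored at the start of the column name
p = paste0("^", n)

# Each pattern selects one group of columns, which becomes one value column
melt(setDT(mydata), measure=patterns(p), value.name=n, variable.name="N")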
#    id N     A     B     C     D     E     F     G     H     I     J
# 1:  1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2:  2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3:  3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 4:  1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 5:  2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 6:  3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 7:  1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 8:  2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 9:  3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
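
As a possible follow-up for the pairwise differences mentioned in the question, here is a minimal sketch. The stand-in table `wide` and the names `prefixes`, `pats`, `long` and `diffs` are illustrative and not part of the original answer. The patterns are additionally anchored with `[0-9]+$`, on the assumption that one prefix might be the start of another (for example "A" and "AA").

```r
library(data.table)

# Hypothetical stand-in for the question's data: two prefixes, two suffixes each.
wide <- data.table(id = 1:3,
                   Mary1 = c(5, 6, 7),    Mary2 = c(1, 2, 3),
                   Bob1  = c(10, 20, 30), Bob2  = c(4, 5, 6))

# Derive the prefixes by stripping the trailing digits from the column names.
prefixes <- unique(sub("[0-9]+$", "", names(wide)[-1L]))

# One fully anchored pattern per prefix, so "A" could not also match "AA1".
pats <- paste0("^", prefixes, "[0-9]+$")

# Long format: one row per (id, N), one value column per prefix.
long <- melt(wide, id.vars = "id", measure.vars = patterns(pats),
             value.name = prefixes, variable.name = "N")

# Within each id, subtract the suffix-2 value from the suffix-1 value.
diffs <- long[, lapply(.SD, function(x) x[N == "1"] - x[N == "2"]),
              by = id, .SDcols = prefixes]
```

With this stand-in data, `diffs` has one row per id; for id 1 it holds Mary = 4 and Bob = 6.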

Answer 2:

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3)) 

# sep = '' splits each name just before the first digit that follows a letter
reshape(mydata, direction = 'long', varying = names(mydata)[-1], sep = '', timevar = 'N')

#     id N     A     B     C     D     E     F     G     H     I     J
# 1.1  1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2.1  2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3.1  3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 1.2  1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 2.2  2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 3.2  3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 1.3  1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 2.3  2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 3.3  3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
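
For the longer names mentioned in the update to the question, the same call should carry over unchanged, because with `sep = ""` the split happens just before the first numeral that follows a letter. The following is only a sketch; the small data frame `wide` with multi-letter prefixes is made up for illustration.

```r
# Hypothetical stand-in with multi-letter prefixes.
wide <- data.frame(id = 1:2,
                   Mary1 = c(5, 6),   Mary2 = c(1, 2),
                   Bob1  = c(10, 20), Bob2  = c(4, 5))

# "Mary1" is split into the value name "Mary" and the time 1, and so on.
long <- reshape(wide, direction = "long", varying = names(wide)[-1],
                sep = "", timevar = "N", idvar = "id")
```

The result has the columns id, N, Mary and Bob, with the rows for N = 1 followed by the rows for N = 2.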

Answer 3:

Here is one solution with tidyr:

library(tidyr)
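mydata %>%
  # Stack every column except id into key/value pairs
  gather(key, value, -id) %>%
  # Split each key at the boundary between the letters and the digits
  separate(key, into = c('key1', 'key2'),
           sep = '(?<=[a-zA-Z])(?=[0-9])') %>%
  # Spread the letter prefixes back out into one column each
  spread(key1, value)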

Resulting output:

  id key2     A     B     C     D     E     F     G     H     I     J
1  1    1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
2  1    2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
3  1    3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
4  2    1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
5  2    2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
6  2    3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
7  3    1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
8  3    2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
9  3    3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
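
With more recent versions of tidyr (1.0.0 or later is assumed here), the same reshaping can probably be written in one step with `pivot_longer()` and the special `".value"` name. This is a sketch rather than part of the original answer; the data frame `wide` and the column name `N` are illustrative.

```r
library(tidyr)

# Hypothetical stand-in for the question's data.
wide <- data.frame(id = 1:2,
                   Mary1 = c(5, 6),   Mary2 = c(1, 2),
                   Bob1  = c(10, 20), Bob2  = c(4, 5))

long <- pivot_longer(
  wide,
  cols = -id,
  # ".value" turns the first capture group into output column names;
  # the second capture group (the digits) goes into the column N.
  names_to = c(".value", "N"),
  names_pattern = "^([A-Za-z]+)([0-9]+)$"
)
```

Here N comes out as a character column; `as.integer(long$N)` converts it if numbers are needed.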

Source: https://stackoverflow.com/questions/37011952/r-data-table-or-dplyr-long-format-splitting-colnames

Tags

reshape