R, data.table or dplyr, long format splitting colnames

Submitted by 佐手 on 2019-12-24 18:05:11

Question


Imagine I have a dataframe with column names such as Mary1, Mary2, Mary3, Bob1, Bob2, Bob3, Pam1, Pam2, Pam3, and so on, but with many more columns.

Here is a simpler reproducible example.

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3))   

that gives:

id    A1    A2    A3    B1    B2    B3    C1    C2    C3    D1    D2    D3    E1    E2    E3    F1    F2    F3    G1    G2    G3    H1    H2    H3    I1    I2    I3    J1    J2    J3  ...
1  0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062 0.206 0.177 0.687 0.384 0.770 0.498 0.718  0.992 0.380 0.777 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340  ...
2  0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411 0.821 0.647 0.783 0.553 0.530 0.789 0.023  0.477 0.732 0.693 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407  ...
3  0.913 0.294 0.459 0.332 0.651 0.258 0.479 0.766 0.084 0.875 0.339 0.839 0.347 0.334 0.476 0.892 0.864  0.390 0.777 0.961 0.435 0.713 0.400 0.325 0.757 0.203 0.711 0.122 0.245 0.143  ...

I want to get a long table format, like this:

set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,1,round(runif(10),3)))
mydata <- rbind(mydata,c(1,2,round(runif(10),3)))
mydata <- rbind(mydata,c(1,3,round(runif(10),3)))
mydata <- rbind(mydata,c(2,1,round(runif(10),3)))
mydata <- rbind(mydata,c(2,2,round(runif(10),3)))
mydata <- rbind(mydata,c(2,3,round(runif(10),3)))
colnames(mydata) <- c("id","N", LETTERS[1:10])

that's:

 id  N     A     B     C     D     E     F     G     H     I     J
  1  1 0.266 0.372 0.573 0.908 0.202 0.898 0.945 0.661 0.629 0.062
  1  2 0.206 0.177 0.687 0.384 0.770 0.498 0.718 0.992 0.380 0.777
  1  3 0.482 0.600 0.494 0.186 0.827 0.668 0.794 0.108 0.724 0.411
  2  1 0.935 0.212 0.652 0.126 0.267 0.386 0.013 0.382 0.870 0.340
  2  2 0.821 0.647 0.783 0.553 0.530 0.789 0.023 0.477 0.732 0.693
  2  3 0.478 0.861 0.438 0.245 0.071 0.099 0.316 0.519 0.662 0.407

How can I get this with data.table or dplyr/tidyr, or with any other simple option?

If I try

melt(mydata, id=1)

all the values end up in a single column.

I've been checking the official help and the vignettes, but I can only find much simpler examples with a small number of columns, where the user specifies every column by hand. There is a single example of patterns(), but I can't adapt it to my case.

Other threads use gsub, but I find it confusing.

What I really want to do is a bit more complicated, but I think this is the first step (I'll cast the result back to wide format later). Suppose my columns are Mary1, Mary2, Bob1, Bob2, Pam1, Pam2... I want to create new columns holding the difference within each pair: Mary1-Mary2, Bob1-Bob2, Pam1-Pam2...

In short: I don't want to type every column name by hand; I want to select the columns automatically by stripping the trailing digit.

PS: I'm extending my question. It must work not only for names such as A1, A2... but also for longer names, such as

colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), rep(LETTERS[1:10], each=3), 1:3)) 

I don't mind the speed; I'm looking for something simple, not cryptic.


Answer 1:


Using data.table::melt:

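require(data.table)
# Unique column prefixes: drop the trailing digits from every name except "id"
n = unique(gsub("[0-9]+$", "", names(mydata)[-1L]))
# One regular expression per prefix, anchored at the start of the name
p = paste0("^", n)

# Each pattern selects one group of columns; each group becomes one value column
melt(setDT(mydata), measure=patterns(p), value.name=n, variable.name="N")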
#    id N     A     B     C     D     E     F     G     H     I     J
# 1:  1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2:  2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3:  3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 4:  1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 5:  2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 6:  3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 7:  1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 8:  2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 9:  3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
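
For readers who, like the asker, find the gsub step opaque, here is a minimal base R sketch of what `n` and `p` hold in the answer above. The column names and the helper `strip_suffix` are made up for illustration, and the stricter pattern `p_strict` is a suggested variant, not part of the original answer.

```r
# Column names in the two styles mentioned in the question
short_names <- c("id", paste0(rep(LETTERS[1:3], each = 3), 1:3))            # A1, A2, A3, B1, ...
long_names  <- c("id", paste0(rep(c("Mary", "Bob", "Pam"), each = 2), 1:2)) # Mary1, Mary2, Bob1, ...

# Hypothetical helper: remove the trailing digits, then keep each prefix once
strip_suffix <- function(x) unique(gsub("[0-9]+$", "", x))

n_short <- strip_suffix(short_names[-1L])  # "A" "B" "C"
n_long  <- strip_suffix(long_names[-1L])   # "Mary" "Bob" "Pam"

# The answer builds one pattern per prefix, anchored at the start only
p <- paste0("^", n_long)

# Assumed variant: also anchor the digits at the end of the name
p_strict <- paste0("^", n_long, "[0-9]+$")

# Columns selected by the first strict pattern
mary_cols <- grep(p_strict[1], long_names, value = TRUE)  # "Mary1" "Mary2"
```

If one prefix is itself the start of another (say `A` and `AB`), a pattern such as `^A` would select both groups, so anchoring the digits as in `p_strict` may be the safer choice to pass to `patterns()`.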



Answer 2:


set.seed(1)
mydata <- data.frame()
mydata <- rbind(mydata,c(1,round(runif(30),3)))
mydata <- rbind(mydata,c(2,round(runif(30),3)))
mydata <- rbind(mydata,c(3,round(runif(30),3)))
colnames(mydata) <- c("id", paste0(rep(LETTERS[1:10], each=3), 1:3)) 

# sep = '' lets reshape() split each name into a letter part and a number part itself
reshape(mydata, direction = 'long', varying = names(mydata)[-1], sep = '', timevar = 'N')

#     id N     A     B     C     D     E     F     G     H     I     J
# 1.1  1 1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
# 2.1  2 1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
# 3.1  3 1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
# 1.2  1 2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
# 2.2  2 2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
# 3.2  3 2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
# 1.3  1 3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
# 2.3  2 3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
# 3.3  3 3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
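
The question's end goal, one difference column per name (Mary1-Mary2, Bob1-Bob2, ...), can probably also be reached in base R without reshaping at all. The sketch below assumes every prefix has columns ending in 1 and 2; the data values and the `_diff` naming are invented for illustration.

```r
# Small wide table in the Mary1/Mary2 style from the question (values are made up)
wide <- data.frame(id = 1:3,
                   Mary1 = c(5, 7, 9),    Mary2 = c(1, 2, 3),
                   Bob1  = c(10, 20, 30), Bob2  = c(4, 5, 6))

# Unique prefixes, found by stripping the trailing digits from every name except "id"
prefixes <- unique(sub("[0-9]+$", "", names(wide)[-1]))

# For every prefix, subtract column <prefix>2 from column <prefix>1
diffs <- sapply(prefixes, function(p) wide[[paste0(p, 1)]] - wide[[paste0(p, 2)]])
colnames(diffs) <- paste0(prefixes, "_diff")  # hypothetical naming scheme

# Attach the differences to the id column
result <- cbind(wide["id"], diffs)
result
```

This avoids the round trip through long format, at the cost of hard-coding which two suffixes are compared.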



Answer 3:


Here is one solution with tidyr:

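library(tidyr)
mydata %>%
  # stack every column except id into key/value pairs
  gather(key, value, -id) %>%
  # split each name between its last letter and its first digit, e.g. "A1" -> "A" and "1"
  separate(key, into = c('key1', 'key2'),
           sep = '(?<=[a-zA-Z])(?=[0-9])') %>%
  # spread the letter part back out into one column per name
  spread(key1, value)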

Resulting output:

  id key2     A     B     C     D     E     F     G     H     I     J
1  1    1 0.266 0.908 0.945 0.062 0.687 0.498 0.380 0.212 0.267 0.382
2  1    2 0.372 0.202 0.661 0.206 0.384 0.718 0.777 0.652 0.386 0.870
3  1    3 0.573 0.898 0.629 0.177 0.770 0.992 0.935 0.126 0.013 0.340
4  2    1 0.482 0.186 0.794 0.411 0.783 0.789 0.732 0.861 0.071 0.519
5  2    2 0.600 0.827 0.108 0.821 0.553 0.023 0.693 0.438 0.099 0.662
6  2    3 0.494 0.668 0.724 0.647 0.530 0.477 0.478 0.245 0.316 0.407
7  3    1 0.913 0.332 0.479 0.875 0.347 0.892 0.777 0.713 0.757 0.122
8  3    2 0.294 0.651 0.766 0.339 0.334 0.864 0.961 0.400 0.203 0.245
9  3    3 0.459 0.258 0.084 0.839 0.476 0.390 0.435 0.325 0.711 0.143
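
With tidyr 1.0.0 or later, the same reshape can likely be written in one step with `pivot_longer()` and its `.value` sentinel. This is a sketch, not part of the original answer: the small data frame is made up, and the regular expression assumes every name is letters followed by digits.

```r
library(tidyr)

# Made-up wide table using the longer AA1/AA2 naming style from the question
wide <- data.frame(id = 1:2,
                   AA1 = c(0.1, 0.2), AA2 = c(0.3, 0.4),
                   BB1 = c(0.5, 0.6), BB2 = c(0.7, 0.8))

long <- pivot_longer(
  wide,
  cols = -id,
  # ".value" means: the first captured group names the output columns,
  # while the second captured group is stored in the new column N
  names_to = c(".value", "N"),
  names_pattern = "^([A-Za-z]+)([0-9]+)$"
)
long
```

Here `N` comes back as character; `as.integer(long$N)` converts it if a numeric index is needed.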


Source: https://stackoverflow.com/questions/37011952/r-data-table-or-dplyr-long-format-splitting-colnames
