R dates as column names containing duplicate values (need to retain original date)

Question

I have a dataset I'm trying to tidy up. I read the file in with read.xlsx. The header contains date values, and I need to keep those values, even when they are duplicated, when I gather/spread the data.

The dataset looks like the one below. The dates from Excel are read in as numbers (which is fine); the issue is that there can be duplicate dates (e.g. 43693), and I need to keep their original values.

      Date        43693 43686 43686 43714 43693
1     Contract    111   222   333   444   555
2     Org1        NR    NB    NR    NB     P
3     Org2         P     P     P    NB    NR
4     Org3        NB    NB    NB    NB     P

When I try to transform the data, I get a duplicate-names error.

Ultimately I'm trying to get my data shaped like this, where the Date column retains any duplicated values (e.g. 43693):

    Date        Contract              ORG     status
 1 43693            111              Org1     NR
 2 43693            555              Org1     P
 3 43686            222              Org2     P

Here is an example df to test on:

df <- structure(
  list(
    Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
    "12/16/18" = c("111", "pending", "complete", "complete", "pending"),
    "12/16/18" = c("222", "pending", "complete", "pending", "complete"),
    "1/18/18" = c("222", "pending", "complete", "pending", "complete")
  ),
  class = "data.frame",
  .Names = c("Date", "12/16/18", "12/16/18", "1/18/18"),
  row.names = c(NA, -5L)
)

Answer 1:

You have two header rows, which is pretty messy. I'd recommend re-reading the data, skipping the date line, then incorporating the date line as part of the column names.
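A rough sketch of that re-reading idea, not tested against the real workbook. It assumes the sheet was read without a header row (with openxlsx that would be read.xlsx(path, colNames = FALSE); the xlsx package uses header = FALSE instead), so both header lines arrive as ordinary rows. The raw data frame and its X1..X6 names are made up here to mirror the sample in the question.

```r
library(tidyr)

# Hypothetical stand-in for a sheet read with no header row; every cell is text
raw <- data.frame(
  X1 = c("Date", "Contract", "Org1", "Org2", "Org3"),
  X2 = c("43693", "111", "NR", "P", "NB"),
  X3 = c("43686", "222", "NB", "P", "NB"),
  X4 = c("43686", "333", "NR", "P", "NB"),
  X5 = c("43714", "444", "NB", "NB", "NB"),
  X6 = c("43693", "555", "P", "NR", "P"),
  stringsAsFactors = FALSE
)

# Pull the two header rows out as plain character vectors
dates     <- unlist(raw[1, -1], use.names = FALSE)
contracts <- unlist(raw[2, -1], use.names = FALSE)

# Keep only the Org rows and give them unique names such as "43693_111"
body <- raw[-(1:2), ]
names(body) <- c("Org", paste(dates, contracts, sep = "_"))

# Reshape to long form, then split the combined name back into two columns
long <- pivot_longer(body, -Org, names_to = "key", values_to = "status")
long <- separate(long, key, into = c("Date", "Contract"), sep = "_")
```

Because the date is paired with the contract number before it becomes a column name, the names are unique and a repeated date such as 43693 survives the reshape unchanged.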

If you already have the data read in, you can try something like this:

library(data.table)
df2 <- setDT(df[-1, ])
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))
# Current data
df2
#     Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1      pending      pending     pending
# 2: Org2     complete     complete    complete
# 3: Org3     complete      pending     pending
# 4: Org4      pending     complete    complete

# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
#      Org    value     Date Contract
#  1: Org1  pending 12/16/18      111
#  2: Org2 complete 12/16/18      111
#  3: Org3 complete 12/16/18      111
#  4: Org4  pending 12/16/18      111
#  5: Org1  pending 12/16/18      222
#  6: Org2 complete 12/16/18      222
#  7: Org3  pending 12/16/18      222
#  8: Org4 complete 12/16/18      222
#  9: Org1  pending  1/18/18      222
# 10: Org2 complete  1/18/18      222
# 11: Org3  pending  1/18/18      222
# 12: Org4 complete  1/18/18      222

If you do want to stick with dplyr and tidyr, here's a translation of the above:

library(dplyr)
library(tidyr)
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>% 
  slice(-1) %>% 
  pivot_longer(-Org) %>% 
  separate(name, into = c("Date", "Contract"), sep = "_")

Note that you have to rename the dataset before you start chaining the other commands together.
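As a side note, if the dates come through as Excel serial numbers (as in the question's 43693), the Date column can be converted once the data is in long form. A small sketch, assuming the workbook uses Excel's 1900 date system on Windows; the long data frame below is a made-up stand-in for the reshaped result:

```r
# Hypothetical reshaped data with Excel serial numbers stored as text
long <- data.frame(
  Date = c("43693", "43686", "43693"),
  Contract = c("111", "222", "555"),
  stringsAsFactors = FALSE
)

# Excel's 1900 date system on Windows counts days from 1899-12-30,
# so 43693 becomes 2019-08-16
long$Date <- as.Date(as.numeric(long$Date), origin = "1899-12-30")
```

Duplicated dates stay duplicated, since the conversion is applied row by row after the reshape.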

Answer 2:

Indeed, having duplicate column names is a very bad idea, and dates as column headers are problematic as well. If you have the opportunity to change the original data to avoid these issues, please do so.

Here is another approach: read the data with the duplicate names, store those column names as an extra row, transpose the data frame so that the stored names become a Date column, and finally use tidyr's pivot_longer to create a long data frame. Not an elegant solution...

library(dplyr)
library(tidyr)

# create the data
df <- data.frame(
  Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
  '12/16/18' = c("111", "pending", "complete", "complete", "pending"),
  '12/16/18' = c("222", "pending", "complete", "pending", "complete"),
  '1/18/18' = c("333", "pending", "complete", "pending", "complete"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

header <- colnames(df)                          # store the original (duplicated) column names
colnames(df) <- paste0("V", seq_len(ncol(df)))  # give the columns unique placeholder names
df[nrow(df) + 1, ] <- header                    # append the original column names as a last row

df2 <- as.data.frame(t(df), stringsAsFactors = FALSE)  # transpose and convert back to a data frame
names(df2) <- unlist(df2[1, ], use.names = FALSE)      # use the first row (Contract, Org1, ..., Date) as names
df2 <- df2[-1, ]                                       # drop that first row

df3 <- df2 %>% # pivot the df to long shape
  pivot_longer(cols = contains("Org"),
              names_to = "ORG",
              values_to = "Status")

With this output:

> df3
# A tibble: 12 x 4
   Contract Date     ORG   Status  
 * <chr>    <chr>    <chr> <chr>   
 1 111      12/16/18 Org1  pending 
 2 111      12/16/18 Org2  complete
 3 111      12/16/18 Org3  complete
 4 111      12/16/18 Org4  pending 
 5 222      12/16/18 Org1  pending 
 6 222      12/16/18 Org2  complete
 7 222      12/16/18 Org3  pending 
 8 222      12/16/18 Org4  complete
 9 333      1/18/18  Org1  pending 
10 333      1/18/18  Org2  complete
11 333      1/18/18  Org3  pending 
12 333      1/18/18  Org4  complete

Source: https://stackoverflow.com/questions/62567018/r-dates-as-column-names-containing-duplicate-values-need-to-retain-original-dat

Tags

date

duplicates

tidyverse

reshape