Question
I have a dataset I'm trying to tidy up. I read the file in with read.xlsx. The header contains date values, and I need those values to be retained, even when they are duplicated, when I gather/spread the data.
The data set looks like the one below. The dates from Excel are read in as numbers (which is fine); the issue is that there can be duplicate dates (e.g. 43693), and I need to keep their original values.
      Date 43693 43686 43686 43714 43693
1 Contract   111   222   333   444   555
2     Org1    NR    NB    NR    NB     P
3     Org2     P     P     P    NB    NR
4     Org3    NB    NB    NB    NB     P
When I try to transform the data, I get an error about duplicate names.
Ultimately I'm trying to get my data shaped like this, where the date value is retained even when it is duplicated (e.g. 43693):
   Date Contract  ORG status
1 43693      111 Org1     NR
2 43693      555 Org1      P
3 43686      111 Org2      P
Here is an example df to test on:
df <- structure(
  list(
    Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
    '12/16/18' = c("111", "pending", "complete", "complete", "pending"),
    '12/16/18' = c("222", "pending", "complete", "pending", "complete"),
    '1/18/18' = c("222", "pending", "complete", "pending", "complete")
  ),
  class = "data.frame",
  .Names = c("Date", "12/16/18", "12/16/18", "1/18/18"),
  row.names = c(NA, -5L)
)
Answer 1:
You have two header rows, which is pretty messy. I'd recommend re-reading the data, skipping the date line, then incorporating the date line as part of the column names.
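As a rough sketch of that re-reading idea, suppose the sheet is imported without treating any row as a header (for example with `openxlsx::read.xlsx(path, colNames = FALSE)`; which argument applies depends on which package your `read.xlsx` comes from). The `raw` data frame below is a hand-built, hypothetical stand-in for such an import, filled with the values from the question's first table. The two header rows are then pasted together into unique column names:

```r
# Hypothetical stand-in for a header-less import of the sheet;
# the values are copied from the first table in the question.
raw <- data.frame(
  X1 = c("Date", "Contract", "Org1", "Org2", "Org3"),
  X2 = c("43693", "111", "NR", "P", "NB"),
  X3 = c("43686", "222", "NB", "P", "NB"),
  X4 = c("43686", "333", "NR", "P", "NB"),
  X5 = c("43714", "444", "NB", "NB", "NB"),
  X6 = c("43693", "555", "P", "NR", "P"),
  stringsAsFactors = FALSE
)

dates     <- unlist(raw[1, -1], use.names = FALSE)  # first header row
contracts <- unlist(raw[2, -1], use.names = FALSE)  # second header row

body <- raw[-(1:2), ]                               # keep the data rows only
names(body) <- c("Org", paste(dates, contracts, sep = "_"))
rownames(body) <- NULL

names(body)
```

Because each contract number is distinct, the combined names stay unique even where a date repeats, and the date can be split back out after reshaping.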
If you already have the data read in, you can try something like this:
library(data.table)
df2 <- setDT(df[-1, ])
setnames(df2, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]))
# Current data
df2
#     Org 12/16/18_111 12/16/18_222 1/18/18_222
# 1: Org1      pending      pending     pending
# 2: Org2     complete     complete    complete
# 3: Org3     complete      pending     pending
# 4: Org4      pending     complete    complete
# melt and split
melt(df2, id.vars="Org")[, c("Date", "Contract") := tstrsplit(variable, "_")][, variable := NULL][]
#      Org    value     Date Contract
#  1: Org1  pending 12/16/18      111
#  2: Org2 complete 12/16/18      111
#  3: Org3 complete 12/16/18      111
#  4: Org4  pending 12/16/18      111
#  5: Org1  pending 12/16/18      222
#  6: Org2 complete 12/16/18      222
#  7: Org3  pending 12/16/18      222
#  8: Org4 complete 12/16/18      222
#  9: Org1  pending  1/18/18      222
# 10: Org2 complete  1/18/18      222
# 11: Org3  pending  1/18/18      222
# 12: Org4 complete  1/18/18      222
If you do want to stick with dplyr and tidyr, here's a translation of the above:
library(dplyr)
library(tidyr)
setNames(df, c("Org", paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1])) %>%
  slice(-1) %>%
  pivot_longer(-Org) %>%
  separate(name, into = c("Date", "Contract"), sep = "_")
Note that you have to rename the dataset before you start chaining the other commands together.
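One way to see what the renaming step buys you is to compare the names before and after it. This is a small base R check, not part of the original answer; it rebuilds the question's example data with `check.names = FALSE` so the duplicated header survives, and `new_names` is just an illustrative variable name:

```r
# Same example data as in the question, keeping the duplicated header.
df <- data.frame(
  Date       = c("Contract", "Org1", "Org2", "Org3", "Org4"),
  `12/16/18` = c("111", "pending", "complete", "complete", "pending"),
  `12/16/18` = c("222", "pending", "complete", "pending", "complete"),
  `1/18/18`  = c("222", "pending", "complete", "pending", "complete"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# New names: each header date pasted to the contract number from row 1.
new_names <- c(
  "Org",
  paste(names(df), unlist(df[1, ], use.names = FALSE), sep = "_")[-1]
)

anyDuplicated(names(df)) > 0   # TRUE: "12/16/18" appears twice
anyDuplicated(new_names) > 0   # FALSE: every date/contract pair is distinct
```

Once the names are unique, the date is still recoverable from each name, which is what the `separate()` call relies on.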
Answer 2:
Indeed, having duplicate column names is a very bad idea. Dates as column headers feel problematic as well. If you have the opportunity to change the original data to avoid these issues, please do so.
Here is another approach: read the data with the duplicate names, save those column names in a row, transpose the data frame, and then convert the previously saved row into a column in the new data frame. Finally, use tidyr's pivot_longer to create a long data frame. Not an elegant solution...
library(dplyr)
library(tidyr)
# create the data
df <- data.frame(
  Date = c("Contract", "Org1", "Org2", "Org3", "Org4"),
  '12/16/18' = c("111", "pending", "complete", "complete", "pending"),
  '12/16/18' = c("222", "pending", "complete", "pending", "complete"),
  '1/18/18' = c("333", "pending", "complete", "pending", "complete"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)
header <- colnames(df) # store column names
colnames(df) <- paste0("V", 1:ncol(df)) #rename columns with unique names
df[nrow(df) + 1, ] <- header # add original columns names as a row in df
df2 <- as.data.frame(t(df), stringsAsFactors = FALSE) # transpose and convert to df
names(df2) <- t(df2[1, ]) # rename the columns of the new df
df2 <- df2[-1, ] # remove first row
df3 <- df2 %>% # pivot the df to long shape
  pivot_longer(cols = contains("Org"),
               names_to = "ORG",
               values_to = "Status")
With this output:
> df3
# A tibble: 12 x 4
   Contract Date     ORG   Status
 * <chr>    <chr>    <chr> <chr>
 1 111      12/16/18 Org1  pending
 2 111      12/16/18 Org2  complete
 3 111      12/16/18 Org3  complete
 4 111      12/16/18 Org4  pending
 5 222      12/16/18 Org1  pending
 6 222      12/16/18 Org2  complete
 7 222      12/16/18 Org3  pending
 8 222      12/16/18 Org4  complete
 9 333      1/18/18  Org1  pending
10 333      1/18/18  Org2  complete
11 333      1/18/18  Org3  pending
12 333      1/18/18  Org4  complete
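If the recovered Date column should end up as a real date rather than text, a conversion along these lines may help. This is an optional extra, not part of the original answer. The first call matches the month/day/two-digit-year strings in the example data; the second covers the Excel serial numbers from the original sheet and assumes the workbook uses the default Windows Excel date system, for which the origin "1899-12-30" is the usual choice:

```r
# Text dates as in the example data: month/day/two-digit year.
d_text <- as.Date("12/16/18", format = "%m/%d/%y")

# Excel serial numbers arrive as character column names, so convert to
# numeric first. The origin is an assumption (1900 date system).
d_serial <- as.Date(as.numeric("43693"), origin = "1899-12-30")

format(d_text)    # "2018-12-16"
format(d_serial)  # "2019-08-16"
```

Doing this after the reshape keeps duplicated dates intact, since by then each date is a value in a column rather than a column name.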
Source: https://stackoverflow.com/questions/62567018/r-dates-as-column-names-containing-duplicate-values-need-to-retain-original-dat