How to best reshape a data set in R that has a two-row header?

不打扰是莪最后的温柔 submitted on 2019-12-24 16:42:51

Question


The data set I'm working with is in Excel. It shows sales of products in both unit and revenue terms for the first 26 weeks of availability.

Each row of data represents a product. Let's say there are 50 of them.
The 2nd header row could basically be reconstructed with rep(c("Units","Revenue"), 26). Above each of those ("Units","Revenue") pairs in the 1st header row is a merged pair of cells running through the sequence "Week 1", "Week 2", ..., "Week 26".

I basically want to convert the dataset from 50 rows to 50*26 = 1300 rows with 4 columns (Product, Week, Units, Sales).

I've seen how to handle two row headers and how to reshape data with the melt function, but I'm not sure I've seen anything that indicates a best practice for combining the two, especially in cases like this where both header rows contain key information needed to reshape the data.


Answer 1:


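It is somewhat ambiguous what sort of CSV file results from merged cells, but assuming each merged "Week" cell is exported as one filled cell followed by an empty one (so the first row holds half as many labels as the second), you would first need to read in the first two lines, for example with readLines followed by strsplit on "," (readLines itself has no sep argument), and store the first header row as the character vector row1, then: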

gsub( " ", "", paste( rep( row1[row1 > ""], each=2), c("Units","Revenue"), sep="_") )

To any red-hot moderator: yes, I know code-only answers are deprecated, but I think they should be acceptable for answering code- and data-deficient questions.
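To make the idea above concrete, here is a minimal base-R sketch. It assumes the Excel sheet was exported to CSV with each merged "Week" cell becoming one filled cell followed by an empty one; the toy csv_text, the "Product" label in the second header row, and the use of reshape() are illustrative assumptions, not part of the original answer.

```r
# Toy stand-in for the exported CSV (hypothetical data, 2 weeks instead of 26).
csv_text <- c(
  ",Week 1,,Week 2,",
  "Product,Units,Revenue,Units,Revenue",
  "A,10,100,12,120",
  "B,5,50,7,70"
)

# Read the two header rows as text and the body separately.
hdr  <- read.csv(text = csv_text, header = FALSE, nrows = 2,
                 colClasses = "character")
body <- read.csv(text = csv_text, header = FALSE, skip = 2,
                 stringsAsFactors = FALSE)

# Non-empty cells of the first header row are the week labels.
row1  <- unname(unlist(hdr[1, -1]))
weeks <- row1[row1 > ""]

# Combine both header rows into one name per column, as in the answer.
new_names <- gsub(" ", "", paste(rep(weeks, each = 2),
                                 c("Units", "Revenue"), sep = "_"))
names(body) <- c("Product", new_names)

# Reshape to one row per product and week.
units_cols <- grep("_Units$", names(body), value = TRUE)
rev_cols   <- grep("_Revenue$", names(body), value = TRUE)
long <- reshape(body, direction = "long", idvar = "Product",
                varying = list(units_cols, rev_cols),
                v.names = c("Units", "Revenue"),
                timevar = "Week", times = seq_along(weeks))
long <- long[order(long$Product, long$Week), ]
rownames(long) <- NULL
print(long)
```

With the real file, csv_text would be replaced by the file path (file = instead of text =), and the week count follows from the header row rather than being hard-coded.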




Answer 2:


I have run into the same problem many times and have used melt in reshape2 in the past. But here is a function that takes multiple rows of headers as well as multiple columns:

PivReady <- function(data, label_rows, label_columns) {
  # data: wide table imported without headers; the first label_rows rows are
  # header rows and the first label_columns columns are row labels.
  n_row <- nrow(data)
  n_col <- ncol(data)
  n_obs <- n_row - label_rows      # number of data rows
  n_val <- n_col - label_columns   # number of value columns
  pivRdata <- data.frame(matrix(ncol = label_columns + label_rows + 1, nrow = n_obs * n_val))
  # Repeat each row label once per value column.
  for (i in seq_len(label_columns)) {
    pivRdata[, i] <- rep(data[(label_rows + 1):n_row, i], each = n_val)
  }
  # Stack the transposed header rows once per data row.
  trowlabels <- t(data[1:label_rows, (label_columns + 1):n_col])
  pivRdata[, (label_columns + 1):(label_columns + label_rows)] <- do.call(rbind, replicate(n_obs, trowlabels, simplify = FALSE))
  # Flatten the values row by row so they line up with the labels above.
  datatrans <- as.vector(t(data[(label_rows + 1):n_row, (label_columns + 1):n_col]))
  pivRdata[, label_columns + label_rows + 1] <- datatrans
  # Label columns take their names from the last header row.
  col_names <- c(as.character(unlist(data[label_rows, seq_len(label_columns)])),
                 paste("Category", 1:label_rows, sep = ""), "Value")
  names(pivRdata) <- col_names
  return(pivRdata)
}

Yes, I know this code is not very beautiful, but if you import your data with header=FALSE and then tell the function that the data has, e.g., 2 columns of labels (the leftmost columns) and 3 rows of headers, it works quite nicely.

E.g.:

long_data <- PivReady(wide_data,3,2)
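For the two-row header in the question, the long table still needs one more step to reach the (Product, Week, Units, Revenue) layout. The sketch below is only an illustration: the long data frame is a hypothetical stand-in for what a call such as PivReady(wide_data, 2, 1) might return, assuming merged week cells were imported as a label followed by a blank and that values come back as character.

```r
# Hypothetical stand-in for the long output (2 products, 2 weeks).
long <- data.frame(
  Product   = rep(c("A", "B"), each = 4),
  Category1 = rep(c("Week 1", "", "Week 2", ""), times = 2),
  Category2 = rep(c("Units", "Revenue"), times = 4),
  Value     = c("10", "100", "12", "120", "5", "50", "7", "70"),
  stringsAsFactors = FALSE
)

# Fill blanks left by merged cells with the last non-blank week label.
filled <- long$Category1 != ""
long$Week <- long$Category1[filled][cumsum(filled)]
long$Week <- as.integer(sub("Week ", "", long$Week))
long$Value <- as.numeric(long$Value)

# Split the two measures and join them back side by side.
units   <- long[long$Category2 == "Units",   c("Product", "Week", "Value")]
revenue <- long[long$Category2 == "Revenue", c("Product", "Week", "Value")]
names(units)[3]   <- "Units"
names(revenue)[3] <- "Revenue"
tidy <- merge(units, revenue, by = c("Product", "Week"))
print(tidy)
```

The fill-down step relies on the first row carrying a non-blank week label; if the import already repeats the label across merged cells, that step can be dropped.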


Source: https://stackoverflow.com/questions/23233606/how-to-best-reshape-a-data-set-in-r-that-has-a-two-row-header
