Using R to parse out SurveyMonkey CSV files

I'm trying to analyse a large survey created with SurveyMonkey, which has hundreds of columns in the CSV file, and the output format is difficult to use as the headers run over t

7 answers

面向向阳花

2021-02-05 12:55

Coming to the party late, but this is still an issue. The best workaround I've found is a function that pastes each column name together with its sub-column name, using the repeated placeholder prefix to work out which columns belong to the same question.

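For instance, if you export to .csv and read the file with read.csv(), the blank header cells are automatically given names starting with X (X, X.1, and so on). If you export to .xlsx and read it in (for example with readxl), the blank names start with ... instead (such as ...11).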
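To see where the placeholder names come from, here is a minimal sketch using base R only. The CSV text is made-up data, and the anchored pattern used to flag placeholders is my own assumption about what read.csv() generates with its default check.names = TRUE; it is not part of the function in this answer.

```r
# A tiny two-row-header export, inlined as text (hypothetical data).
csv_text <- paste(
  "Respondent ID,Please provide your contact information:,,",
  ",Name,Company,Address",
  "1,Ada,Acme,1 Main St",
  sep = "\n"
)

# read.csv() runs make.names() on the header by default (check.names = TRUE),
# so blank header cells come back with generated names.
df_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(names(df_csv))

# Flag generated names with an anchored pattern rather than a bare "X" prefix,
# so a real column such as "Xray" would not be mistaken for a placeholder.
is_placeholder <- grepl("^X(\\.[0-9]+)?$", names(df_csv))
print(is_placeholder)
```

If your survey has genuine column names that begin with X, a stricter test like this may be safer than a plain prefix match.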

Here's a base R solution:

sm_header_function <- function(x, rep_val) {
  orig <- x

  # The first data row holds the sub-column labels; keep only columns that have one.
  sv <- x[1, ]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]

  # Reshape to one row per column: original name plus its sub-column label.
  sv <- t(sv)
  sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
  names(sv)[1] <- "name"
  names(sv)[2] <- "value"

  # Start a new group at every name that does not begin with the placeholder prefix.
  sv$grp <- cumsum(!startsWith(sv$name, rep_val))
  # Every column in a group inherits the first (real) name of that group.
  sv$new_value <- ave(sv$name, sv$grp, FUN = function(n) head(n, 1))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)

  # Rename the matching columns, then drop the label row from the data.
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  orig <- orig[-1, ]
  return(orig)
}

df_clean <- sm_header_function(df, "X")    # df read from a .csv export
df_clean <- sm_header_function(df, "...")  # df read from an .xlsx export
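The core of the approach is the grouping step, which can be sketched on its own with plain vectors. The names and labels below are hypothetical, and unlike the function above, this sketch leaves columns without a label untouched instead of filtering them out first.

```r
# Hypothetical header names as read from an .xlsx export, plus the
# sub-column labels found in the first data row.
col_names <- c("Respondent ID", "Contact:", "...3", "...4", "Rating")
labels    <- c(NA, "Name", "Company", "City", "Response")

# Each name that is NOT a placeholder opens a new group.
grp <- cumsum(!startsWith(col_names, "..."))

# Within a group, every column takes the group's first (real) name.
parent <- ave(col_names, grp, FUN = function(n) n[1])

# Append the label where one exists; leave unlabeled columns unchanged.
new_names <- ifelse(is.na(labels), col_names, paste(parent, labels))
print(new_names)
# [1] "Respondent ID"    "Contact: Name"    "Contact: Company" "Contact: City"
# [5] "Rating Response"
```

Because startsWith() does a literal prefix match, the dots in "..." are not treated as regex wildcards.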

With some sample data, the change in column names would look like this:

Original export from SurveyMonkey:

> colnames(sample)
 [1] "Respondent ID"                                 "Please provide your contact information:"      "...11"                                        
 [4] "...12"                                         "...13"                                         "...14"                                        
 [7] "...15"                                         "...16"                                         "...17"                                        
[10] "...18"                                         "...19"                                         "I wish it would have snowed more this winter."

Cleaned export from SurveyMonkey:

> colnames(sample_clean)
 [1] "Respondent ID"                                            "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"         "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"       "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province"  "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"         "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"    "I wish it would have snowed more this winter. Response"

Sample data:

structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621, 
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin", 
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale", 
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's", 
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2", 
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia", 
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa", 
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.", 
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104", 
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country", 
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com", 
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com", 
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646", 
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944", 
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response", 
"Strongly disagree", "Strongly agree", "Neither agree nor disagree", 
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))
