Using R to parse out SurveyMonkey CSV files

说谎 2021-02-05 12:01

I'm trying to analyse a large survey created with SurveyMonkey, which has hundreds of columns in the CSV file, and the output format is difficult to use as the headers run over t

7 answers
  • 2021-02-05 12:55

    Coming to the party late, but this is still an issue and the best workaround I've found is using a function to paste the column names and sub-column names together, based on repeating values.

    For instance, if you export to .csv, the blank headers under a multi-part question are automatically replaced with names starting with X when the file is imported in RStudio. If you export to .xlsx, the placeholder names start with ... instead (for example, ...11).
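A quick way to see which placeholder prefix applies is to inspect the imported column names. The sketch below is only an illustration: `csv_text` is a made-up two-row header, and it assumes base R's `read.csv()` with default settings. Other import functions may name blank headers differently.

```r
# Hypothetical miniature export: the question text sits above the first
# sub-column only, and the remaining header cells are blank.
csv_text <- paste0(
  "Respondent ID,Contact:,,\n",
  ",Name,Company,City\n",
  "1,Ada,Analytical Engines,London\n"
)

# With the default check.names = TRUE, read.csv() passes the headers
# through make.names(), so the blank ones become names starting with "X".
df_csv <- read.csv(text = csv_text)
default_names <- names(df_csv)
print(default_names)

# The columns that still need a real name are the ones with that prefix.
placeholder_cols <- default_names[startsWith(default_names, "X")]
```

Whatever prefix shows up in the printed names is the value to pass as `rep_val`.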

    Here's a base R solution:

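    sm_header_function <- function(x, rep_val){

      orig <- x

      # The first row holds the sub-column labels; keep only columns that have one
      sv <- x[1,]
      sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]

      # Reshape to one row per column: "name" = header, "value" = sub-label
      sv <- t(sv)
      sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
      names(sv)[1] <- "name"
      names(sv)[2] <- "value"
      sv$name <- as.character(sv$name)  # guards against factors in R < 4.0

      # Start a new group at every header that is not a placeholder (rep_val)
      sv$grp <- cumsum(!startsWith(sv$name, rep_val))
      # Carry the real question text forward across its group
      sv$new_value <- ave(sv$name, sv$grp, FUN = function(x) head(x, 1))
      sv$new_value <- paste0(sv$new_value, " ", sv$value)

      # Rename the matching columns and drop the sub-label row
      colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
      orig <- orig[-c(1),]
      return(orig)
    }

    sm_header_function(df, "X")    # placeholder prefix after a .csv import
    sm_header_function(df, "...")  # placeholder prefix after an .xlsx import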
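The grouping step inside the function can also be looked at in isolation. What follows is a minimal sketch of the same idea, not the function itself: the data frame `df` is made up, it uses `...` placeholders as in an .xlsx import, and the variable names (`is_placeholder`, `grp`, `question`) are hypothetical.

```r
# Made-up miniature export: row 1 holds the sub-column labels.
df <- data.frame(
  c(NA, "1", "2"),
  c("Name", "Ada", "Grace"),
  c("Company", "Analytical Engines", "US Navy"),
  c("Response", "Agree", "Disagree"),
  stringsAsFactors = FALSE
)
names(df) <- c("Respondent ID", "Contact:", "...3", "Snow?")

headers <- names(df)
labels  <- unlist(df[1, ], use.names = FALSE)  # sub-labels from the first row

# A header counts as a placeholder if it starts with the repeated prefix.
is_placeholder <- startsWith(headers, "...")

# cumsum() steps up at each real header, so every placeholder lands in
# the same group as the question to its left.
grp <- cumsum(!is_placeholder)

# Within each group, reuse the first (real) header as the question text.
question <- ave(headers, grp, FUN = function(h) h[1])

# Append the sub-label where there is one; leave other headers untouched.
new_names <- ifelse(is.na(labels), headers, paste(question, labels))

clean <- df[-1, , drop = FALSE]
names(clean) <- new_names
print(names(clean))
```

One limitation of prefix matching, shared with the function above: a genuine question whose text starts with the prefix (for example a question beginning with "X") would be treated as a placeholder, so it is worth checking the result against the original headers.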
    

    With some sample data, the change in column names would look like this:

    Original export from SurveyMonkey:

    > colnames(sample)
     [1] "Respondent ID"                                 "Please provide your contact information:"      "...11"                                        
     [4] "...12"                                         "...13"                                         "...14"                                        
     [7] "...15"                                         "...16"                                         "...17"                                        
    [10] "...18"                                         "...19"                                         "I wish it would have snowed more this winter."
    

    Cleaned export from SurveyMonkey:

    > colnames(sample_clean)
     [1] "Respondent ID"                                            "Please provide your contact information: Name"           
     [3] "Please provide your contact information: Company"         "Please provide your contact information: Address"        
     [5] "Please provide your contact information: Address 2"       "Please provide your contact information: City/Town"      
     [7] "Please provide your contact information: State/Province"  "Please provide your contact information: ZIP/Postal Code"
     [9] "Please provide your contact information: Country"         "Please provide your contact information: Email Address"  
    [11] "Please provide your contact information: Phone Number"    "I wish it would have snowed more this winter. Response"  
    

    Sample data:

    structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621, 
    11385258069, 11385253194, 11385240121, 11385226951, 11385212508
    ), `Please provide your contact information:` = c("Name", "Benjamin Franklin", 
    "Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale", 
    "Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's", 
    "NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
    ), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2", 
    NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia", 
    "Decatur", "Washington", "Great Barrington", "Florence", "Pisa", 
    "Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.", 
    "MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104", 
    "20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country", 
    NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com", 
    "mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com", 
    "firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
    ), ...19 = c("Phone Number", "215-555-4444", "221-134-4646", 
    "999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944", 
    "215-999-8877"), `I wish it would have snowed more this winter.` = c("Response", 
    "Strongly disagree", "Strongly agree", "Neither agree nor disagree", 
    "Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA, 
    -8L), class = c("tbl_df", "tbl", "data.frame"))
    