Using R to parse out Surveymonkey csv files

后端 未结 7 1061
说谎
说谎 2021-02-05 12:01

I\'m trying to analyse a large survey created with surveymonkey which has hundreds of columns in the CSV file and the output format is difficult to use as the headers run over t

相关标签:
7条回答
  • 2021-02-05 12:36

    The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.

    With this in mind, I wrote a loop to go through all columns and replace the column names with the value from the second row if the column name was blank- which has a character length of 1.

    Then, you can kill the second row of the data and have a tidy data frame.

    for(i in 1:ncol(df)){
    newname <- colnames(df)[i]
    if(nchar(newname) < 2){
    colnames(df)[i] <- df[1,i]
    } 
    
    df <- df[-1,]
    
    0 讨论(0)
  • 2021-02-05 12:39

    As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data as first row = question headers / each following row = 1 response, possibly split between multiple files if you have many responses / questions.

    enter image description here

    0 讨论(0)
  • 2021-02-05 12:44

    What I did in the end was print out the headers using libreoffice labeled as V1,V2, etc. then I just read in the file as

     m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
    

    and then just did the analysis against m1$V10, m1$V23 etc...

    To get around the mess of multiple columns I used the following little function

    # function to merge columns into one with a space separator and then
    # remove multiple spaces
    mcols <- function(df, cols) {
        # e.g. mcols(df, c(14:18))
            exp <- paste('df[,', cols, ']', sep='', collapse=',' )
            # this creates something like...
            # "df[,14],df[,15],df[,16],df[,17],df[,18]"
            # now we just want to do a paste of this expression...
            nexp <- paste(" paste(", exp, ", sep=' ')")
            # so now nexp looks something like...
            # " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep='')"
            # now we just need to parse this text... and eval() it...
            newcol <- eval(parse(text=nexp))
            newcol <- gsub('  *', ' ', newcol) # replace duplicate spaces by a single one
            newcol <- gsub('^ *', '', newcol) # remove leading spaces
            gsub(' *$', '', newcol) # remove trailing spaces
    }
    # mcols(df, c(14:18))
    

    No doubt somebody will be able to clean this up!

    To tidy up Likert-like scales I used:

    # function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
    tidylik4 <- function(x) {
      xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
      y <- ifelse(x == '', NA, x)
      ordered(y, levels=xlevels)
    }
    
    for (i in 44:52) {
      m2[,i] <- tidylik4(m2[,i])
    }
    

    Feel free to comment as no doubt this will come up again!

    0 讨论(0)
  • 2021-02-05 12:49

    You can export it in a convenient form that fits R from Surveymonkey, see download responses in 'Advanced Spreadsheet Format'

    surveymonkey export

    0 讨论(0)
  • 2021-02-05 12:51

    How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two lines of headings and one with the answers to the survey. Then paste() the two rows/sentences of together. Finally, use colnames().

    0 讨论(0)
  • 2021-02-05 12:54

    I have to deal with this pretty frequently, and having the headers on two columns is a bit painful. This function fixes that issue so that you only have a 1 row header to deal with. It also joins the multipunch questions so you have top: bottom style naming.

    #' @param x The path to a surveymonkey csv file
    fix_names <- function(x) {
      rs <- read.csv(
        x,
        nrows = 2,
        stringsAsFactors = FALSE,
        header = FALSE,
        check.names = FALSE, 
        na.strings = "",
        encoding = "UTF-8"
      )
    
      rs[rs == ""] <- NA
      rs[rs == "NA"] <- "Not applicable"
      rs[rs == "Response"] <- NA
      rs[rs == "Open-Ended Response"] <- NA
    
      nms <- c()
    
      for(i in 1:ncol(rs)) {
    
        current_top <- rs[1,i]
        current_bottom <- rs[2,i]
    
        if(i + 1 < ncol(rs)) {
          coming_top <- rs[1, i+1]
          coming_bottom <- rs[2, i+1]
        }
    
        if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
          pre <- current_top
    
        if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
          nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")
    
        if(!is.na(current_top) & is.na(current_bottom))
          nms[i] <- current_top
    
      }
    
    
      nms
    }
    

    If you note, it returns the names only. I typically just read.csv with ...,skip=2, header = FALSE, save to a variable and overwrite the names of the variable. It also helps ALOT to set your na.strings and stringsAsFactor = FALSE.

    nms = fix_names("path/to/csv")
    d = read.csv("path/to/csv", skip = 2, header = FALSE)
    names(d) = nms 
    
    0 讨论(0)
提交回复
热议问题