I'm trying to analyse a large survey created with SurveyMonkey. The CSV export has hundreds of columns, and the output format is difficult to use because the headers run over two lines.
- Has anybody found a simple way of managing the headers in the CSV file so that the analysis is manageable?
- How do other people analyse results from SurveyMonkey?
You can export it from SurveyMonkey in a form that is convenient for R: when downloading responses, choose 'Advanced Spreadsheet Format'.
What I did in the end was print out the headers using LibreOffice, labeled as V1, V2, etc. Then I just read in the file as
m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)
and then just did the analysis against m1$V10, m1$V23 etc...
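Instead of a printout, the same V1, V2, ... lookup could be built in R itself. This is only a sketch: the file contents below are made up so that it runs as-is, and it assumes the raw export with both header rows still in place (hence skip = 2 rather than skip = 1).

```r
# Hypothetical two-line-header export, written to a temp file for the demo.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "RespondentID,Which fruits do you like?,,Any comments?",
  ",Apple,Banana,Open-Ended Response",
  "1,Apple,,None",
  "2,,Banana,Great survey"
), path)

# Read only the two header rows, as plain text.
hdr <- read.csv(path, header = FALSE, nrows = 2,
                stringsAsFactors = FALSE, colClasses = "character")

# One row per column: the V-name R assigns, plus both header lines.
lookup <- data.frame(
  col    = paste0("V", seq_len(ncol(hdr))),
  top    = unlist(hdr[1, ], use.names = FALSE),
  bottom = unlist(hdr[2, ], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Read the responses, skipping both header rows.
m1 <- read.csv(path, header = FALSE, skip = 2, stringsAsFactors = FALSE)
```

The lookup data frame can then be searched or printed whenever you need to know which question sits behind, say, m1$V3.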
To get around the mess of multiple columns I used the following little function
# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
    # e.g. mcols(df, c(14:18))
    exp <- paste('df[,', cols, ']', sep='', collapse=',')
    # this creates something like...
    # "df[,14],df[,15],df[,16],df[,17],df[,18]"
    # now we just want to do a paste of this expression...
    nexp <- paste(" paste(", exp, ", sep=' ')")
    # so now nexp looks something like...
    # " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep=' ')"
    # now we just need to parse this text... and eval() it...
    newcol <- eval(parse(text=nexp))
    newcol <- gsub(' +', ' ', newcol) # replace duplicate spaces by a single one
    newcol <- gsub('^ +', '', newcol) # remove leading spaces
    gsub(' +$', '', newcol) # remove trailing spaces
}

# mcols(df, c(14:18))
No doubt somebody will be able to clean this up!
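One possible cleanup, offered as a sketch rather than a drop-in replacement: do.call() can hand the selected columns straight to paste(), which avoids building and parsing a string of code. The name mcols2 and the sample data frame are made up for illustration.

```r
# Same idea as mcols(), without eval(parse()).
mcols2 <- function(df, cols) {
  # Paste the selected columns row-wise with a single-space separator.
  # unname() keeps column names from being mistaken for paste() arguments.
  parts <- unname(as.list(df[, cols, drop = FALSE]))
  merged <- do.call(paste, c(parts, sep = " "))
  # Collapse runs of spaces, then strip leading and trailing spaces.
  merged <- gsub(" +", " ", merged)
  gsub("^ +| +$", "", merged)
}

# Hypothetical "select all that apply" question spread over three columns.
df <- data.frame(a = c("Apple", "", ""),
                 b = c("", "Banana", ""),
                 c = c("Cherry", "", ""),
                 stringsAsFactors = FALSE)
merged <- mcols2(df, 1:3)
```

As with the original, NA cells would be pasted as the text "NA", so blank answers need to be empty strings rather than missing values for this to work.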
To tidy up Likert-like scales I used:
# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
tidylik4 <- function(x) {
    xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
    x <- as.character(x) # a factor would otherwise come out of ifelse() as integer codes
    y <- ifelse(x == '', NA, x)
    ordered(y, levels=xlevels)
}

for (i in 44:52) {
    m2[,i] <- tidylik4(m2[,i])
}
Feel free to comment as no doubt this will come up again!
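If several scales with different labels are involved, the same idea can be written with the levels as a parameter and applied with lapply(). This is a sketch only; tidy_likert and the small m2 below are hypothetical stand-ins, not the data from the survey.

```r
# Generalised version of tidylik4(): the level order is passed in.
tidy_likert <- function(x, levels) {
  x <- as.character(x)   # guard against factor input
  x[x == ""] <- NA       # blank answers become missing
  factor(x, levels = levels, ordered = TRUE)
}

lik4 <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")

# Made-up responses for two questions.
m2 <- data.frame(q1 = c("Agree", "", "Strongly Disagree"),
                 q2 = c("Disagree", "Strongly Agree", ""),
                 stringsAsFactors = FALSE)

# Convert every column in one go; m2[] keeps the data frame structure.
m2[] <- lapply(m2, tidy_likert, levels = lik4)
```

To convert only some columns, index both sides, e.g. m2[44:52] <- lapply(m2[44:52], tidy_likert, levels = lik4).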
As of November 2013, the webpage layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You'll get raw data in which the first row holds the question headers and each following row is one response, possibly split between multiple files if you have many responses or questions.
I have to deal with this pretty frequently, and having the headers spread over two rows is a bit painful. This function fixes that issue so that you only have a one-row header to deal with. It also joins the multipunch questions so you get "top - bottom" style naming.
#' @param x The path to a surveymonkey csv file
fix_names <- function(x) {
  # read only the two header rows, as text
  rs <- read.csv(
    x,
    nrows = 2,
    stringsAsFactors = FALSE,
    header = FALSE,
    check.names = FALSE,
    na.strings = "",
    encoding = "UTF-8"
  )

  rs[rs == ""] <- NA
  rs[rs == "NA"] <- "Not applicable"
  rs[rs == "Response"] <- NA
  rs[rs == "Open-Ended Response"] <- NA

  nms <- c()
  pre <- NULL          # question text carried across a multi-column question
  coming_top <- NA     # defined up front so very narrow files do not error
  coming_bottom <- NA

  for(i in 1:ncol(rs)) {
    current_top <- rs[1,i]
    current_bottom <- rs[2,i]

    if(i + 1 < ncol(rs)) {
      coming_top <- rs[1, i+1]
      coming_bottom <- rs[2, i+1]
    }

    # start of a multi-column question: remember its text as the prefix
    if(is.na(coming_top) & !is.na(current_top) & (!is.na(current_bottom) | grepl("^Other", coming_bottom)))
      pre <- current_top

    if((is.na(current_top) & !is.na(current_bottom)) | (!is.na(current_top) & !is.na(current_bottom)))
      nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")

    if(!is.na(current_top) & is.na(current_bottom))
      nms[i] <- current_top
  }

  nms
}
Note that it returns the names only. I typically just call read.csv with ..., skip = 2, header = FALSE, save the result to a variable and overwrite the names of that variable. It also helps a lot to set your na.strings and stringsAsFactors = FALSE.
nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms
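For illustration, here is the same read with the two suggested options set. The file contents and the nms vector are made up so that the sketch is self-contained; in practice nms would come from fix_names().

```r
# Hypothetical export written to a temp file for the demo.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Which fruits do you like?,,Any comments?",
  "Apple,Banana,Open-Ended Response",
  "Apple,,None",
  ",Banana,"
), path)

# Stand-in for fix_names(path); these names are assumptions for the demo.
nms <- c("Which fruits do you like? - Apple",
         "Which fruits do you like? - Banana",
         "Any comments?")

d <- read.csv(path, skip = 2, header = FALSE,
              na.strings = "",            # blank cells become NA
              stringsAsFactors = FALSE)   # keep text as character, not factor
names(d) <- nms
```

With na.strings = "", unanswered cells show up as NA instead of empty strings, which makes functions such as is.na() and table(useNA = "ifany") behave as expected.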
The issue with the headers is that columns with "select all that apply" will have a blank top row, and the column heading will be the row below. This is only an issue for those types of questions.
With this in mind, I wrote a loop that goes through all columns and, wherever the column name is blank (which shows up as a name with a character length of 1), replaces it with the value from the second row.
Then, you can kill the second row of the data and have a tidy data frame.
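for(i in 1:ncol(df)){
  newname <- colnames(df)[i]
  if(nchar(newname) < 2){
    # as.character() so a factor column yields its label, not its integer code
    colnames(df)[i] <- as.character(df[1,i])
  }
}
# drop the row that held the second header line
df <- df[-1,]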
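The same replacement can also be done without an explicit loop. A sketch, assuming (as above) that blank headers arrive as names shorter than two characters; the small data frame is invented for the demo.

```r
# Made-up frame: one ordinary question plus two "select all that apply" columns
# whose header cells were blank, so their names are a single space.
df <- data.frame(c("Response", "Yes", "No"),
                 c("Apple", "Apple", ""),
                 c("Banana", "", "Banana"),
                 stringsAsFactors = FALSE)
colnames(df) <- c("Do you like fruit?", " ", " ")

# Logical index of the columns whose name is effectively blank.
blank <- nchar(colnames(df)) < 2

# Take the replacement names from the first data row, as text.
colnames(df)[blank] <- as.character(unlist(df[1, blank, drop = FALSE]))

# Drop the row that held the second header line.
df <- df[-1, , drop = FALSE]
```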
Coming to the party late, but this is still an issue, and the best workaround I've found is a function that pastes the column names and sub-column names together, based on repeating values.
For instance, if exporting to .csv, the repeated column names will automatically be replaced with an X in RStudio. If exporting to .xlsx, the repeated value will be "...".
Here's a base R function:
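sm_header_function <- function(x, rep_val){
  orig <- x
  # first data row holds the sub-headers; keep only the columns that have one
  sv <- x
  sv <- sv[1,]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
  # reshape to one row per column: original name and sub-header value
  sv <- t(sv)
  sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
  names(sv)[1] <- "name"
  names(sv)[2] <- "value"
  # a new group starts at every name that does not begin with the placeholder
  sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
  # the first name in each group is the question text
  sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)
  new_names <- as.character(sv$new_value)
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  # drop the sub-header row and return the renamed data
  orig <- orig[-c(1),]
  orig
}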
sm_header_function(df, "X")
sm_header_function(df, "...")
With some sample data, the change in column names would look like this:
Original export from SurveyMonkey:
> colnames(sample)
[1] "Respondent ID" "Please provide your contact information:" "...11"
[4] "...12" "...13" "...14"
[7] "...15" "...16" "...17"
[10] "...18" "...19" "I wish it would have snowed more this winter."
Cleaned export from SurveyMonkey:
> colnames(sample_clean)
[1] "Respondent ID" "Please provide your contact information: Name"
[3] "Please provide your contact information: Company" "Please provide your contact information: Address"
[5] "Please provide your contact information: Address 2" "Please provide your contact information: City/Town"
[7] "Please provide your contact information: State/Province" "Please provide your contact information: ZIP/Postal Code"
[9] "Please provide your contact information: Country" "Please provide your contact information: Email Address"
[11] "Please provide your contact information: Phone Number" "I wish it would have snowed more this winter. Response"
Sample data:
structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621,
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin",
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale",
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's",
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2",
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia",
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa",
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.",
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104",
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country",
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com",
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com",
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646",
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944",
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response",
"Strongly disagree", "Strongly agree", "Neither agree nor disagree",
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
How about the following: use read.csv() with header=FALSE. Make two arrays, one with the two lines of headings and one with the answers to the survey. Then paste() the two rows/sentences together. Finally, use colnames()