Parsing a CSV with irregular quoting rules using readr

Question

I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:

name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451

If every row looked like the first data row below the header (two character columns followed by an integer column), this would be easy to parse with read_csv:

df <- read_csv("data.csv")

However, some rows are formatted like the second one: the second column ("info") contains a string that is only partly enclosed in double quotes. As a result, read_csv doesn't treat the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.

A solution for this kind of problem is to pass FALSE to the escape_double argument in read_delim(), like so:

df <- read_delim("data.csv", delim = ",", escape_double = FALSE)

This works for the second row but fails on the third, where the second column is a string enclosed in double quotes that itself contains nested double quotes and a comma.
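One way to see why the third row is the hard case is to count the comma-separated pieces on each line with quote handling switched off. This is only a diagnostic sketch using base R's count.fields() on a temporary copy of the example; the names path and n_fields are made up for illustration.

```r
# Write the example from above to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,info,amount_spent",
  "John Doe,Is a good guy,5412030",
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
), path)

# Count comma-separated pieces per line, treating quotes as ordinary characters
n_fields <- count.fields(path, sep = ",", quote = "")
n_fields
```

With quotes ignored, the last line splits into four pieces instead of three because of the comma in "New York, NY", so ignoring quotes entirely cannot be the whole answer.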

I have read the readr documentation but have not yet found a solution that parses both types of rows.

Answer 1:

Here is what worked for me with the example specified.

I used read.csv rather than read_csv, so I start with a data frame rather than a tibble.

library(stringr)  # str_replace_all
library(tidyr)    # separate
library(dplyr)    # %>%, as_tibble, glimpse

# Read the CSV. I just saved the example table from the question as a CSV,
# which resulted in a CSV with one column.
a <- read.csv(file = "Book1.csv", header = TRUE)

# Replace the comma followed by a space in the third (!) row with a single space
a[, 1] <- str_replace_all(as.vector(a[, 1]), ", ", " ")

# Use separate() from the tidyr package to split the column into three columns,
# then convert to a tibble
a <- a %>%
  separate(name.info.amount_spent, c("name", "info", "spent"), ",") %>%
  as_tibble()
glimpse(a)
 $name  <chr> "John Doe", "Jane Doe", "Senator Sally Doe"
 $info  <chr> "Is a good guy", "\"Jan Doe\" is cool", "\"Sally \"Sal\" Doe is from New York NY\""
 $spent <chr> "5412030", "3159", "4451"
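Note that the replacement step removes the comma from "New York, NY". If only the middle column can contain commas (an assumption that holds for the example but may not hold for the real file), a variation is to capture the text before the first comma and after the last comma with one regular expression. This is a sketch in base R rather than tidyr, and the names lines, pattern, parts, and b are only for illustration.

```r
lines <- c(
  "John Doe,Is a good guy,5412030",
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)

# Group 1: text up to the first comma; group 3: text after the last comma;
# group 2: everything in between, commas and quotes included
pattern <- "^([^,]*),(.*),([^,]*)$"
m <- regmatches(lines, regexec(pattern, lines))

# Each match is c(full_match, group1, group2, group3); drop the full match
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

b <- data.frame(
  name  = parts[, 1],
  info  = parts[, 2],
  spent = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
b
```

To run this on a file, the lines could come from readLines() with the header line dropped. The approach breaks as soon as the first or last column can also contain a comma.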

Answer 2:

You could use a regular expression that splits only on the commas outside a double-quoted run, skipping the quoted parts with (*SKIP)(*FAIL):

input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
           'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')

lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))

This yields

               name                                   info amount_spent
1          John Doe                          Is a good guy      5412030
2          Jane Doe                      "Jan Doe" is cool         3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY"         4451

See a demo for the expression on regex101.com.
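The same idea can be wrapped up so that it takes the header from the first line and converts the numeric column. This is a minimal sketch, not a readr feature: parse_irregular_csv is a hypothetical helper name, and it assumes every comma inside a field sits between a pair of double quotes, as in the example.

```r
# Hypothetical helper: split each line on commas that are not inside a
# "..." run, using the same regular expression as above
parse_irregular_csv <- function(lines) {
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  fields <- strsplit(lines[-1], '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

  # Stop early if any row did not split into one field per header column
  stopifnot(all(lengths(fields) == length(header)))

  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- header

  # Convert columns that look numeric (here amount_spent); keep text as character
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

lines <- c(
  "name,info,amount_spent",
  "John Doe,Is a good guy,5412030",
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)
df <- parse_irregular_csv(lines)
str(df)
```

For the file from the question, the call would be parse_irregular_csv(readLines("data.csv")). A field with an unpaired quote before a comma would split in the wrong place, which the stopifnot() check is meant to catch when it changes the field count.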

Source: https://stackoverflow.com/questions/54638300/parsing-a-csv-with-irregular-quoting-rules-using-readr

Tags

regex

tidyverse

readr