Question
I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:
name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451
If every row looked like the first data row (two character columns followed by an integer column), this would be easy to parse with read_csv:
df <- read_csv("data.csv")
However, some rows are formatted like the second one: the second column ("info") contains a string that is only partly enclosed in double quotes. As a result, read_csv doesn't treat the comma after the word "cool" as a delimiter, and the entire following row gets appended to the offending cell.
A solution for this kind of problem is to pass FALSE to the escape_double argument in read_delim(), like so:
df <- read_delim("data.csv", delim = ",", escape_double = FALSE)
This works for the second row but fails on the third, where the second column contains a string enclosed in double quotes that itself contains nested double quotes and a comma.
I have read the readr documentation but have as yet found no solution that would parse both types of rows.
Answer 1:
Here is what worked for me with the example given.
I used read.csv rather than read_csv, so I start with a data frame rather than a tibble.
library(stringr)  # str_replace_all
library(tidyr)    # separate
library(dplyr)    # %>%, as_tibble, glimpse
# Read the CSV. I just saved your example table as a CSV,
# which resulted in a file with a single column.
a <- read.csv(file = "Book1.csv", header = TRUE)
# Replace the comma in the third(!) line with just a space
a[, 1] <- str_replace_all(as.vector(a[, 1]), ", ", " ")
# Use separate() from the tidyr package to split the column into three columns
# and convert to a tibble
a <- a %>%
  separate(name.info.amount_spent, c("name", "info", "spent"), ",") %>%
  as_tibble()
glimpse(a)
$name <chr> "John Doe", "Jane Doe", "Senator Sally Doe"
$info <chr> "Is a good guy", "\"Jan Doe\" is cool", "\"Sally \"Sal\" Doe is from New York NY\""
$spent <chr> "5412030", "3159", "4451"
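The same idea can be tried without the intermediate Book1.csv file. The following is only a minimal sketch in base R, assuming the rows are already available as a character vector (the name rows is made up for illustration). It relies on the embedded comma being the only one followed by a space, which holds for this example but may not hold for other data.

```r
# Example rows copied from the question (header omitted).
rows <- c(
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)

# Drop the embedded comma: in this data it is the only comma followed by a space.
cleaned <- gsub(", ", " ", rows, fixed = TRUE)

# Every remaining comma is a real delimiter, so a plain split is enough.
fields <- do.call(rbind, strsplit(cleaned, ",", fixed = TRUE))
colnames(fields) <- c("name", "info", "spent")

a <- as.data.frame(fields, stringsAsFactors = FALSE)
str(a)
```

Note that, as in the output above, the comma inside the third row's info value is lost and all three columns stay character.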
Answer 2:
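You could use a regular expression that splits only on the delimiting commas (using (*SKIP)(*FAIL)). The pattern first matches each double-quoted run and discards it, so only commas outside such runs are left as split points: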
input <- c('John Doe,Is a good guy,5412030',
           'Jane Doe,"Jan Doe" is cool,3159',
           'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')
lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)
(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name", "info", "amount_spent")))
This yields
               name                                   info amount_spent
1          John Doe                          Is a good guy      5412030
2          Jane Doe                      "Jan Doe" is cool         3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY"         4451
See a demo for the expression on regex101.com.
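To apply this to a file with a header row, the split can be wrapped in a small function. This is just a sketch under a few assumptions: parse_irregular_csv is a hypothetical helper (not part of readr or base R), every line is expected to split into the same number of fields as the header, and the last column is converted to integer as described in the question. With a real file, the lines vector would come from readLines("data.csv").

```r
# Hypothetical helper, not part of readr or base R.
parse_irregular_csv <- function(lines) {
  # Split on commas that are not inside a "..." run; quoted runs are skipped.
  pattern <- '"[^"]*"(*SKIP)(*FAIL)|,'
  parts <- strsplit(lines, pattern, perl = TRUE)

  header <- parts[[1]]
  # Fail early if any line does not split into the expected number of fields.
  stopifnot(all(lengths(parts) == length(header)))

  body <- do.call(rbind, parts[-1])
  df <- setNames(as.data.frame(body, stringsAsFactors = FALSE), header)

  # Assumption: the amount_spent column holds integers, as in the question.
  df$amount_spent <- as.integer(df$amount_spent)
  df
}

# In practice: lines <- readLines("data.csv")
lines <- c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)
df <- parse_irregular_csv(lines)
str(df)
```

The field-count check matters because rbind would otherwise silently recycle values when a row splits into the wrong number of pieces.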
Source: https://stackoverflow.com/questions/54638300/parsing-a-csv-with-irregular-quoting-rules-using-readr