问题
I am processing the US Weather service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 year file contains several rows with very large freeform text fields which contain embedded NUL characters, in an otherwise vanilla ascii database. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).
R cannot handle corrupted string data without errors,and this includes R data.frame, data.table, stringr, and stringi package functions (tried).
I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R markdown type report with embedded code.
Suggestions?
回答1:
Maybe this could be of help:
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv",
open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"),
con = "StormEvents_ascii.csv")
I was able to read the csv file without errors with this call do read.table:
options(stringAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE,
sep = ",", fill = TRUE, quote = '"')
Obviously you'd need to change the class of several columns, since all are considered character as it is.
回答2:
Just for posterity - you can use binary reads (readBin()
) and replace the NUL
s with anything else - see
Removing "NUL" characters (within R)
来源:https://stackoverflow.com/questions/28979857/is-there-a-sed-type-package-in-r-for-removing-embedded-nuls