Is there a sed type package in R for removing embedded NULs?

Submitted by 故事扮演 on 2019-12-12 14:58:47

Question


I am processing the US Weather service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 year file contains several rows with very large freeform text fields which contain embedded NUL characters, in an otherwise vanilla ascii database. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).

R cannot handle corrupted string data without errors, and this includes R data.frame, data.table, stringr, and stringi package functions (all tried).

I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R Markdown report with embedded code.

Suggestions?


Answer 1:


Maybe this could be of help:

in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv", 
                open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"), 
           con = "StormEvents_ascii.csv")

With that done, I was able to read the CSV file without errors with this call to read.table:

options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE, 
                           sep = ",", fill = TRUE, quote = '"')

Obviously you'd need to change the class of several columns afterwards, since all of them are read in as character.
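One way to do that column conversion in bulk is `type.convert()`, which guesses an appropriate class for each character column. A minimal sketch (the column names here are illustrative, not the actual StormEvents schema):

```r
# Toy data frame standing in for the all-character result of read.table.
StormEvents <- data.frame(EVENT_ID = c("1", "2"),
                          STATE    = c("TEXAS", "OHIO"),
                          stringsAsFactors = FALSE)

# Let type.convert() reclass each column; as.is = TRUE keeps
# non-numeric columns as character rather than factor.
StormEvents[] <- lapply(StormEvents, type.convert, as.is = TRUE)

sapply(StormEvents, class)  # EVENT_ID becomes integer, STATE stays character
```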




Answer 2:


Just for posterity - you can read the file in binary mode with readBin() and replace the NULs with anything else - see Removing "NUL" characters (within R)
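A minimal sketch of that readBin() approach, demonstrated on a small temporary file with an embedded NUL (for the real data, point `in_file` at the StormEvents CSV instead; file names here are illustrative):

```r
# Build a tiny CSV containing an embedded NUL byte for demonstration.
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("col1,col2\na,b"), as.raw(0), charToRaw("c\n")), in_file)

# Read the whole file as raw bytes and drop every NUL (0x00) byte.
raw_bytes <- readBin(in_file, what = "raw", n = file.info(in_file)$size)
clean     <- raw_bytes[raw_bytes != as.raw(0)]
# Alternatively, replace NULs in place to preserve byte offsets:
#   raw_bytes[raw_bytes == as.raw(0)] <- as.raw(32)  # NUL -> space
writeBin(clean, out_file)

StormEvents <- read.csv(out_file)  # parses without embedded-NUL warnings
```

This keeps everything inside R, so it slots into an R Markdown chunk without shelling out to sed.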



Source: https://stackoverflow.com/questions/28979857/is-there-a-sed-type-package-in-r-for-removing-embedded-nuls
