Reading a non-standard CSV File into R

Question

I'm trying to read the following CSV file into R:

http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv

The code I'm currently using is:

url <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
shorthistory <- read.csv(url, skip = 4)

However, I keep getting the following warnings:

1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
3: In readLines(file, skip) : line 3 appears to contain an embedded nul
4: In readLines(file, skip) : line 4 appears to contain an embedded nul

This leads me to believe I am using the function incorrectly, since it fails on every line.

Any help would be very much appreciated!

Answer 1:

read.csv() doesn't seem to work here, apparently because of the blank cell in the top-left corner. The file has to be read line by line with readLines(), and the first 4 lines then skipped.

The example below opens the file as a connection with file() and reads it line by line with readLines(). The first 4 lines are dropped by subsetting. The file is tab-delimited, so strsplit() is applied to each line. The result is still a list of character vectors, which should then be converted to a data frame or another suitable type.

# open file connection and read lines
path <- "http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv"
con <- file(path, open = "rt", raw = TRUE)
text <- readLines(con, skipNul = TRUE)
close(con)

# skip the first 4 lines
text <- text[-(1:4)]
# split each line on tabs; the result is a list of character vectors
text <- strsplit(text, split = "\t")

text[[1]][1:4]
# [1] "1-PAGE LTD ORDINARY" "1PG "                "1330487"             "1.72"
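As a possible next step, the list of split fields can be bound into a data frame. The sketch below is only illustrative: `fields` is a hypothetical stand-in for the split lines (the second row is made up), and the column names are assumptions, not the headers of the real file.

```r
# Hypothetical stand-in for the list produced by strsplit();
# the real file has many more columns per row.
fields <- list(
  c("1-PAGE LTD ORDINARY", "1PG ", "1330487", "1.72"),
  c("EXAMPLE CO ORDINARY", "EXC ", "2500")  # made-up short row
)

# Pad short rows with NA so every row has the same number of fields
n_col <- max(lengths(fields))
padded <- lapply(fields, function(x) {
  c(x, rep(NA_character_, n_col - length(x)))
})

# Bind the rows into a character matrix, then convert to a data frame
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- c("Company", "Ticker", "Volume", "Percent")  # illustrative names

# Trim stray whitespace and convert the numeric columns
df$Ticker  <- trimws(df$Ticker)
df$Volume  <- as.numeric(df$Volume)
df$Percent <- as.numeric(df$Percent)
```

Padding before rbind() avoids the silent recycling that rbind() would otherwise apply to rows of unequal length.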

Answer 2:

After running into many problems with CSV files that contained a BOM (byte order mark) and NUL characters, I wrote this little function. It reads the file line by line (ignoring NULs), drops empty lines, and then applies read.csv().

# Read CSV files with BOM and NUL problems
read.csvX = function(file, encoding="UTF-16LE", header=TRUE, stringsAsFactors=TRUE) {
  csvLines = readLines(file, encoding=encoding, skipNul=TRUE, warn=FALSE)
  # Remove BOM (ÿþ) from first line
  if (substr(csvLines[[1]], 1, 2) == "ÿþ") {
    csvLines[[1]] = substr(csvLines[[1]], 3, nchar(csvLines[[1]]))
  }
  csvLines = csvLines[csvLines != ""]
  if (length(csvLines) == 0) {
    warning("Empty file")
    return(NULL)
  }
  csvData = read.csv(text=paste(csvLines, collapse="\n"), header=header, stringsAsFactors=stringsAsFactors)
  return(csvData)
}

Hope this answer to an old question helps someone.
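For context, here is a small sketch of why the NUL warnings and the "ÿþ" prefix show up with UTF-16LE files. It uses only base R on a made-up three-character string. Stripping the zero bytes, which is roughly what skipNul does, recovers the text only when every character is in the ASCII range.

```r
# Encode a short ASCII string as UTF-16LE and look at the raw bytes
bytes <- iconv("a,b", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(bytes)
# [1] 61 00 2c 00 62 00

# Every ASCII character is followed by a zero byte: the "embedded nul"
n_nul <- sum(bytes == as.raw(0))

# Dropping the zero bytes recovers the text (ASCII-only content)
stripped <- rawToChar(bytes[bytes != as.raw(0)])

# Decoding properly works for any content, not just ASCII
decoded <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")

# The UTF-16LE byte order mark is FF FE, which reads as "ÿþ" in Latin-1
bom_as_latin1 <- iconv(rawToChar(as.raw(c(0xff, 0xfe))),
                       from = "latin1", to = "UTF-8")
```

Because removing NULs is a byte-level shortcut, declaring the encoding when reading is the safer route for files that may contain non-ASCII characters.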

Answer 3:

I didn't end up trying readLines(), but it turns out the file was encoded as Unicode (UTF-16). Yes, the file was in a terrible format, but I ended up using the following code to grab just the volume data of the shorts.

  shorthistory <- read.csv("http://asic.gov.au/Reports/YTD/2015/RR20150511-001-SSDailyYTD.csv",skip=1,fileEncoding = "UTF-16",sep = "\t")
  shorthistory <- shorthistory[-(1:2),]
  shorthistory <- cbind(Row.Names = rownames(shorthistory), shorthistory)
  rownames(shorthistory) <- NULL
  colnames(shorthistory) <- substr(colnames(shorthistory),2,11)
  colnames(shorthistory)[1] <- "Company"
  colnames(shorthistory)[2] <- "Ticker"
  shorthist1 <- shorthistory[,1:2]
  i=3 ##start at first volume column with short data
  while(i<=length(colnames(shorthistory))){
    if(i%%2 == 0){
      shorthist1 <- cbind(shorthist1,shorthistory[i])
      i <- i+1
      }
    else{
      i <- i+1
    }
  }
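  library(reshape2)  # assumption: melt() is taken from reshape2; the original did not name the package
  melted <- melt(data = shorthist1,id = c("Ticker","Company"))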
  melted$variable <- as.POSIXlt(x = melted$variable,format = "%Y.%m.%d")
  melted$value[melted$value==""] <- 0.00
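The same idea can be tried offline on a tiny file. The sketch below is a simplified illustration, not the ASIC layout: the sample data and the `Vol.`/`Pct.` column names are made up, the file is written without a BOM, and the reshaping uses base R instead of melt().

```r
# Build a tiny tab-delimited sample (hypothetical data, hypothetical headers)
sample_lines <- c(
  "Company\tTicker\tVol.2015.01.02\tPct.2015.01.02\tVol.2015.01.05\tPct.2015.01.05",
  "ALPHA LTD\tALP\t100\t0.10\t150\t0.15",
  "BETA LTD\tBET\t200\t0.20\t\t"
)
txt <- paste0(paste(sample_lines, collapse = "\n"), "\n")

# Write the text as UTF-16LE bytes (no BOM, to keep the example simple)
tmp <- tempfile(fileext = ".csv")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

# Read it back, declaring the encoding; read.delim() defaults to sep = "\t"
wide <- read.delim(tmp, fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)

# Keep only the volume columns and reshape to long format
vol_cols <- grep("^Vol\\.", names(wide), value = TRUE)
long <- data.frame(
  Company = rep(wide$Company, times = length(vol_cols)),
  Ticker  = rep(wide$Ticker, times = length(vol_cols)),
  Date    = as.Date(rep(sub("^Vol\\.", "", vol_cols), each = nrow(wide)),
                    format = "%Y.%m.%d"),
  Volume  = unlist(wide[vol_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Treat missing volumes as zero, mirroring the last step above
long$Volume[is.na(long$Volume)] <- 0

unlink(tmp)
```

Selecting columns by name pattern, where the headers allow it, is less fragile than stepping through column positions, since it does not depend on the volume columns always sitting at even indices.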

Source: https://stackoverflow.com/questions/30251576/reading-a-non-standard-csv-file-into-r

Tags

csv

import-from-csv