Question
In R, I'm trying to read in a basic CSV file of about 42,900 rows (confirmed by Unix's wc -l). The relevant code is
vecs <- read.csv("feature_vectors.txt", header=FALSE, nrows=50000)
where nrows is a slight overestimate because why not. However,
> dim(vecs)
[1] 16853     5
indicating that the resultant data frame has on the order of 17,000 rows. Is this a memory issue? Each row consists of a ~30 character hash code, a ~30 character string, and 3 integers, so the total size of the file is only about 4MB.
If it's relevant, I should also note that a lot of the rows have missing fields.
Thanks for your help!
Answer 1:
This sort of problem is often easy to diagnose with count.fields, which counts the number of fields in each line of the file, i.e. how many columns read.csv would see for each row.
(n_fields <- count.fields("feature_vectors.txt", sep = ","))
If not all the values of n_fields are the same, you have a problem.
if (any(diff(n_fields) != 0)) {
  warning("There's a problem with the file")
}
In that case, look at the values of n_fields that differ from what you expect: the problems occur in those rows.
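For example, assuming each well-formed row should have 5 fields (as the dim(vecs) output in the question suggests), something like this locates the offending lines:

table(n_fields)                   # distribution of field counts per line
bad_rows <- which(n_fields != 5)  # 5 is an assumption; use your expected column count
head(bad_rows)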
As Justin mentioned, a common problem is unmatched quotes. Open your CSV file and find out how strings are quoted there. Then call read.csv, specifying the quote argument.
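For instance, if the stray quote characters aren't meant as quoting at all (a guess about your file, not a certainty), disabling quote processing often recovers every row:

vecs <- read.csv("feature_vectors.txt", header = FALSE, quote = "", nrows = 50000)
dim(vecs)  # should now report roughly 42,900 rows if quoting was the culprit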
Answer 2:
My guess is that you have embedded unmatched " characters, so some of your rows are actually much longer than they should be. I'd do something like

apply(vecs, 2, function(x) max(nchar(as.character(x))))

to check.
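As a rough check on the raw file itself (this assumes " is the quoting character), you can count double quotes per line; lines with an odd count contain an unmatched quote:

raw_lines <- readLines("feature_vectors.txt")
n_quotes  <- nchar(raw_lines) - nchar(gsub('"', "", raw_lines, fixed = TRUE))
which(n_quotes %% 2 == 1)  # line numbers with an unmatched quote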
Source: https://stackoverflow.com/questions/11320372/rs-read-csv-omitting-rows