Remove empty columns from read_csv()

Question

I am trying to read a CSV file using read_csv() from the readr package, and then remove its empty columns.

If I use read.csv() instead, then the empty columns 8:12 can easily be removed using

library(dplyr)    
select(data, 1:7)

However, when I read the CSV file with read_csv(), the same code gives an error:

Error: found duplicated column name: NA, NA, NA, NA

How can I remove these empty columns?

It seems pointless to name empty columns properly just so I can remove them. I would prefer to use read_csv() rather than read.csv(), as it makes life a bit easier later in the analysis.

Answer 1:

You could do:

data <- data[,apply(data, 2, function(x) { sum(!is.na(x)) > 0 })]

This keeps only the columns that contain at least one non-NA value, i.e. it drops columns that are entirely NA.

Or, if you have dplyr 0.5 installed, you can use the new select_if function to achieve the same effect:

has_data <- function(x) { sum(!is.na(x)) > 0 }
data <- data %>% select_if(has_data)
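As a hedged aside: in dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(). A minimal sketch on a made-up data frame (the column names a, b, c, d and the object names are illustrative, not taken from the asker's file):

```r
library(dplyr)

# Toy data: columns c and d are entirely NA (illustrative only)
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), c = NA, d = NA_real_)

# Keep only the columns that contain at least one non-NA value
cleaned <- toy %>% select(where(~ !all(is.na(.x))))

names(cleaned)
```

The predicate is the same idea as has_data above, just written as a formula.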

Answer 2:

I'm not sure about read_csv, but if you use read.csv and specify colClasses as "NULL" for the columns you don't want, you'll get what you're after (adjust the integers in the rep calls as needed):

read.csv( file = [yourfile],
        colClasses = c( rep("character",3), rep("NULL",5) )
)

The above will return only the first 3 columns, and disregard the following 5 columns.
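A small self-contained sketch of the same idea, using inline text instead of a file so it can be run as is. The sample lines and column types are assumptions made up for illustration:

```r
# Three real columns followed by two empty, unnamed ones (made-up sample)
csv_lines <- c("a,b,c,,",
               "1,x,2.5,,",
               "3,y,4.5,,")

# "NULL" in colClasses tells read.csv to skip that column entirely
dat <- read.csv(text = csv_lines,
                colClasses = c("integer", "character", "numeric",
                               rep("NULL", 2)))

names(dat)
```

Note that this approach requires knowing the positions of the empty columns in advance.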

ALTERNATIVE ANSWER:
Have you tried fread from the data.table package? It has a select argument, which might be useful for you, e.g.:

fread( [filename], select = c(1:3) )
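For a runnable version of the call above, fread also accepts the data through its text argument. This is only a sketch on made-up sample lines; it assumes the data.table package is installed:

```r
# Made-up sample: three real columns followed by two empty, unnamed ones
csv_lines <- c("a,b,c,,",
               "1,x,2.5,,",
               "3,y,4.5,,")

# select keeps only the listed column positions while reading
dt <- data.table::fread(text = csv_lines, select = 1:3)

names(dt)
```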

It also has the benefit of being quite a bit faster than read.csv and read_csv. Here's a speed test with a particular file I have:

microbenchmark::microbenchmark( 
fread = {rangerdata2 <- data.table::fread( filename, select = c(1:3) )}, 
read.csv = {rangerdata2 <- utils::read.csv( file = filename )[,1:3]}, 
read_csv = {rangerdata2 <- readr::read_csv( file = filename )[,1:3]}, 
times = 1000)

Unit: milliseconds
     expr      min       lq      mean    median        uq      max neval cld
    fread  1.22161  1.32841  1.464724  1.377178  1.442089 14.57102  1000 a
 read.csv 18.25402 18.55992 19.664278 18.772855 19.565684 34.87589  1000   c
 read_csv 13.43166 13.76704 14.615746 13.975987 14.608822 33.36244  1000  b

Answer 3:

Once your CSV file is loaded into R as a data frame, you could do the following (assuming your data frame is called dat):

dat = dat[, sapply(dat, function(i) !all(is.na(i)))]

Initially, I was thinking that if you use read_csv you could do:

dat = dat[, !is.na(names(dat))]

because read_csv sets the names of all the empty columns to NA. However, this could be dangerous. If you have a column with no name in the first row, but some data, that column's name would also be NA and it would be deleted as well.
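To make the difference concrete, here is a minimal sketch comparing the two tests. The header vector and column contents are invented for illustration; the fourth column has no header but does hold data:

```r
# Hypothetical header row and column contents (not from the asker's file)
header <- c("id", "value", NA, NA)
cols <- list(1:2, c(3.5, 4.5), c(NA, NA), c("x", "y"))

# Name-based test: drops every column whose header is missing
keep_by_name <- !is.na(header)

# Content-based test: drops only columns that are entirely NA
keep_by_content <- vapply(cols, function(x) !all(is.na(x)), logical(1))

keep_by_name
keep_by_content
```

The two masks disagree on the fourth column: the name-based test would discard it even though it contains values, which is why the content-based test is the safer choice.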

Source: https://stackoverflow.com/questions/38088329/remove-empty-columns-from-read-csv

Tags

dplyr

readr