Question
I am trying to read the csv file linked here using read_csv() from the readr package, and then remove empty columns.
If I use read.csv() instead, then the empty columns 8:12 can easily be removed using:
library(dplyr)
select(data, 1:7)
However, when I read the csv file using the read_csv() function, the same code gives an error:
Error: found duplicated column name: NA, NA, NA, NA
How can I remove these empty columns?
It seems pointless to properly name empty columns just so I can remove them. I would prefer to use read_csv() rather than read.csv(), as it makes life a bit easier later on in the analysis.
Answer 1:
You could do:
data <- data[,apply(data, 2, function(x) { sum(!is.na(x)) > 0 })]
This will keep only the columns that are not entirely NA.
Or, if you have dplyr 0.5 installed, you can use the new select_if function to achieve the same effect:
has_data <- function(x) { sum(!is.na(x)) > 0 }
data <- data %>% select_if(has_data)
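For readers on a newer dplyr, the same idea can be sketched with where() inside select(), which I believe is the preferred spelling from dplyr 1.0.0 onward. The toy data frame and its column names below are made up for illustration and are not the file from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for the real csv file
data <- data.frame(
  id    = 1:3,
  value = c("a", NA, "c"),  # partly missing, so it should be kept
  empty = NA                # entirely NA, so it should be dropped
)

# TRUE for a column that holds at least one non-NA value
has_data <- function(x) !all(is.na(x))

# Keep only the columns for which has_data() returns TRUE
cleaned <- data %>% select(where(has_data))

names(cleaned)
```

Because the columns are selected by their contents rather than by name, this should sidestep the duplicated-name error from the question.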
Answer 2:
I'm not sure about read_csv, but if you use read.csv and specify colClasses as "NULL" for the columns you don't want, you'll get what you're after (adjust the integers in the rep calls as needed):
read.csv( file = [yourfile],
colClasses = c( rep("character",3), rep("NULL",5) )
)
The above will return only the first 3 columns, and disregard the following 5 columns.
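As a small self-contained check of the colClasses approach, here is a sketch that reads an inline string instead of a file. The csv text and column names are invented for the example, and the counts in the rep calls are adjusted to match it (3 kept, 2 dropped).

```r
# Hypothetical csv content: three real columns followed by two empty ones
csv_text <- "name,city,code,x1,x2\nAnn,Oslo,A1,,\nBob,Rome,B2,,\n"

# "NULL" (as a string) tells read.csv to skip that column entirely
dat <- read.csv(
  text = csv_text,
  colClasses = c(rep("character", 3), rep("NULL", 2))
)

names(dat)
ncol(dat)
```

The skipped columns never make it into the data frame, so there is nothing to remove afterwards.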
ALTERNATIVE ANSWER:
Have you tried fread from the data.table package? It has a select argument, which might be useful for you, e.g.:
fread( [filename], select = c(1:3) )
It also has the benefit of being quite a bit faster than read.csv and read_csv. Here's a speed test with a particular file I have:
microbenchmark::microbenchmark(
fread = {rangerdata2 <- data.table::fread( filename, select = c(1:3) )},
read.csv = {rangerdata2 <- utils::read.csv( file = filename )[,1:3]},
read_csv = {rangerdata2 <- readr::read_csv( file = filename )[,1:3]},
times = 1000)
Unit: milliseconds
     expr      min       lq      mean    median        uq      max neval cld
    fread  1.22161  1.32841  1.464724  1.377178  1.442089 14.57102  1000 a
 read.csv 18.25402 18.55992 19.664278 18.772855 19.565684 34.87589  1000   c
 read_csv 13.43166 13.76704 14.615746 13.975987 14.608822 33.36244  1000  b
Answer 3:
Once your csv file is loaded into R as a data frame, you could do (assuming your data frame is called dat):
dat = dat[, sapply(dat, function(i) !all(is.na(i)))]
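One caveat with the line above: if only a single column survives, dat[, keep] collapses to a plain vector. A minimal sketch of the same filter with drop = FALSE, on a made-up data frame (the column names are hypothetical), could look like this:

```r
# Hypothetical toy data: two columns with data, two that are entirely NA
dat <- data.frame(
  id     = 1:3,
  score  = c(2.5, NA, 4),
  empty1 = NA,
  empty2 = NA_character_
)

# One logical per column: TRUE if the column has any non-NA value
keep <- vapply(dat, function(col) !all(is.na(col)), logical(1))

# drop = FALSE keeps the result a data frame even if one column remains
cleaned <- dat[, keep, drop = FALSE]

names(cleaned)
```

vapply is used here instead of sapply only so that the result is guaranteed to be a logical vector; the filtering logic is the same.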
Initially, I was thinking that if you use read_csv you could do:
dat = dat[, !is.na(names(dat))]
because read_csv sets the names of all the empty columns to NA. However, this could be dangerous. If a column has no name in the first row but does contain some data, its name would also be NA and it would be deleted as well.
Source: https://stackoverflow.com/questions/38088329/remove-empty-columns-from-read-csv