问题
Let's say I have a CSV with dozens or hundreds of columns and I want to pull in just about 2 or 3 columns. I know about the colClasses
solution as described here but the code gets very unreadable.
I want something like usecols
from pandas' read_csv.
Loading everything and just selecting afterwards is not a solution (the file is super big, it doesn't fit in memory).
回答1:
I will use package data.table
and then with fread()
specify columns to keep/drop by arguments select
or drop
. From ?fread
select Vector of column names or numbers to keep, drop the rest.
drop Vector of column names or numbers to drop, keep the rest.
Best!
回答2:
One way is to use package sqldf
. If you know SQL, it is possible to read in large files filtering only the parts you want.
I will use built-in dataset iris
to make the example reproducible, saving it to disk first.
write.csv(iris, "iris.csv", row.names = FALSE)
Now the problem.
Argument row.names
is like in the write.csv
instruction.
Note the backticks around Sepal.Length
. This is due to the dot character in the column name.
library(sqldf)
sql <- "select `Sepal.Length`, Species from file"
sub_iris <- read.csv.sql("iris.csv", sql = sql, row.names = FALSE)
head(sub_iris)
# Sepal.Length Species
#1 5.1 "setosa"
#2 4.9 "setosa"
#3 4.7 "setosa"
#4 4.6 "setosa"
#5 5.0 "setosa"
#6 5.4 "setosa"
And final clean up.
unlink("iris.csv")
来源:https://stackoverflow.com/questions/54611277/read-csv-in-r-and-filter-columns-by-name