Add selection criteria to read.table

Question

Let's take the following simplified version of a dataset that I import using read.table:

a<-as.data.frame(c("M","M","F","F","F"))
b<-as.data.frame(c(25,22,33,17,18))
df<-cbind(a,b)
colnames(df)<-c("Sex","Age")

In reality my dataset is extremely large and I'm only interested in a small proportion of the data, i.e. the rows for females aged 18 or under. In the example above this would be just the last 2 observations.

My question is: can I import just these observations directly, without importing the rest of the data and then using subset to refine it? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time consuming.

Is there a better solution?

Answer 1:

Some approaches that might work:

1 - Use a package like ff that can help you with RAM issues.

2 - Use other tools/languages to clean your data before loading it into R.

3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to a .RData file and read from this file (instead of calling read.table):

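 # save each txt file once...
 save.rdata = function(filepath, filebin) {
     dataset = read.table(filepath)
     # `file` must be named; otherwise save() treats the path as an object name
     save(dataset, file = paste(filebin, ".RData", sep = ""))
 }

 # then read from the .RData
 get.dataset = function(filebin) {
     # load() restores `dataset` into this function's environment
     load(paste(filebin, ".RData", sep = ""))
     return(dataset)
 }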

This is much faster than reading from a text file, but I'm not sure whether it applies to your case.
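
If you would rather stay in base R, the chunked reading mentioned in the question can be combined with the filter, so that only the matching rows are ever kept in memory. Below is a minimal sketch, not a tuned implementation: read_filtered and its arguments are names made up for illustration, and it assumes a comma-separated file with a header row and column types you know in advance.

```r
# Hypothetical helper: read a delimited text file chunk by chunk and keep
# only the rows for which `keep(chunk)` is TRUE.
read_filtered <- function(path, keep, col_classes, chunk_size = 10000, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once to get the column names.
  col_names <- scan(text = readLines(con, n = 1), what = "", sep = sep, quiet = TRUE)
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Fixed colClasses stop a chunk containing only "F" from being read as logical FALSE.
    chunk <- read.table(text = lines, sep = sep, col.names = col_names,
                        colClasses = col_classes)
    pieces[[length(pieces) + 1]] <- chunk[which(keep(chunk)), , drop = FALSE]
  }
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Demo with the toy data from the question, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Sex = c("M", "M", "F", "F", "F"), Age = c(25, 22, 33, 17, 18)),
          tmp, quote = FALSE, row.names = FALSE)
young_f <- read_filtered(tmp,
                         keep = function(d) d$Sex == "F" & d$Age <= 18,
                         col_classes = c("character", "numeric"),
                         chunk_size = 2)
print(young_f)
```

The chunk size is deliberately tiny here so that the demo exercises several chunks; for a real file you would pick the largest value your memory comfortably allows.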

Answer 2:

There should be several ways to do this. Here is one using SQL.

library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")

> result
  Sex Age
1   F  17
2   F  18

There is also a read.csv.sql function, which applies a SQL statement like the one above while reading the file, so you avoid loading the whole text file into R!
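
For the data in the question, that could look roughly like the sketch below. The temporary file stands in for your real data file (an assumption for the demo), and the strings are written unquoted so that the literal 'F' in the query matches the values in the file.

```r
library(sqldf)

# Toy data from the question, written to a temporary CSV (stand-in for the real file).
df <- data.frame(Sex = c("M", "M", "F", "F", "F"), Age = c(25, 22, 33, 17, 18))
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, quote = FALSE, row.names = FALSE)

# Inside the SQL statement the file is referred to as the table `file`.
young_f <- read.csv.sql(tmp, sql = "select * from file where Sex = 'F' and Age <= 18")
print(young_f)
```

Only the rows selected by the query are returned to R as a data frame.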

Answer 3:

This is almost the same as @Drew75's answer, but I'm including it to illustrate some gotchas with SQLite:

# example: large-ish data.frame
df <- data.frame(Sex=sample(c("M","F"),1e6,replace=TRUE),
                 Age=sample(18:75,1e6,replace=TRUE))
write.csv(df, "myData.csv", quote=FALSE, row.names=FALSE)  # note: non-quoted strings

library(sqldf)
myData <- read.csv.sql(file="myData.csv",       # looks for char M (no quotes)
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127   (exact count varies, since the sample is random)

write.csv(df, "myData.csv", row.names=FALSE)    # quoted strings...
myData <- read.csv.sql(file="myData.csv",       # this fails
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0
myData <- read.csv.sql(file="myData.csv",       # need quotes in the char literal
                       sql="select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127

Source: https://stackoverflow.com/questions/21486402/add-selection-crteria-to-read-table

Tags

import

read.table