Question
Take the following simplified version of a dataset that I import using read.table:
a<-as.data.frame(c("M","M","F","F","F"))
b<-as.data.frame(c(25,22,33,17,18))
df<-cbind(a,b)
colnames(df)<-c("Sex","Age")
In reality my dataset is extremely large, and I'm only interested in a small proportion of the data, i.e. the data concerning females aged 18 or under. In the example above this would be just the last 2 observations.
My question is: can I import just these observations directly, without importing the rest of the data and then using subset to refine my dataset? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time-consuming.
Is there a better solution?
Answer 1:
Some approaches that might work:
1 - Use a package like ff that can help you with RAM issues.
2 - Use other tools/languages to clean your data before loading it into R.
3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to a .RData file and read from this file (instead of calling read.table):
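# save each txt file once...
save.rdata = function(filepath, filebin) {
  dataset = read.table(filepath)
  # the target must be passed as file=, otherwise save() treats it as an object name
  save(dataset, file = paste(filebin, ".RData", sep = ""))
}
# then read from the .RData
get.dataset = function(filebin) {
  # load() restores the object under its saved name, `dataset`
  load(paste(filebin, ".RData", sep = ""))
  return(dataset)
}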
This is much faster than reading from a text file, but I'm not sure if it applies to your case.
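For the chunked route mentioned in the question, one base-R option is to filter each chunk as it is read, so that only the matching rows accumulate in memory. The sketch below is illustrative only: read.filtered, keep and chunk.size are hypothetical names that do not appear in the original, and it assumes a whitespace-delimited file with a header line and no blank lines. colClasses is passed explicitly because read.table would otherwise convert a chunk whose Sex column contains only "F" to logical FALSE.

```r
# Hypothetical helper (not from the original answer): read a whitespace-delimited
# file with a header line in chunks, keeping only the rows selected by keep().
read.filtered = function(filepath, keep, chunk.size = 10000, ...) {
  con = file(filepath, open = "r")
  on.exit(close(con))
  # the first line holds the column names
  header = scan(text = readLines(con, n = 1), what = "", quiet = TRUE)
  pieces = list()
  repeat {
    lines = readLines(con, n = chunk.size)
    if (length(lines) == 0) break
    # extra arguments such as colClasses are forwarded to read.table
    chunk = read.table(text = lines, col.names = header, ...)
    pieces[[length(pieces) + 1]] = chunk[keep(chunk), , drop = FALSE]
  }
  result = do.call(rbind, pieces)
  rownames(result) = NULL
  result
}

# usage on the toy data from the question, written to a temporary file
tmp = tempfile(fileext = ".txt")
write.table(data.frame(Sex = c("M", "M", "F", "F", "F"),
                       Age = c(25, 22, 33, 17, 18)),
            tmp, row.names = FALSE, quote = FALSE)
females = read.filtered(tmp, function(d) d$Sex == "F" & d$Age <= 18,
                        chunk.size = 2,
                        colClasses = c("character", "numeric"))
print(females)
```

This still scans the whole file once, so it mainly saves memory rather than time; a larger chunk.size reduces the per-chunk overhead.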
Answer 2:
There should be several ways to do this. Here is one using SQL.
library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")
> result
  Sex Age
1   F  17
2   F  18
There is also a read.csv.sql function that you can filter with the above statement to avoid reading in the whole text file!
Answer 3:
This is almost the same as @Drew75's answer, but I'm including it to illustrate some gotchas with SQLite:
# example: large-ish data.frame
df <- data.frame(Sex=sample(c("M","F"),1e6,replace=TRUE),
                 Age=sample(18:75,1e6,replace=TRUE))
write.csv(df, "myData.csv", quote=FALSE, row.names=FALSE) # note: non-quoted strings
library(sqldf)
myData <- read.csv.sql(file="myData.csv",   # looks for char M (no quotes)
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127
write.csv(df, "myData.csv", row.names=FALSE) # quoted strings...
myData <- read.csv.sql(file="myData.csv",   # this fails: no rows match
                       sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0
myData <- read.csv.sql(file="myData.csv",   # need quotes in the char literal
                       sql="select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127
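The reason for the gotcha above becomes visible when looking at the raw lines that write.csv produces with and without quote=FALSE. The following is a small base-R sketch; df.small and the temporary file names are made up for illustration and are not part of the original answer.

```r
# Illustrative only: compare the raw text written with and without quoting.
df.small = data.frame(Sex = c("M", "F"), Age = c(25, 17))
unquoted = tempfile(fileext = ".csv")
quoted = tempfile(fileext = ".csv")
write.csv(df.small, unquoted, quote = FALSE, row.names = FALSE)
write.csv(df.small, quoted, row.names = FALSE)  # default: strings are quoted
unquoted.lines = readLines(unquoted)
quoted.lines = readLines(quoted)
print(unquoted.lines)  # data lines look like M,25
print(quoted.lines)    # data lines look like "M",25
```

In the quoted file the double quotes are part of the text on disk, which is why the SQL literal has to include them when the file is queried as in the example above.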
Source: https://stackoverflow.com/questions/21486402/add-selection-crteria-to-read-table