R: Is there a way to subset a file while reading

甜味超标 2021-01-14 14:10

I have a huge .csv file; its size is ~1.4 GB and reading it with read.csv takes a long time. There are several variables in that file, and all I want is to extract the rows where one of them, Variables, equals 'X', without reading the whole file first.

3 Answers
  • 2021-01-14 14:51

    Just wondering if this works. It worked for my code, but I am not sure whether it first reads in the entire file and then subsets, or whether it only reads the part of the file where Variables == 'X'.

    library(data.table)
    temp <- fread('dat.csv')[Variables == 'X']
    
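    For what it's worth, fread() parses the whole file into memory first, and the [Variables == 'X'] subset happens afterwards, so this is a full read followed by an in-memory filter. If you want to filter while reading, fread's cmd argument can pipe the file through a shell tool first. A rough sketch, assuming a Unix-like system with awk available and that Variables is the first column of dat.csv:

    library(data.table)
    
    # awk keeps the header line (NR==1) plus rows whose first column is "X",
    # so fread only ever sees the matching rows
    temp <- fread(cmd = "awk -F',' 'NR==1 || $1 == \"X\"' dat.csv")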
  • 2021-01-14 14:53

    Check out the LaF package; it allows you to read very large text files in blocks, so you don't have to read the entire file into memory.

    library(LaF)
    
    data_model <- detect_dm_csv("yourFile.csv", header = TRUE) # detects the file structure; header = TRUE keeps the column names
    dat <- laf_open(data_model) # opens connection to the file
    
    # the hard-coded 100,000 assumes the file has at most that many rows;
    # adjust it to your file (LaF::determine_nlines() can count them)
    block_list <- lapply(seq(1, 100000, 1000), function(row_num){
        goto(dat, row_num)                           # jump to the start of this block
        data_block <- next_block(dat, nrows = 1000)  # read a block of 1000 rows
        data_block <- data_block[data_block$Variables == "X", ]  # keep matching rows
        return(data_block)
    })
    your_df <- do.call("rbind", block_list)
    

    Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you may have to adapt my solution to your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
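
    The same block-filter-rbind idea also works in base R with an open connection, if you'd rather avoid an extra package. A rough sketch, assuming a simple comma-separated file with a header row and a column named Variables (you may need to adapt it to your data):

    con <- file("yourFile.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header row
    
    chunks <- list()
    repeat {
        # read.csv on an open connection continues where the last read stopped;
        # it errors at end of file, which we catch and treat as "done"
        block <- tryCatch(
            read.csv(con, header = FALSE, col.names = col_names, nrows = 1000),
            error = function(e) NULL
        )
        if (is.null(block) || nrow(block) == 0) break
        chunks[[length(chunks) + 1]] <- block[block$Variables == "X", ]
    }
    close(con)
    your_df <- do.call("rbind", chunks)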

  • 2021-01-14 14:57

    I would say that most of the time you can probably just read in the entire file, and then subset within R:

    df <- read.csv(file = "path/to/your/file.csv", header = TRUE)
    df.x <- df[df$Variables == 'X', ]  # subset after the full read
    

    R operates completely in memory, so the exception to what I said above is a file whose total size is so massive that it cannot fit into memory, even though the subset of interest can.
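
    If you do land in that exceptional case (the full file won't fit, but the subset will), the sqldf package's read.csv.sql() is worth a look: it stages the file in a temporary on-disk SQLite database and returns only the rows matching a SQL query, so the full table never has to exist as an R data.frame. A rough sketch, assuming your column is literally named Variables:

    library(sqldf)
    
    # the file is imported into a temporary SQLite database on disk, queried
    # there, and only the matching rows come back into R
    df.x <- read.csv.sql("path/to/your/file.csv",
                         sql = "select * from file where Variables = 'X'")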
