R: Is there a way to subset a file while reading

甜味超标 2021-01-14 14:10

I have a huge .csv file; its size is ~1.4 GB and reading it with read.csv takes a long time. There are several variables in that file, and all I want is to extract the rows where one of them, Variables, equals 'X', without reading the whole file first.

3 Answers
  • 2021-01-14 14:51

    Just wondering if this works. It worked for my code, but I am not sure whether it first reads in the entire file and then subsets, or whether it only reads the part of the file where Variables == 'X'.

    library(data.table)
    temp <- fread('dat.csv')[Variables == 'X']
    
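    For what it's worth, fread() parses the whole file into memory first, and the [Variables == 'X'] subset happens afterwards, so this is a full read followed by an in-memory filter. If you want to filter while reading, fread's cmd argument can pipe the file through a shell tool first. A rough sketch, assuming a Unix-like system with awk available and that Variables is the first column of dat.csv:

    library(data.table)
    
    # awk keeps the header line (NR==1) plus rows whose first column is "X",
    # so fread only ever sees the matching rows
    temp <- fread(cmd = "awk -F',' 'NR==1 || $1 == \"X\"' dat.csv")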
  • 2021-01-14 14:53

    Check out the LaF package; it allows you to read very large text files in blocks, so you don't have to read the entire file into memory.

    library(LaF)
    
    data_model <- detect_dm_csv("yourFile.csv", header = TRUE) # detects the file structure; header = TRUE keeps the column names
    dat <- laf_open(data_model) # opens connection to the file
    
    # the hard-coded 100,000 assumes the file has at most that many rows;
    # adjust it to your file (LaF::determine_nlines() can count them)
    block_list <- lapply(seq(1, 100000, 1000), function(row_num){
        goto(dat, row_num)                           # jump to the start of this block
        data_block <- next_block(dat, nrows = 1000)  # read a block of 1000 rows
        data_block <- data_block[data_block$Variables == "X", ]  # keep matching rows
        return(data_block)
    })
    your_df <- do.call("rbind", block_list)
    

    Admittedly, the package sometimes feels a bit bulky, and in some situations I had to find small hacks to get my results (you may have to adapt my solution to your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
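
    The same block-filter-rbind idea also works in base R with an open connection, if you'd rather avoid an extra package. A rough sketch, assuming a simple comma-separated file with a header row and a column named Variables (you may need to adapt it to your data):

    con <- file("yourFile.csv", open = "r")
    col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header row
    
    chunks <- list()
    repeat {
        # read.csv on an open connection continues where the last read stopped;
        # it errors at end of file, which we catch and treat as "done"
        block <- tryCatch(
            read.csv(con, header = FALSE, col.names = col_names, nrows = 1000),
            error = function(e) NULL
        )
        if (is.null(block) || nrow(block) == 0) break
        chunks[[length(chunks) + 1]] <- block[block$Variables == "X", ]
    }
    close(con)
    your_df <- do.call("rbind", chunks)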

  • 2021-01-14 14:57

    I would say that most of the time you can probably just read in the entire file, and then subset within R:

    df <- read.csv(file = "path/to/your/file.csv", header = TRUE)
    df.x <- df[df$Variables == 'X', ]  # subset after the full read
    

    R operates completely in memory, so the exception to what I said above is a file whose total size is so massive that it cannot fit into memory, even though the subset of interest can.
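
    If you do land in that exceptional case (the full file won't fit, but the subset will), the sqldf package's read.csv.sql() is worth a look: it stages the file in a temporary on-disk SQLite database and returns only the rows matching a SQL query, so the full table never has to exist as an R data.frame. A rough sketch, assuming your column is literally named Variables:

    library(sqldf)
    
    # the file is imported into a temporary SQLite database on disk, queried
    # there, and only the matching rows come back into R
    df.x <- read.csv.sql("path/to/your/file.csv",
                         sql = "select * from file where Variables = 'X'")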
