Reading specific rows of large matrix data file

粉色の甜心 2021-02-09 22:50

Suppose I have a gigantic m*n matrix X (that is too big to read into memory) and a binary numeric vector V of length m. My objective is to read into R only the rows of X for which V == 1, without ever loading the whole file.
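
For concreteness, the answers below assume the matrix has been written to target.csv. A minimal setup sketch (sizes shrunk here so it fits in memory; the names X, V and target.csv match those used in the answers):

    set.seed(1)
    m <- 10; n <- 5
    X <- matrix(rnorm(m * n), nrow = m, ncol = n)   # stand-in for the huge matrix
    V <- rbinom(m, 1, 0.8)                          # binary vector: 1 = keep this row
    write.csv(X, "target.csv")                      # the file the answers read back in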

4 answers
  • 2021-02-09 22:54

    I think you can use the sqldf package to achieve what you want. sqldf reads the csv file directly into an SQLite database, bypassing the R environment altogether.

    library(sqldf)
    
    Xfile <- file('target.csv')
    sqlcmd <- paste0('select * from Xfile where rowid in (', paste(which(V==1), collapse=','), ')')
    sqldf(sqlcmd, file.format=list(header=TRUE))
    

    or:

    library(sqldf)
    
    Vdf <- data.frame(V)
    sqlcmd <- "select file.* from file, Vdf on file.rowid = Vdf.rowid and V = 1"
    read.csv.sql("target.csv", sql = sqlcmd)
    
  • 2021-02-09 22:55

    ffdfindexget from the ff package is what you are looking for:

    The function ffdfindexget lets you extract rows from an ffdf data.frame according to positive integer subscripts stored in an ff vector.

    So in your example:

    write.csv(X,"target.csv")
    d <- read.csv.ffdf(file="target.csv")
    i <- ff(which(V==1))
    di <- ffdfindexget(d, i)
    
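    If the selected subset is small enough to hold in RAM, it can be turned back into an ordinary data frame (a minimal follow-up, using the objects from the snippet above):

    di_ram <- as.data.frame(di)   # materialise just the selected rows in memory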
  • 2021-02-09 23:07

    How about using the command-line tool sed, constructing a call that lists the line numbers you want to read? I am not sure whether there is a command-length limit on this (if there is, a workaround is sketched at the end of this answer)...

    #  Check the data
    head( X )
    #           [,1]        [,2]       [,3]       [,4]        [,5]
    #[1,]  0.2588798  0.42229528  0.4469073  1.0684309  1.35519389
    #[2,]  1.0267562  0.80299223 -0.2768111 -0.7017247 -0.06575137
    #[3,]  1.0110365 -0.36998260 -0.8543176  1.6237827 -1.33320291
    #[4,]  1.5968757  2.13831188  0.6978655 -0.5697239 -1.53799156
    #[5,]  0.1284392  0.55596342  0.6919573  0.6558735 -1.69494827
    #[6,] -0.2406540 -0.04807381 -1.1265165 -0.9917737  0.31186670
    
    #  Check V, note row 6 above should be skipped according to this....
    head(V)
    # [1] 1 1 1 1 1 0
    
    #  Get line numbers we want to read
    head( which( V == 1 ) )
    # [1] 1 2 3 4 5 7
    
    #  Read the first 6 rows where V == 1 (adding 1 to the line numbers from
    #  which() to step over the header row, and also requesting line 1, the header itself)
    lines <- c( 1 , which( V == 1 )[1:6] + 1 )
    cmd   <- paste0( "sed -n '" , paste0( lines , collapse = "p; " ) , "p' C:/Data/target.csv" )
    read.csv( pipe( cmd ) , header = TRUE )
    
    #  X        V1         V2         V3         V4          V5
    #1 1 0.2588798  0.4222953  0.4469073  1.0684309  1.35519389
    #2 2 1.0267562  0.8029922 -0.2768111 -0.7017247 -0.06575137
    #3 3 1.0110365 -0.3699826 -0.8543176  1.6237827 -1.33320291
    #4 4 1.5968757  2.1383119  0.6978655 -0.5697239 -1.53799156
    #5 5 0.1284392  0.5559634  0.6919573  0.6558735 -1.69494827
    #6 7 0.6856038  0.1082029  0.1523561 -1.4147429 -0.64041290
    

    The command we are actually passing to sed is...

     "sed -n '1p; 2p; 3p; 4p; 5p; 6p; 8p' C:/Data/target.csv"
    

    We use -n to turn off the default printing of every line, then pass a semicolon-separated list of the line numbers we do want printed, given to us by which( V == 1 ), followed by the target filename. Remember these line numbers have been offset by +1 to account for the line that makes up the header row.
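    If the list of selected lines ever gets long enough to hit a command-length limit, one workaround (a sketch, not part of the original answer) is to write the line selections to a sed script file and pass it with -f instead of putting them on the command line:

    #  One 'Np' command per wanted line: the header (line 1) plus the data rows,
    #  offset by +1 to step over the header
    wanted <- c( 1 , which( V == 1 ) + 1 )
    script <- tempfile( fileext = ".sed" )
    writeLines( paste0( wanted , "p" ) , script )
    
    #  -n suppresses default printing, -f reads the print commands from the script file
    read.csv( pipe( paste( "sed -n -f" , script , "C:/Data/target.csv" ) ) , header = TRUE )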

  • 2021-02-09 23:08

    A viable strategy is to import the CSV file into a database; R supports connections to most of them. Go with MonetDB if you're feeling bleeding edge and speed matters, or SQLite (or whatever is handy) otherwise.

    Then you can specify the appropriate subset using SQL and read that into R.
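
    A minimal sketch of that workflow with SQLite via DBI and RSQLite (file and table names are illustrative, and it assumes RSQLite's ability to import a CSV by passing the file path to dbWriteTable, so the full matrix never passes through R):

    library(DBI)
    library(RSQLite)
    
    con <- dbConnect(SQLite(), "target.db")
    
    # Import the CSV directly into SQLite (file path passed as `value`)
    dbWriteTable(con, "X", "target.csv", header = TRUE, sep = ",")
    
    # Fetch only the rows flagged by V; SQLite's rowid starts at 1
    wanted <- paste(which(V == 1), collapse = ",")
    Xsub   <- dbGetQuery(con, sprintf("SELECT * FROM X WHERE rowid IN (%s)", wanted))
    
    dbDisconnect(con)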
