Is it possible to get the number of rows in a CSV file without opening it?

逝去的感伤 2020-12-03 10:13

I have a CSV file of about 1 GB, and because my laptop has a basic configuration, I'm not able to open the file in Excel or R. But out of curiosity, I would like to get the numb

4 answers
  • 2020-12-03 10:27

    Here is something I used:

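    # Open a read-only connection so that successive readLines() calls
    # continue where the previous call stopped
    testcon <- file("xyzfile.csv", open = "r")
    readsizeof <- 20000   # number of lines to read per chunk
    nooflines <- 0
    # Read chunk by chunk until readLines() returns nothing
    while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
      nooflines <- nooflines + linesread
    }
    close(testcon)
    nooflines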
    

    Check out this post for more: https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
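
    A variant of the same chunked idea, offered only as a sketch: instead of parsing lines, read raw bytes and count the newline bytes. The function name count_newlines and the chunk_bytes default are made up for illustration. Like wc -l, it counts "\n" bytes, so a last line without a trailing newline is not counted.

    ```r
    # Hypothetical helper: count "\n" bytes in a file, one chunk at a time
    count_newlines <- function(path, chunk_bytes = 1e6) {
      con <- file(path, open = "rb")          # binary mode: no text decoding
      on.exit(close(con))
      n <- 0
      repeat {
        chunk <- readBin(con, what = "raw", n = chunk_bytes)
        if (length(chunk) == 0L) break        # end of file reached
        n <- n + sum(chunk == as.raw(10L))    # byte 0x0A is "\n"
      }
      n
    }

    # Small made-up file to try it on
    tmp <- tempfile(fileext = ".csv")
    writeLines(c("id,value", "1,10", "2,20"), tmp)
    count_newlines(tmp)
    # [1] 3
    ```

    Subtract 1 from the result if the file has a header row.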

  • 2020-12-03 10:33

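    Estimate the number of lines from the size of the first 1000 lines:

    lines1000 <- readLines(con = "dgrp2.tgeno", n = 1000)
    # Bytes in the sample; +1 per line for the "\n" that readLines() strips
    size1000  <- sum(nchar(lines1000, type = "bytes") + 1)

    sizetotal <- file.size("dgrp2.tgeno")
    # length(lines1000) is 1000 unless the file is shorter than that
    length(lines1000) * sizetotal / size1000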
    

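    This is usually good enough for most purposes, provided line lengths are roughly uniform across the file, and it is a lot faster for huge files because only the first 1000 lines are read.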
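
    To see how close the estimate gets, here is a small self-contained check on a generated file. The function name estimate_lines and the test data are made up for illustration. The sketch counts bytes rather than characters and adds one byte per line for the newline that readLines() drops, which assumes "\n" line endings.

    ```r
    # Hypothetical helper: extrapolate the line count from the first n_sample lines
    estimate_lines <- function(path, n_sample = 1000) {
      sample_lines <- readLines(path, n = n_sample)
      # +1 per line: assumes every line ends in a single "\n" byte
      sample_bytes <- sum(nchar(sample_lines, type = "bytes") + 1)
      length(sample_lines) * file.size(path) / sample_bytes
    }

    # Write 5000 lines of identical length, in binary mode so that
    # line endings are "\n" on every platform
    tmp <- tempfile(fileext = ".csv")
    con <- file(tmp, open = "wb")
    writeLines(sprintf("%05d,%05d", 1:5000, 5000:1), con)
    close(con)

    estimate_lines(tmp)
    # [1] 5000
    ```

    The estimate is exact here only because every line has the same length. On real data the error grows with how much the first lines differ from the rest.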

  • 2020-12-03 10:38

    For Linux/Unix:

    wc -l filename
    

    For Windows:

    find /c /v "A String that is extremely unlikely to occur" filename
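
    If you would rather stay inside R, the Linux/Unix command can be called from there. This is only a sketch: it assumes a Unix-like system with wc on the PATH, and the function name wc_lines is made up for illustration.

    ```r
    # Hypothetical wrapper around `wc -l` (Unix-like systems only)
    wc_lines <- function(path) {
      stopifnot(nzchar(Sys.which("wc")))            # fail early if wc is missing
      out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
      # Output looks like "  3 /path/to/file"; keep the leading number
      as.numeric(strsplit(trimws(out), "[[:space:]]+")[[1]][1])
    }

    # Small made-up file to try it on
    tmp <- tempfile(fileext = ".csv")
    writeLines(c("id,value", "1,10", "2,20"), tmp)
    wc_lines(tmp)
    ```

    wc -l counts newline characters, so the header row is included, and a final line without a trailing newline is not counted.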
    
  • 2020-12-03 10:40

    Option 1:

    count.fields() reads the file through a connection and returns the number of fields on each line, based on some sep value (which we don't care about here). So the length of that result should, in theory, be the number of lines (and rows) in the file.

    length(count.fields(filename))
    

    If you have a header row, you can skip it with skip = 1:

    length(count.fields(filename, skip = 1))
    

    There are other arguments that you can adjust for your specific needs, like skipping blank lines.

    args(count.fields)
    # function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE, 
    #     comment.char = "#") 
    # NULL
    

    See help(count.fields) for more.
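
    As a rough illustration of how those arguments change the count, here is a sketch on a small made-up file with one blank line in the middle. The file contents are an assumption chosen for the example.

    ```r
    # Made-up file: header, one data row, a blank line, another data row
    tmp <- tempfile(fileext = ".csv")
    writeLines(c("id,value", "1,10", "", "2,20"), tmp)

    # Default blank.lines.skip = TRUE: the blank line is not counted
    length(count.fields(tmp, sep = ","))
    # [1] 3

    # blank.lines.skip = FALSE: the blank line is counted (with 0 fields)
    length(count.fields(tmp, sep = ",", blank.lines.skip = FALSE))
    # [1] 4

    # skip = 1 drops the header as well
    length(count.fields(tmp, sep = ",", skip = 1))
    # [1] 2
    ```

    So whether the result matches the physical line count or the number of data rows depends on which of these options you set.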

    It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.

    nrow(data.table::fread("Batting.csv"))
    # [1] 99846
    
    system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
    #   user  system elapsed 
    #  0.528   0.000   0.503 
    
    l
    # [1] 99846
    file.info("Batting.csv")$size
    # [1] 6153740
    

    (The more efficient) Option 2: Another idea is to use data.table::fread() to read the first column only, then take the number of rows. This would be very fast.

    system.time(nrow(data.table::fread("Batting.csv", select = 1L)))
    #   user  system elapsed 
    #  0.063   0.000   0.063 
    