Get the number of lines in a text file using R

伪装坚强ぢ 2020-12-05 10:51

Is there a way to get the number of lines in a file without importing it?

So far this is what I am doing

myfiles <- list.files(pattern = "\\.dat$")  # pattern is a regular expression, not a glob
         


        
5 Answers
  • 2020-12-05 11:08

    You can count the number of newline characters (\n, will also work for \r\n on Windows) in a file. This will give you a correct answer iff:

    1. There is a newline char at the end of last line (BTW, read.csv gives a warning if this doesn't hold)
    2. The table does not contain a newline character in the data (e.g. within quotes)

    It suffices to read the file in chunks. Below I set the chunk (temporary buffer) size to 65536 bytes:

    f <- file("filename.csv", open="rb")
    nlines <- 0L
    while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
       nlines <- nlines + sum(chunk == as.raw(10L))
    }
    print(nlines)
    close(f)
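
    The two conditions above can be relaxed a little. The following is a minimal sketch, not part of the original answer: it wraps the same readBin loop in a function (the name count_lines and the chunk_size argument are made up for illustration) and adds one to the count when the file does not end with a newline. It still miscounts if the data contains embedded newlines within quotes.

```r
# Hypothetical helper: count lines by counting LF bytes, chunk by chunk.
count_lines <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))            # always release the connection
  nlines <- 0                    # double, so very large files cannot overflow
  last_byte <- as.raw(10L)       # treat an empty file as "properly terminated"
  repeat {
    chunk <- readBin(con, "raw", chunk_size)
    if (length(chunk) == 0L) break
    nlines <- nlines + sum(chunk == as.raw(10L))
    last_byte <- chunk[length(chunk)]
  }
  # A final line without a trailing newline still counts as a line.
  if (last_byte != as.raw(10L)) nlines <- nlines + 1
  nlines
}
```

    With the question's file list, something like vapply(myfiles, count_lines, numeric(1)) should return one count per file.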
    

    Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:

    • readBin: ca. 2.4 s.

    • @luis_js's wc-based solution: 0.1 s.

    • read.delim: 39.6 s.

    • EDIT: reading a file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.

  • 2020-12-05 11:13

    If you:

    • still want to avoid the system call that a system2("wc"… will cause
    • are on BSD/Linux or OS X (I didn't test the following on Windows)
    • don't mind using a full file path
    • are comfortable using the inline package

    then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):

    library(inline)
    
    wc.code <- "
    uintmax_t linect = 0; 
    uintmax_t tlinect = 0;
    
    int fd, len;
    u_char *p;
    
    struct statfs fsb;
    
    static off_t buf_size = SMALL_BUF_SIZE;
    static u_char small_buf[SMALL_BUF_SIZE];
    static u_char *buf = small_buf;
    
    PROTECT(f = AS_CHARACTER(f));
    
    if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
    
      if (fstatfs(fd, &fsb)) {
        fsb.f_iosize = SMALL_BUF_SIZE;
      }
    
      if (fsb.f_iosize != buf_size) {
        if (buf != small_buf) {
          free(buf);
        }
        if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
          buf = small_buf;
          buf_size = SMALL_BUF_SIZE;
        } else {
          buf_size = fsb.f_iosize;
        }
      }
    
      while ((len = read(fd, buf, buf_size))) {
    
        if (len == -1) {
          /* read error: stop counting; fd is closed once after the loop */
          break;
        }
    
        for (p = buf; len--; ++p)
          if (*p == '\\n')
            ++linect;
      }
    
      tlinect += linect;
    
      (void)close(fd);
    
    }
    SEXP result;
    PROTECT(result = NEW_INTEGER(1));
    INTEGER(result)[0] = tlinect;
    UNPROTECT(2);
    return(result);
    ";
    
    setCMethod("wc",
               signature(f="character"), 
               wc.code,
               includes=c("#include <stdlib.h>", 
                          "#include <stdio.h>",
                          "#include <sys/param.h>",
                          "#include <sys/mount.h>",
                          "#include <sys/stat.h>",
                          "#include <ctype.h>",
                          "#include <err.h>",
                          "#include <errno.h>",
                          "#include <fcntl.h>",
                          "#include <locale.h>",
                          "#include <stdint.h>",
                          "#include <string.h>",
                          "#include <unistd.h>",
                          "#include <wchar.h>",
                          "#include <wctype.h>",
                          "#define SMALL_BUF_SIZE (1024 * 8)"),
               language="C",
               convention=".Call")
    
    wc("FULLPATHTOFILE")
    

    It'd be better as a package since it actually has to compile the first time through. But, it's here for reference if you really do need "speed". For a 189,955 line file I had lying around, I get (mean values from a bunch of runs):

       user  system elapsed 
      0.007   0.003   0.010 
    
  • 2020-12-05 11:15

    I found this easy way using the R.utils package:

    library(R.utils)
    sapply(myfiles,countLines)
    

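    As far as I can tell from its documentation, countLines reads the file in chunks and counts newline sequences rather than parsing the contents.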

  • 2020-12-05 11:16

    If you are using Linux, this might work for you:

    # total number of lines in a file via a system call to wc, filtered with awk
    target_file   <- "your_file_name_here"
    total_records <- as.integer(system2("wc",
                                        args = c("-l",
                                                 shQuote(target_file),  # quote so paths with spaces work
                                                 " | awk '{print $1}'"),
                                        stdout = TRUE))
    

    In your case:

    # apply the same call to every file in myfiles
    lapply(myfiles, function(x) {
      as.integer(system2("wc",
                         args = c("-l",
                                  shQuote(x),
                                  " | awk '{print $1}'"),
                         stdout = TRUE))
    })
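
    If awk is not available, the count can also be extracted in R. This is only a sketch, assuming a Unix-like system with wc on the PATH; the function name count_lines_wc is made up for illustration. It relies on wc -l printing the count as the first whitespace-separated field, possibly padded with leading spaces.

```r
# Hypothetical helper: call wc -l and parse its output in R instead of awk.
count_lines_wc <- function(path) {
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  # Output looks like "   42 /path/to/file"; keep the first field only.
  fields <- strsplit(trimws(out[1]), "[[:space:]]+")[[1]]
  as.integer(fields[1])
}
```

    Like wc itself, this counts newline characters, so a last line without a trailing newline is not included. For several files, vapply(myfiles, count_lines_wc, integer(1)) should give a named integer vector.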
    
  • 2020-12-05 11:20

    Maybe I am missing something, but I usually do it by calling length on the result of readLines:

    con <- file("some_file.format")
    n <- length(readLines(con))
    close(con)  # release the connection once done
    n
    

    This has at least worked in many cases I had. It's fairly fast for small files, but note that readLines does read the whole file into memory as a character vector, so it only avoids parsing the file, not loading it.
