Is there a way to get the number of lines in a file without importing it?
So far this is what I am doing:
myfiles <- list.files(pattern="*.dat")
You can count the number of newline characters (\n, will also work for \r\n on Windows) in a file. This will give you a correct answer iff there is a newline character at the end of the last line (read.csv gives a warning if this doesn't hold).
It'll suffice to read the file in parts. Below I set a chunk (temporary buffer) size of 65536 bytes:
f <- file("filename.csv", open="rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
nlines <- nlines + sum(chunk == as.raw(10L))
}
print(nlines)
close(f)
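The loop above counts newline bytes only, so a final line that lacks a trailing newline is not counted. A minimal sketch of a reusable variant that also handles that case is shown here; the function name count_lines_raw and its chunk_size parameter are hypothetical, introduced only for illustration, and the count is kept as a double on the assumption that very large files could overflow an integer:

```r
# Hypothetical helper: count lines by scanning raw bytes in chunks.
count_lines_raw <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)   # close the connection even on error
  newline <- as.raw(10L)            # byte value of "\n"
  n <- 0                            # double, to avoid integer overflow
  last_byte <- newline              # an empty file then yields 0 lines
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n <- n + sum(chunk == newline)
    last_byte <- chunk[length(chunk)]
  }
  # Count an unterminated final line as a line too.
  if (last_byte != newline) n <- n + 1
  n
}
```

Like the original loop, this still miscounts if a quoted field contains an embedded newline, because it never parses the file contents.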
Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:
readBin: ca. 2.4 s.
@luis_js's wc-based solution: 0.1 s.
read.delim: 39.6 s.
EDIT: reading a file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.
If you want to avoid the overhead that a call to system2("wc"… will cause and you can use the inline package, then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):
library(inline)
wc.code <- "
uintmax_t linect = 0;
uintmax_t tlinect = 0;
int fd, len;
u_char *p;
struct statfs fsb;
static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;
PROTECT(f = AS_CHARACTER(f));
if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
if (fstatfs(fd, &fsb)) {
fsb.f_iosize = SMALL_BUF_SIZE;
}
if (fsb.f_iosize != buf_size) {
if (buf != small_buf) {
free(buf);
}
if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
buf = small_buf;
buf_size = SMALL_BUF_SIZE;
} else {
buf_size = fsb.f_iosize;
}
}
while ((len = read(fd, buf, buf_size))) {
if (len == -1) {
(void)close(fd);
break;
}
for (p = buf; len--; ++p)
if (*p == '\\n')
++linect;
}
tlinect += linect;
(void)close(fd);
}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";
setCMethod("wc",
signature(f="character"),
wc.code,
includes=c("#include <stdlib.h>",
"#include <stdio.h>",
"#include <sys/param.h>",
"#include <sys/mount.h>",
"#include <sys/stat.h>",
"#include <ctype.h>",
"#include <err.h>",
"#include <errno.h>",
"#include <fcntl.h>",
"#include <locale.h>",
"#include <stdint.h>",
"#include <string.h>",
"#include <unistd.h>",
"#include <wchar.h>",
"#include <wctype.h>",
"#define SMALL_BUF_SIZE (1024 * 8)"),
language="C",
convention=".Call")
wc("FULLPATHTOFILE")
It'd be better as a package since it actually has to compile the first time through. But it's here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean values from a bunch of runs):
user system elapsed
0.007 0.003 0.010
I found this easy way using the R.utils package:
library(R.utils)
sapply(myfiles,countLines)
If you are using Linux, this might work for you:
# total lines in a file through a system call to wc, filtering the output with awk
target_file <- "your_file_name_here"
total_records <- as.integer(system2("wc",
args = c("-l",
target_file,
" | awk '{print $1}'"),
stdout = TRUE))
In your case:
lapply(myfiles, function(x){
as.integer(system2("wc",
args = c("-l",
x,
" | awk '{print $1}'"),
stdout = TRUE))
}
)
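The awk step can be avoided by taking the first field of the wc output in R. A minimal sketch, assuming a Unix-like system with wc on the PATH; count_lines_wc is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: line count via the external wc utility.
count_lines_wc <- function(path) {
  # shQuote() protects paths that contain spaces or shell metacharacters.
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  # wc prints "<count> <path>", sometimes with leading spaces;
  # keep only the first whitespace-separated field.
  first_field <- strsplit(trimws(out), "[[:space:]]+")[[1]][1]
  as.numeric(first_field)
}
```

Applied to the file list from the question, this would be vapply(myfiles, count_lines_wc, numeric(1)). As with any newline-based count, a final line without a trailing newline is not included.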
Maybe I am missing something, but usually I do it using length on top of readLines:
con <- file("some_file.format")
length(readLines(con))
close(con)
This has worked in many cases for me. I think it's fairly fast, and although readLines does read the whole file into memory as a character vector, it does not parse it into a data frame.