Is it possible to get the number of rows in a CSV file without opening it?


I have a CSV file of size ~1 GB, and as my laptop has only a basic configuration, I'm not able to open the file in Excel or R. But out of curiosity, I would like to get the number of rows in the file.

4 Answers
  • 2020-12-03 10:27

    Here is something I used:

    testcon <- file("xyzfile.csv", open = "r")
    readsizeof <- 20000                 # lines to read per chunk
    nooflines <- 0
    while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
      nooflines <- nooflines + linesread
    }
    close(testcon)
    nooflines
    

    Check out this post for more: https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/
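
    A minimal variant of the same approach, wrapped in a reusable function (the name count_lines_chunked is mine, not from the linked post; on.exit() ensures the connection is closed even if reading fails):

    count_lines_chunked <- function(path, chunk = 20000) {
      con <- file(path, open = "r")
      on.exit(close(con))               # close even if readLines() errors
      n <- 0
      while ((k <- length(readLines(con, chunk))) > 0) {
        n <- n + k
      }
      n
    }
    count_lines_chunked("xyzfile.csv")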

  • 2020-12-03 10:33

    Estimate the number of lines from the size of the first 1000 lines:

    # Bytes in the first 1000 lines; nchar() does not count the newline
    # characters, so add one byte per line
    size1000  <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000))) + 1000
    
    sizetotal <- file.size("dgrp2.tgeno")
    1000 *  sizetotal / size1000
    

    This is usually good enough for most purposes, and it is a lot faster for huge files.
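
    For reuse, the same estimate can be wrapped in a small helper; a minimal sketch, where the function name estimate_nlines and the + n newline correction are my own:

    estimate_nlines <- function(path, n = 1000) {
      # Bytes in the first n lines, plus n for the newline
      # characters that nchar() does not count
      sample_bytes <- sum(nchar(readLines(path, n = n))) + n
      n * file.size(path) / sample_bytes
    }
    estimate_nlines("dgrp2.tgeno")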

  • 2020-12-03 10:38

    For Linux/Unix:

    wc -l filename
    

    For Windows:

    find /c /v "A String that is extremely unlikely to occur" filename
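
    If you are already working in R on Linux/macOS, the same wc count can be run from the R session; a sketch assuming wc is on the PATH (filename is a placeholder, as above):

    out <- system2("wc", args = c("-l", "filename"), stdout = TRUE)
    # wc prints "<count> <filename>", so keep only the first field
    as.integer(strsplit(trimws(out), "\\s+")[[1]][1])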
    
  • 2020-12-03 10:40

    Option 1:

    Through a file connection, count.fields() counts the number of fields per line of the file based on some sep value (that we don't care about here). So if we take the length of that result, theoretically we should end up with the number of lines (and rows) in the file.

    length(count.fields(filename))
    

    If you have a header row, you can skip it with skip = 1:

    length(count.fields(filename, skip = 1))
    

    There are other arguments that you can adjust for your specific needs, like skipping blank lines.

    args(count.fields)
    # function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE, 
    #     comment.char = "#") 
    # NULL
    

    See help(count.fields) for more.

    It's not too bad as far as speed goes. I tested it on one of my baseball files that contains 99846 rows.

    nrow(data.table::fread("Batting.csv"))
    # [1] 99846
    
    system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
    #   user  system elapsed 
    #  0.528   0.000   0.503 
    
    l
    # [1] 99846
    file.info("Batting.csv")$size
    # [1] 6153740
    

    Option 2 (the more efficient one): Another idea is to use data.table::fread() to read only the first column, then take the number of rows. This is very fast.

    system.time(nrow(data.table::fread("Batting.csv", select = 1L)))
    #   user  system elapsed 
    #  0.063   0.000   0.063 
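
    Note that fread() parses the CSV properly, so unlike wc -l or a raw readLines() count it is not thrown off by newlines embedded in quoted fields, at the cost of scanning the whole file once.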
    