One way to get the number of lines in a file in PowerShell is this method:
PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a=Get-Content .\sub.ps1
The first thing to try is to stream Get-Content and build up the line count one line at a time, rather than storing all lines in an array at once. I think this will give proper streaming behavior, i.e. the entire file will not be in memory at once, just the current line.
$lines = 0
Get-Content .\File.txt |%{ $lines++ }
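And as the other answer suggests, adding -ReadCount could speed this up. Note that with a -ReadCount greater than 1, each pipeline item is a batch of lines, so add $_.Count to the total rather than incrementing by one.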
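As a minimal sketch of the batched variant (the temp file name and the batch size of 1000 are just assumptions for illustration, not something from the answer above):

```powershell
# Hypothetical sample file so the sketch is self-contained.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
1..2500 | ForEach-Object { "line $_" } | Set-Content -Path $path

$lines = 0
# -ReadCount 1000 sends arrays of up to 1000 lines down the pipeline,
# so add the size of each batch instead of incrementing by one.
Get-Content -Path $path -ReadCount 1000 | ForEach-Object { $lines += $_.Count }
$lines
```

Larger batches mean fewer pipeline iterations but more lines held in memory at once, so the best value probably depends on your file.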
If that doesn't work for you (too slow or too much memory), you could go directly to a StreamReader:
$count = 0
$reader = New-Object IO.StreamReader 'c:\logs\MyLog.txt'
while($reader.ReadLine() -ne $null){ $count++ }
$reader.Close() # Don't forget to do this. Ideally put this in a try/finally block to make sure it happens.
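The try/finally version mentioned in the comment could look roughly like this. The function name Get-LineCount and the demo file are hypothetical, and the path is resolved first on the assumption that you may pass a relative path (.NET resolves those against the process working directory, not PowerShell's current location):

```powershell
function Get-LineCount {   # hypothetical helper name
    param([Parameter(Mandatory)][string] $Path)

    # Turn a possibly relative PowerShell path into a full file system path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader $fullPath
    try {
        $count = 0
        # ReadLine() returns $null once the end of the file is reached.
        while ($null -ne $reader.ReadLine()) { $count++ }
        return $count
    }
    finally {
        # Runs even if reading throws or the script is interrupted.
        $reader.Dispose()
    }
}

# Usage with a throwaway three-line file:
$demo = Join-Path ([System.IO.Path]::GetTempPath()) 'streamreader-demo.txt'
Set-Content -Path $demo -Value 'one', 'two', 'three'
$demoCount = Get-LineCount -Path $demo
$demoCount
```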
Here's another solution that uses .NET:
[Linq.Enumerable]::Count([System.IO.File]::ReadLines("FileToCount.txt"))
It's not very interruptible, but it's very easy on memory.
For some of my huge files (GB+), SWITCH was faster and easy on the memory.
Note: Timing below is in minutes:seconds. Testing was on a file with 14,564,836 lines, each 906 characters long.
1:27 SWITCH
$count = 0; switch -File $filepath { default { ++$count } }
1:39 IO.StreamReader
$count = 0; $reader = New-Object IO.StreamReader $filepath
while($reader.ReadLine() -ne $null){ $count++ }
$reader.Close()
1:42 Linq
$count = [Linq.Enumerable]::Count([System.IO.File]::ReadLines($filepath))
1:46 Get-Content based
$filepath |% {$file_line_count = 0; Get-Content -Path $_ -ReadCount 1000 |% { $file_line_count += $_.Count }}
If you have optimizations for any of the methods or other approaches you've found to be faster, please share.
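If you want to repeat the comparison on your own files, something like the following sketch might help. It is not the harness used for the timings above; the small generated file and the variable names are assumptions, and you would point $filepath at a real large file to get meaningful numbers.

```powershell
# Hypothetical small sample file; replace with your own large file.
$filepath = Join-Path ([System.IO.Path]::GetTempPath()) 'timing-demo.txt'
1..5000 | ForEach-Object { "line $_" } | Set-Content -Path $filepath

# Each script block returns the line count using one of the methods above.
$methods = [ordered]@{
    'switch'      = { $c = 0; switch -File $filepath { default { ++$c } }; $c }
    'Linq'        = { [Linq.Enumerable]::Count([System.IO.File]::ReadLines($filepath)) }
    'Get-Content' = { $c = 0; Get-Content -Path $filepath -ReadCount 1000 | ForEach-Object { $c += $_.Count }; $c }
}

$timings = foreach ($name in $methods.Keys) {
    $count = $null
    # Measure-Command runs the block in the current scope, so $count is kept.
    $elapsed = Measure-Command { $count = & $methods[$name] }
    [pscustomobject]@{ Method = $name; Count = $count; Elapsed = $elapsed }
}
$timings | Format-Table -AutoSize
```

Run each method more than once if you can, since the first pass over a file may include disk caching effects.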
Here's a PowerShell script I cobbled together which demonstrates a few different methods of counting lines in a text file, along with the time and memory required for each method. The results (below) show clear differences in the time and memory requirements. For my tests, it looks like the sweet spot was Get-Content, using a ReadCount setting of 100. The other tests required significantly more time and/or memory usage.
#$testFile = 'C:\test_small.csv' # 245 lines, 150 KB
#$testFile = 'C:\test_medium.csv' # 95,365 lines, 104 MB
$testFile = 'C:\test_large.csv' # 285,776 lines, 308 MB
# Using ArrayList just because they are faster than Powershell arrays, for some operations with large arrays.
$results = New-Object System.Collections.ArrayList
function AddResult {
param( [string] $sMethod, [string] $iCount )
$result = New-Object -TypeName PSObject -Property @{
"Method" = $sMethod
"Count" = $iCount
"Elapsed Time" = ((Get-Date) - $dtStart)
"Memory Total" = [System.Math]::Round((GetMemoryUsage)/1mb, 1)
"Memory Delta" = [System.Math]::Round(((GetMemoryUsage) - $dMemStart)/1mb, 1)
}
[void]$results.Add($result)
Write-Output "$sMethod : $iCount"
[System.GC]::Collect()
}
function GetMemoryUsage {
# return ((Get-Process -Id $pid).PrivateMemorySize)
return ([System.GC]::GetTotalMemory($false))
}
# Get-Content -ReadCount 1
[System.GC]::Collect()
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
Get-Content -Path $testFile -ReadCount 1 |% { $count++ }
AddResult "Get-Content -ReadCount 1" $count
# Get-Content -ReadCount 10,100,1000,0
# Note: ReadCount = 1 returns a string. Any other value returns an array of strings.
# Thus, the Count property only applies when ReadCount is not 1.
@(10,100,1000,0) |% {
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
Get-Content -Path $testFile -ReadCount $_ |% { $count += $_.Count }
AddResult "Get-Content -ReadCount $_" $count
}
# Get-Content | Measure-Object
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1 | Measure-Object -line).Lines
AddResult "Get-Content -ReadCount 1 | Measure-Object" $count
# Get-Content.Count
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = (Get-Content -Path $testFile -ReadCount 1).Count
AddResult "Get-Content.Count" $count
# StreamReader.ReadLine
$dMemStart = GetMemoryUsage
$dtStart = Get-Date
$count = 0
# Use this constructor to avoid file access errors, like Get-Content does.
$stream = New-Object -TypeName System.IO.FileStream(
$testFile,
[System.IO.FileMode]::Open,
[System.IO.FileAccess]::Read,
[System.IO.FileShare]::ReadWrite)
if ($stream) {
$reader = New-Object IO.StreamReader $stream
if ($reader) {
while(-not ($reader.EndOfStream)) { [void]$reader.ReadLine(); $count++ }
$reader.Close()
}
$stream.Close()
}
AddResult "StreamReader.ReadLine" $count
$results | Select Method, Count, "Elapsed Time", "Memory Total", "Memory Delta" | ft -auto | Write-Output
Here are the results for a text file containing ~95k lines, 104 MB:
Method                                     Count  Elapsed Time      Memory Total  Memory Delta
------                                     -----  ------------      ------------  ------------
Get-Content -ReadCount 1                   95365  00:00:11.1451841  45.8          0.2
Get-Content -ReadCount 10                  95365  00:00:02.9015023  47.3          1.7
Get-Content -ReadCount 100                 95365  00:00:01.4522507  59.9          14.3
Get-Content -ReadCount 1000                95365  00:00:01.1539634  75.4          29.7
Get-Content -ReadCount 0                   95365  00:00:01.3888746  346           300.4
Get-Content -ReadCount 1 | Measure-Object  95365  00:00:08.6867159  46.2          0.6
Get-Content.Count                          95365  00:00:03.0574433  465.8         420.1
StreamReader.ReadLine                      95365  00:00:02.5740262  46.2          0.6
Here are results for a larger file (containing ~285k lines, 308 MB):
Method                                     Count   Elapsed Time      Memory Total  Memory Delta
------                                     -----   ------------      ------------  ------------
Get-Content -ReadCount 1                   285776  00:00:36.2280995  46.3          0.8
Get-Content -ReadCount 10                  285776  00:00:06.3486006  46.3          0.7
Get-Content -ReadCount 100                 285776  00:00:03.1590055  55.1          9.5
Get-Content -ReadCount 1000                285776  00:00:02.8381262  88.1          42.4
Get-Content -ReadCount 0                   285776  00:00:29.4240734  894.5         848.8
Get-Content -ReadCount 1 | Measure-Object  285776  00:00:32.7905971  46.5          0.9
Get-Content.Count                          285776  00:00:28.4504388  1219.8        1174.2
StreamReader.ReadLine                      285776  00:00:20.4495721  46            0.4
Here is a one-liner based on Pseudothink's post.
Rows in one specific file:
"the_name_of_your_file.txt" |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"}
All files in current dir (individually):
Get-ChildItem "." |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"}
Explanation:
"the_name_of_your_file.txt"
-> does nothing by itself; it just supplies the filename to the next step, and needs to be quoted
|%
-> alias for ForEach-Object; iterates over the items piped in (just one in this case), with the current item available as $_
$n = $_
-> saves the name of the file from $_ into $n for later (this may not actually be needed)
$c = 0
-> initialises $c as the line count
Get-Content -Path $_ -ReadCount 1000
-> reads the file in batches of up to 1000 lines (see the other answers in this thread)
|%
-> for each batch, adds the number of lines actually read to $c (e.g. 1000 + 1000 + 123)
"$n; $c"
-> once the file has been read, prints the file name and the line count
Get-ChildItem "."
-> just puts more items into the pipeline than the single filename did
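If the one-liner is hard to read, the same steps can be spelled out as a small function. This is only a sketch: the name Get-FileLineCount, the demo directory, and the use of -File to skip subdirectories are my assumptions, not part of the one-liner above.

```powershell
function Get-FileLineCount {   # hypothetical helper name
    param([Parameter(Mandatory, ValueFromPipeline)][string] $Path)
    process {
        $c = 0
        # Each pipeline item is a batch of up to 1000 lines; sum the batch sizes.
        Get-Content -LiteralPath $Path -ReadCount 1000 | ForEach-Object { $c += $_.Count }
        "$Path; $c"
    }
}

# Usage with a throwaway directory containing one three-line file:
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'linecount-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content -Path (Join-Path $dir 'a.txt') -Value 'one', 'two', 'three'
$report = Get-ChildItem -Path $dir -File | ForEach-Object { $_.FullName } | Get-FileLineCount
$report
```

Passing full paths avoids surprises when the files are not in the current directory.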
Here is something I wrote to try to lessen the memory usage when trimming the whitespace in my txt file. That said, the memory usage still gets fairly high, but the process takes less time to run.
Just to give you some background on my file: it had over 2 million records, with leading and trailing whitespace on each line. I believe the total time was 5+ minutes.
$testing = 'C:\Users\something\something\test3.txt'
$filecleanup = Get-ChildItem $testing
foreach ($file in $filecleanup)
{
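# -ReadCount 1000 passes lines in batches; .Trim() is applied to every string in a batch.
$file1 = Get-Content $file.FullName -ReadCount 1000 | foreach { $_.Trim() }
# Write the trimmed lines back to the file that was just read.
$file1 > $file.FullName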
}
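One possible way to bring the memory usage down further is to avoid collecting the trimmed lines in a variable and stream them into a temporary file instead, then swap the files at the end. This is a sketch under my own assumptions (the demo file, the .tmp suffix, and the use of Set-Content rather than the > redirection), so check the output encoding suits your data before using it on real files.

```powershell
# Hypothetical sample file with leading and trailing whitespace.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'trim-demo.txt'
Set-Content -Path $source -Value '  alpha  ', "`tbeta ", 'gamma'

# Stream trimmed batches straight into a temp file, so only about
# 1000 lines are held in memory at any time.
$temp = "$source.tmp"
Get-Content -Path $source -ReadCount 1000 |
    ForEach-Object { $_.Trim() } |
    Set-Content -Path $temp

# Replace the original only after the temp file has been fully written.
Remove-Item -LiteralPath $source
Move-Item -LiteralPath $temp -Destination $source
```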