I have 265 CSV files with over 4 million total records (lines), and I need to do a search and replace in all of the CSV files. I have a snippet of my PowerShell code below that does the replacement.
Give this PowerShell script a try. It should perform much better, and it uses far less RAM too, since the file is read through a buffered stream.
$reader = [IO.File]::OpenText("C:\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\output.csv")
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $line2 = $line -replace $SearchStr, $ReplaceStr
    $writer.WriteLine($line2)
}
$reader.Close()
$writer.Close()
This processes one file, but you can test performance with it and, if it's acceptable, add it to a loop.
Alternatively, you can use Get-Content to read a number of lines into memory at a time, perform the replacement, and write each updated chunk out through the PowerShell pipeline:
Get-Content "C:\input.csv" -ReadCount 512 | % {
    $_ -replace $SearchStr, $ReplaceStr
} | Set-Content "C:\output.csv"
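For comparison, the same chunked pattern can be sketched in Python (a hypothetical sketch, not from the original post; the batch size and sample strings are made up):

```python
import io
from itertools import islice

def batched_lines(reader, batch_size):
    """Yield lists of up to batch_size lines, like Get-Content -ReadCount."""
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# StringIO stands in for open() file handles so the sketch is self-contained.
src = io.StringIO("a lamb\nb lamb\nc lamb\n")
dst = io.StringIO()
for batch in batched_lines(src, 2):
    dst.writelines(line.replace("lamb", "goat") for line in batch)
```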
To squeeze out a little more performance, you can also compile the regex (-replace uses regular expressions) like this:
$re = New-Object Regex $SearchStr, 'Compiled'
$re.Replace($_, $ReplaceStr)
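Python's rough counterpart is re.compile — not the same as .NET's 'Compiled' option (which emits IL), but the same idea of paying the pattern-parsing cost once instead of on every call (the sample pattern and strings here are made up):

```python
import re

pattern = re.compile(r"l[a-z]+b")   # parsed once, reused for every line
lines = ["Mary had a little lamb", "no match here"]
out = [pattern.sub("goat", line) for line in lines]
```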
Actually, I'm faced with a similar issue right now. At my new job, I have to parse huge text files to pull information based on certain criteria. The PowerShell script (optimized to the brim) takes 4 hours to return a fully processed CSV file. We wrote another Python script that took just under 1 hour...
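The post doesn't include that Python script, but a streaming line-by-line version of this kind of job might look something like this (a hypothetical sketch; the function name and sample data are mine, and StringIO stands in for real file handles):

```python
import io

def stream_replace(reader, writer, search_str, replace_str):
    """Copy reader to writer line by line, doing a literal (non-regex) replace."""
    for line in reader:
        writer.write(line.replace(search_str, replace_str))

# In a real job, reader and writer would be open() file handles.
src = io.StringIO("Mary had a little lamb\nits fleece was white as snow\n")
dst = io.StringIO()
stream_replace(src, dst, "lamb", "goat")
```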
As much as I love PowerShell, I was heartbroken. For your amusement, try this:
PowerShell:
$num = 0
$string = "Mary had a little lamb"
while ($num -lt 1000000) {
    $string = $string.ToUpper()
    $string = $string.ToLower()
    Write-Host $string
    $num++
}
Python:
num = 0
string = "Mary had a little lamb"
while num < 1000000:
    string = string.lower()
    string = string.upper()
    print(string)
    num += 1
and trigger the two jobs. You can even wrap each one in Measure-Command { } to keep it "scientific".
I see this a lot:
$content | foreach {$_ -replace $SearchStr, $ReplaceStr}
The -replace operator will handle an entire array at once:
$content -replace $SearchStr, $ReplaceStr
and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.
I don't know Python, but it looks like the Python script is doing literal string replacements, whereas PowerShell's -replace operator is a regular-expression search and replace. I would convert the PowerShell to use the Replace method on the string class (which also answers the original question: I think your PowerShell is inefficient).
ForEach ($file in Get-ChildItem C:\temp\csv\*.csv)
{
    $content = Get-Content -Path $file
    # look close, not much changes
    $content | foreach { $_.Replace($SearchStr, $ReplaceStr) } | Set-Content $file
}
EDIT: Upon further review, I think I see another (perhaps more important) difference between the versions. The Python version appears to read the entire file into a single string; the PowerShell version, on the other hand, reads it into an array of strings.
The help on Get-Content mentions a ReadCount parameter that can affect performance. Setting it to 0 (or a negative value) reads the entire file into a single array. This means you are passing one array through the pipeline instead of individual strings, but a simple change to the code will deal with that:
# $content is now an array
$content | % { $_ } | % {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
If you want to read the entire file into a single string like the Python version seems to, just call the .NET method directly:
# now you have to make sure to use a FULL RESOLVED PATH
$content = [System.IO.File]::ReadAllText($file.FullName)
$content.Replace($SearchStr, $ReplaceStr) | Set-Content $file
This is not quite as "PowerShell-y", since it uses the .NET APIs directly instead of the corresponding cmdlets, but the ability is there for the times you need it.
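For reference, the whole-file read the Python version presumably does is equally short in Python itself (a sketch with made-up content and a temp file standing in for the real CSV path):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "input.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("a,lamb\nb,lamb\n")

    # Read the entire file into one string -- the analogue of
    # [System.IO.File]::ReadAllText -- then replace and write back.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("lamb", "goat"))

    with open(path, encoding="utf-8") as f:
        result = f.read()
```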
You may want to try the following command:
gci C:\temp\csv\*.csv | % { (gc $_) -replace $SearchStr, $ReplaceStr | out-file $_}
In addition, some search strings may contain regex metacharacters, so you should run them through [regex]::Escape to treat them literally. Note that only the search pattern needs escaping; the replacement string is not a regex (a literal $ in it should be written as $$, since $ introduces group substitutions). The code would look like:
gci C:\temp\csv\*.csv | % { (gc $_) -replace [regex]::Escape($SearchStr), $ReplaceStr | out-file $_ }
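The same pitfall exists in Python's re module, which makes it easy to demonstrate why escaping the pattern matters (the sample strings are made up):

```python
import re

search_str = "1.5"      # '.' is a regex metacharacter
replace_str = "2.0"
line = "price,1x5,1.5"

# Unescaped, '.' matches any character, so "1x5" is clobbered too.
unescaped = re.sub(search_str, replace_str, line)
# Escaping only the pattern gives the intended literal match.
escaped = re.sub(re.escape(search_str), replace_str, line)
```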