PowerShell is slow (much slower than Python) in large search/replace operations?

长情又很酷 2021-02-02 12:41

I have 265 CSV files with over 4 million total records (lines), and need to do a search and replace in all the CSV files. I have a snippet of my PowerShell code below that does

5 Answers
  • 2021-02-02 12:50

    Give this PowerShell script a try. It should perform much better, and it uses far less RAM because the file is streamed line by line instead of being loaded whole.

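    # $SearchStr and $ReplaceStr are assumed to be set as in the question
    $reader = [IO.File]::OpenText("C:\input.csv")
    $writer = New-Object System.IO.StreamWriter("C:\output.csv")

    try {
        # ReadLine() returns $null once the end of the file is reached
        while ($null -ne ($line = $reader.ReadLine())) {
            $line2 = $line -replace $SearchStr, $ReplaceStr
            $writer.WriteLine($line2)
        }
    }
    finally {
        # Close both streams even if the loop throws
        $reader.Close()
        $writer.Close()
    }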
    

    This processes one file, but you can test performance with it and, if it's more acceptable, add it to a loop.
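
    For the Python side of the comparison, the same streaming idea wrapped in a loop over files might look like the sketch below. The function names, the temporary ".tmp" suffix, and the UTF-8 encoding are assumptions for illustration, not part of the original scripts, and the replacement here is a literal one rather than a regex.

```python
from pathlib import Path


def replace_in_file(src: Path, dst: Path, search: str, replacement: str) -> None:
    """Stream src to dst line by line, replacing a literal substring."""
    # newline="" keeps the original line endings untouched
    with src.open("r", encoding="utf-8", newline="") as reader, \
         dst.open("w", encoding="utf-8", newline="") as writer:
        for line in reader:
            writer.write(line.replace(search, replacement))


def replace_in_folder(folder: Path, search: str, replacement: str) -> int:
    """Rewrite every *.csv in folder via a temporary file; return the file count."""
    count = 0
    for src in sorted(folder.glob("*.csv")):
        tmp = src.with_name(src.name + ".tmp")  # hypothetical temp-file naming
        replace_in_file(src, tmp, search, replacement)
        tmp.replace(src)  # swap the rewritten file into place
        count += 1
    return count
```

    Only one line is held in memory at a time, so memory use stays flat regardless of file size.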

    Alternatively you can use Get-Content to read a number of lines into memory, perform the replacement and then write the updated chunk utilizing the PowerShell pipeline.

    Get-Content "C:\input.csv" -ReadCount 512 | % {
        $_ -replace $SearchStr, $ReplaceStr
    } | Set-Content "C:\output.csv"
    

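    To squeeze out a little more performance, you can also compile the regex once (-replace uses regular expressions) and call it on each line. Note that Replace expects a single string, so apply it per line rather than to a chunk produced by -ReadCount: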

    $re = New-Object Regex $SearchStr, 'Compiled'
    $re.Replace( $_ , $ReplaceStr )
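
    The same compile-once idea can be sketched in Python for comparison. The helper name make_replacer, the date pattern, and the sample rows are made up for illustration. Python's re module also caches recently used patterns internally, so the gain from compiling explicitly may be small.

```python
import re


def make_replacer(pattern: str, replacement: str):
    """Compile pattern once and return a function that applies it to one line."""
    compiled = re.compile(pattern)  # compiled a single time, outside the loop

    def replace_line(line: str) -> str:
        return compiled.sub(replacement, line)

    return replace_line


# Hypothetical sample data: replace ISO-style dates with a placeholder
replace_line = make_replacer(r"\d{4}-\d{2}-\d{2}", "DATE")
lines = ["id,created", "1,2021-02-02", "2,2020-12-31"]
print([replace_line(line) for line in lines])
```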
    
  • 2021-02-02 13:00

    Actually, I'm faced with a similar issue right now. In my new job, I have to parse huge text files to pull information based on certain criteria. The PowerShell script (optimized to the brim) takes 4 hours to return a fully processed CSV file. We wrote a Python script for the same task that took just under 1 hour...

    As much as I love PowerShell, I was heartbroken. For your amusement, try this. PowerShell:

    $num = 0
    $string = "Mary had a little lamb"
    
    while($num -lt 1000000){
        $string = $string.ToUpper()
        $string = $string.ToLower()
        Write-Host $string
        $num++
    }
    

    Python:

    num = 0
    string = "Mary had a little lamb"
    
    while num < 1000000:
        string = string.lower()
        string = string.upper()
        print(string)
        num+=1
    

    Then run both scripts. You can even wrap each run in Measure-Command { } to keep it "scientific".

    Also, link, crazy read..

  • 2021-02-02 13:03

    I see this a lot:

    $content | foreach {$_ -replace $SearchStr, $ReplaceStr} 
    

    The -replace operator will handle an entire array at once:

    $content -replace $SearchStr, $ReplaceStr
    

    and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.

  • 2021-02-02 13:17

    I don't know Python, but it looks like you are doing literal string replacements in the Python script. In PowerShell, the -replace operator is a regular expression search/replace. I would convert the PowerShell to use the Replace method on the string class (or, to answer the original question, I think your PowerShell is inefficient).

    ForEach ($file in Get-ChildItem C:\temp\csv\*.csv) 
    {
        $content = Get-Content -path $file
        # look close, not much changes
        $content | foreach {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
    }
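
    To make the regex-versus-literal difference concrete, here is a small Python illustration. The sample line and the search text "1.5" are invented for the example.

```python
import re

line = "1.5,105,1x5"

# Regex replace: "." is a metacharacter that matches any character
regex_result = re.sub("1.5", "X", line)

# Literal replace: "." only matches an actual dot
literal_result = line.replace("1.5", "X")

print(regex_result)    # X,X,X
print(literal_result)  # X,105,1x5
```

    A regex replace can therefore change more than intended, on top of costing more per line than a plain substring search.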
    

    EDIT: Upon further review, I think I see another (perhaps more important) difference between the versions. The Python version appears to read the entire file into a single string. The PowerShell version, on the other hand, reads it into an array of strings.

    The help on Get-Content mentions a ReadCount parameter that can affect the performance. Setting this count to -1 seems to read the entire file into a single array. This will mean that you are passing an array through the pipeline instead of individual strings, but a simple change to the code will deal with that:

    # $content is now an array
    $content | % { $_ } | % {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
    

    If you want to read the entire file into a single string like the Python version seems to, just call the .NET method directly:

    # now you have to make sure to use a FULL RESOLVED PATH
    $content = [System.IO.File]::ReadAllText($file.FullName) 
    $content.Replace($SearchStr, $ReplaceStr) | Set-Content $file
    

    This is not quite as "PowerShell-y" since you use the .NET APIs directly instead of the similar cmdlets, but the ability is there for times when you need it.

  • 2021-02-02 13:17

    You may want to try the following command:

    gci C:\temp\csv\*.csv | % { (gc $_) -replace $SearchStr, $ReplaceStr | out-file $_}
    

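    In addition, the search string may contain characters that are special in a regular expression, so you should use [regex]::Escape to escape it. The replacement string is not a regex and must not be passed through [regex]::Escape (the backslashes would end up in the output); only $ is special there, and it is escaped as $$. The code would look like:

    gci C:\temp\csv\*.csv | % { (gc $_) -replace [regex]::Escape($SearchStr), $ReplaceStr.Replace('$', '$$') | out-file $_ }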
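
    The equivalent precaution in Python, shown here only as a comparison sketch, uses re.escape for the search text. The function name and sample strings are hypothetical. Unlike .NET, Python's re treats backslashes in the replacement specially, so this sketch passes the replacement through a function to keep it literal.

```python
import re


def replace_literal_with_regex(text: str, search: str, replacement: str) -> str:
    """Use the regex engine but treat both search and replacement as literals."""
    pattern = re.escape(search)  # neutralises ., $, (, etc. in the search text
    # A function as the replacement bypasses backslash/group processing
    return re.sub(pattern, lambda match: replacement, text)


print(replace_literal_with_regex("price: $5.00 (usd)", "$5.00", r"\1 EUR"))
```

    When both sides are meant literally, a plain str.replace does the same job without involving the regex engine at all.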
    