How can I make this PowerShell script parse large files faster?

名媛妹妹 2020-12-01 12:58

I have the following PowerShell script that will parse some very large files for ETL purposes. For starters, my test file is ~30 MB. Larger files around 200 MB are expected.

3 Answers
  • 2020-12-01 13:33

    Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).

    Try this (not tested extensively):

    # Source and destination files
    $path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
    $infile = "14SEP11_ProdOrderOperations.txt"
    $outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
    
    # Lines per chunk; larger is faster but uses more memory
    $batch = 1000
    
    # Data rows contain at least three pipe-delimited fields
    [regex]$match_regex = '^\|.+\|.+\|.+'
    # Captures everything between the leading and trailing pipe
    [regex]$replace_regex = '^\|(.+)\|$'
    
    # The first line matching the data pattern is the header row
    $header_line = (Select-String -Path $path\$infile -Pattern $match_regex -List).Line
    
    # Escape it so it can be matched literally and excluded from the data rows
    [regex]$header_regex = [regex]::Escape($header_line)
    
    # Write the trimmed header as the first line of the output file
    $header_line.Trim('|') | Set-Content $path\$outfile
    
    # -ReadCount streams the file in arrays of $batch lines; the array-wise
    # -match, -notmatch and -replace operators then process each chunk at once
    Get-Content $path\$infile -ReadCount $batch |
        ForEach-Object {
            $_ -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1' |
                Out-File $path\$outfile -Append
        }
    

    That's a compromise between memory usage and speed. The -match and -replace operators work on an array, so you can filter and replace an entire array at once without looping through every record (see the short demo below). -ReadCount makes Get-Content emit the file in chunks of $batch records: you read 1000 records at a time, do the match and replace on that batch, and append the result to the output file before going back for the next 1000. Increasing $batch should speed it up, but it will use more memory; adjust it to suit your resources.
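
    If the array-wise operators are unfamiliar, here is a minimal demo (the sample strings are invented for illustration): each operator takes a whole array and returns a new one, so no explicit loop over individual lines is needed.

    $lines = '|PO1|OP10|x|', 'garbage line', '|PO2|OP20|y|'
    # -match on an array filters it down to the matching elements
    $data = $lines -match '^\|.+\|.+\|.+'          # '|PO1|OP10|x|', '|PO2|OP20|y|'
    # -replace on an array transforms every element; here it strips the outer pipes
    $data -replace '^\|(.+)\|$', '$1'              # 'PO1|OP10|x', 'PO2|OP20|y'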

  • 2020-12-01 13:33

    The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:

    $path = 'C:\A-Very-Large-File.txt'
    $r = [IO.File]::OpenText($path)
    while ($r.Peek() -ge 0) {
        $line = $r.ReadLine()
        # Process $line here...
    }
    $r.Dispose()
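
    To guarantee the reader is disposed even if processing throws, wrap the loop in try/finally:

    $r = [IO.File]::OpenText($path)
    try {
        while ($r.Peek() -ge 0) {
            $line = $r.ReadLine()
            # Process $line here...
        }
    }
    finally {
        $r.Dispose()
    }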
    

    Some performance comparisons:

    Measure-Command {Get-Content .\512MB.txt > $null}
    

    Total Seconds: 49.4742533

    Measure-Command {
        $r = [IO.File]::OpenText('512MB.txt')
        while ($r.Peek() -ge 0) {
            $r.ReadLine() > $null
        }
        $r.Dispose()
    }
    

    Total Seconds: 27.666803
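
    Applied to the question's ETL scenario, the same reader can be paired with a StreamWriter so neither the whole input nor the whole output ever sits in memory. A rough sketch, reusing the filter regexes from the first answer (the paths here are placeholders, not from the question):

    $r = [IO.File]::OpenText('E:\input.txt')        # placeholder input path
    $w = [IO.StreamWriter]::new('E:\output.txt')    # placeholder output path
    try {
        while ($r.Peek() -ge 0) {
            $line = $r.ReadLine()
            # Keep only pipe-delimited data rows, stripping the outer pipes
            if ($line -match '^\|.+\|.+\|.+') {
                $w.WriteLine(($line -replace '^\|(.+)\|$', '$1'))
            }
        }
    }
    finally {
        $r.Dispose()
        $w.Dispose()
    }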

  • 2020-12-01 13:46

    This is almost a non-answer... I love PowerShell, but I will not use it to parse log files, especially large log files. Use Microsoft's Log Parser.

    C:\>type input.txt | logparser "select substr(field1,1) from STDIN" -i:TSV -nskiplines:14 -headerrow:off -iseparator:spaces -o:tsv -headers:off -stats:off
    