Memory exception while filtering large CSV files

◇◆丶佛笑我妖孽 submitted on 2020-04-06 23:25:47

Question


I'm getting a memory exception while running this code. Is there a way to filter one file at a time, write the output, and append after processing each file? The code below seems to load everything into memory.

$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
Get-ChildItem $inputFolder -File -Filter '*.csv' |
    ForEach-Object { Import-Csv $_.FullName } |
    Where-Object { $_.machine_type -eq 'workstations' } |
    Export-Csv $outputFile -NoType


Answer 1:


Maybe you can filter your files one by one and append the results to your output file, like this:

$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"

Remove-Item $outputFile -Force -ErrorAction SilentlyContinue

Get-ChildItem $inputFolder -File -Filter "*.csv" | ForEach-Object {
    Import-Csv $_.FullName |
        Where-Object machine_type -eq 'workstations' |
        Export-Csv $outputFile -Append -NoTypeInformation
}



Answer 2:


Note: The reason for not using Get-ChildItem ... | Import-Csv ... - i.e., for not piping Get-ChildItem directly to Import-Csv and instead calling Import-Csv from the script block ({ ... }) of an auxiliary ForEach-Object call - is a bug in Windows PowerShell that has since been fixed in PowerShell Core; see the bottom section for a more concise workaround.

However, even output from ForEach-Object script blocks should stream to the remaining pipeline commands, so you shouldn't run out of memory - after all, a salient feature of the PowerShell pipeline is object-by-object processing, which keeps memory use constant, irrespective of the size of the (streaming) input collection.
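To illustrate that object-by-object behavior, here's a minimal sketch with hypothetical numbers (unrelated to the CSV data): even though a million objects are generated upstream, only two are ever processed, because Select-Object stops the upstream commands once it has what it needs.

```powershell
# Each object travels through the entire pipeline before the next one is
# produced, so memory use stays flat regardless of the input size.
1..1000000 | ForEach-Object { $_ * 2 } | Select-Object -First 2
# -> 2
#    4
```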

You've since confirmed that avoiding the auxiliary ForEach-Object call does not solve the problem, so we still don't know what is causing your out-of-memory exception.

Update:

  • This GitHub issue contains clues as to the reason for excessive memory use, especially with many properties that contain small amounts of data.

  • This GitHub feature request proposes using strongly typed output objects to help mitigate the issue.

The following workaround, which uses the switch statement to process the files as text files, may help:

$header = ''
Get-ChildItem $inputFolder -Filter *.csv | ForEach-Object {
  $i = 0
  switch -Wildcard -File $_.FullName {
    '*workstations*' {
      # NOTE: If no other columns contain the word `workstations`, you can 
      # simplify and speed up the command by omitting the `ConvertFrom-Csv` call 
      # (you can make the wildcard matching more robust with something 
      # like '*,workstations,*')
      if ((ConvertFrom-Csv "$header`n$_").machine_type -ne 'workstations') { continue }
      $_ # row whose 'machine_type' column value equals 'workstations'
    }
    default {
      if ($i++ -eq 0) {
        if ($header) { continue } # header already written
        else { $header = $_; $_ } # header row of 1st file
      }
    }
  }
} | Set-Content $outputFile

Here's a workaround for the bug of not being able to pipe Get-ChildItem output directly to Import-Csv, by passing it as an argument instead:

Import-Csv -LiteralPath (Get-ChildItem $inputFolder -File -Filter *.csv) |
    Where-Object { $_.machine_type -eq 'workstations' } |
    Export-Csv $outputFile -NoType

Note that in PowerShell Core you could more naturally write:

Get-ChildItem $inputFolder -File -Filter *.csv |
    Import-Csv |
    Where-Object { $_.machine_type -eq 'workstations' } |
    Export-Csv $outputFile -NoType



Answer 3:


Solution 2:

$inputFolder = "C:\Change\2019\October"
$outputFile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8  # modify encoding if necessary
$Delimiter=','

# Find the header: take the first row of the first file that contains data
$Header = Get-ChildItem -Path $inputFolder -Filter *.csv |
    Where-Object Length -gt 0 |
    Select-Object -First 1 |
    Get-Content -TotalCount 1

# If no header was found, there is no non-empty file, so quit
if (!$Header) { return }

# Create an array of column names from the header
$HeaderArray = $Header -split $Delimiter -replace '"', ''

# Open the output file
$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)

# Write the header that was found
$w.WriteLine($Header)


# Loop over the CSV files
Get-ChildItem $inputFolder -File -Filter "*.csv" | ForEach-Object {

    # Open the current file for reading
    $r = New-Object System.IO.StreamReader($_.FullName, $encoding)
    $skiprow = $true

    while (($line = $r.ReadLine()) -ne $null)
    {
        # Skip the header row of each file
        if ($skiprow)
        {
            $skiprow = $false
            continue
        }

        # Build an object for the current row using the header found above
        $Object = $line | ConvertFrom-Csv -Header $HeaderArray -Delimiter $Delimiter

        # Write the row to the output file if it matches the requested filter
        if ($Object.machine_type -eq 'workstations') { $w.WriteLine($line) }
    }

    $r.Close()
    $r.Dispose()

}

$w.close()
$w.Dispose()
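The split-and-strip-quotes step that builds $HeaderArray above can be illustrated in isolation; the header line here is hypothetical, just to show what the operators produce:

```powershell
# -split breaks the line on commas, then -replace strips the quotes
# from every element of the resulting array.
'"name","machine_type","os"' -split ',' -replace '"', ''
# -> name
#    machine_type
#    os
```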



Answer 4:


You have to read and write the .csv files one row at a time, using StreamReader and StreamWriter. Note that the machine_type filter must be applied to each parsed row, not to the FileInfo objects returned by Get-ChildItem:

$filepath = "C:\Change\2019\October"
$outputfile = "C:\Change\2019\output.csv"
$encoding = [System.Text.Encoding]::UTF8

$files = Get-ChildItem -Path $filepath -Filter *.csv

$w = New-Object System.IO.StreamWriter($outputfile, $true, $encoding)

$headerArray = $null
foreach ($file in $files)
{
    $r = New-Object System.IO.StreamReader($file.FullName, $encoding)
    $header = $r.ReadLine()   # the first line of each file is its header
    if ($null -eq $headerArray)
    {
        # Write the header only once, and remember the column names
        $w.WriteLine($header)
        $headerArray = $header -split ',' -replace '"', ''
    }
    while (($line = $r.ReadLine()) -ne $null)
    {
        # Parse the row so the machine_type column can be tested
        $row = $line | ConvertFrom-Csv -Header $headerArray
        if ($row.machine_type -eq 'workstations') { $w.WriteLine($line) }
    }
    $r.Close()
    $r.Dispose()
}

$w.Close()
$w.Dispose()



Answer 5:


Get-Content *.csv | Add-Content combined.csv

Make sure combined.csv doesn't exist when you run this, or it's going to go full Ouroboros.
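If you'd rather not have to pre-delete the output, a variant that excludes the output file from the input set avoids the self-inclusion problem (this assumes combined.csv lives in the same folder as the inputs; like the one-liner above, it does not deduplicate headers or filter rows):

```powershell
# Exclude the output file itself so it is never read back in as input
Get-ChildItem *.csv -Exclude combined.csv | Get-Content | Add-Content combined.csv
```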



Source: https://stackoverflow.com/questions/58660818/memory-exception-while-filtering-large-csv-files
