More-efficient way to modify a CSV file's content

问题

I'm attempting to remove some of the detritus that SSMS 2012 generates when a query's results are exported as CSV.

For example, it includes the word 'NULL' for null values and adds milliseconds to datetime values:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0

Unfortunately, Excel doesn't automatically format datetime values with fractional seconds correctly, which lead to confusion amongst the customers ('What happened to the date field that I requested?') and more work for me (having to convert the CSV to XLSX and format the columns correctly prior to distribution).

The goal is to strip the CSV file of NULL and .000 values:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0

Excel will open this file and format it properly without further technical assistance.

To that end, I wrote:

Function Invoke-CsvCleanser {

  [CmdletBinding()]
  Param(
    [parameter(Mandatory=$true)]
    [String]
    $Path,
    [switch]
    $Nulls,
    [switch]
    $Milliseconds
  )

  PROCESS {

    # open the file
    $data = Import-Csv $path

    # process each row
    $data | Foreach-Object { 

        # process each column
        Foreach ($property in $_.PSObject.Properties) {

            # if column contains 'NULL', replace it with ''
            if ($Nulls -and ($property.Value -eq 'NULL')) {
                $property.Value = $property.Value -replace 'NULL', ''
            }

            # if column contains a date/time value, remove milliseconds
            elseif ( $Milliseconds -and (isDate($property.Value)) ) {
                $property.Value = $property.Value -replace '.000', ''    
            }
        } 

    } 

    # save file
    $data | Export-Csv -Path $Path -NoTypeInformation

  }

}

function IsDate($object) {
    [Boolean]($object -as [DateTime])
}

PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop\0000.csv' -Nulls -Milliseconds

This works fine when the file size is small, but is quite inefficient for large files. Ideally, Invoke-CsvCleanser would make use of the pipeline.

Is there a better way to do this?

回答1:

Import-CSV always loads entire file in memory, so it's slow. Here is modified script from my answer to this question: CSV formatting - strip qualifier from specific fields.

It uses raw file processing, so it should be significantly faster. NULLs and milliseconds are matched\replaced using regex. Script is able to mass-convert CSV's.

Regex to split CSV is from this question: How to split a string by comma ignoring comma in double quotes

Save this script as Invoke-CsvCleanser.ps1. It accepts following arguments:

InPath: folder to read CSVs from. If not specified, the current directory is used.
OutPath: folder to save processed CSVs to. Will be created, if not exist.
Encoding: If not specified, script will use system's current ANSI code page to read the files. You can get other valid encodings for your system in PowerShell console like this: [System.Text.Encoding]::GetEncodings()
DoubleQuotes: switch, if specified, surrounding double quotes will be stripped from values
Nulls: switch, if specified, NULL strings will be stripped from values
Milliseconds: switch, if specified, .000 strings will be stripped from values
Verbose: script will tell you what's going on via Write-Verbose messages.

Example:

Process all CSVs in the folder C:\CSVs_are_here, strip NULLs and milliseconds, save processed CSVs to the folder C:\Processed_CSVs, be verbose:

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose

Invoke-CsvCleanser.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [switch]$Nulls,

    [switch]$Milliseconds,

    [switch]$DoubleQuotes
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regef to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {

                            # Strip surrounding quotes
                            if($DoubleQuotes)
                            {
                                $_ = $_.Trim($DQuotes)
                            }

                            # Strip NULL strings
                            if($Nulls)
                            {
                                $_ = $_ -replace $NullRegex, ''
                            }

                            # Strip milliseconds
                            if($Milliseconds)
                            {
                                $_ = $_ -replace $MillisecondsRegex, '$1'
                            }

                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }

Result:

DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0

It would be interesting to see if one could pass lambdas as a way to make the file processing more flexible. Each lambda would perform a specific activity (removing NULLs, upper-casing, normalizing text, etc.)

This version gives full control over CSV processing. Just pass a scriptblock(s) to the Action parameter in the order you want them to execute.

Example: strip NULLs, strip milliseconds and then strip double quotes.

.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}

Invoke-CsvCleanser.ps1 with "lambdas":

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default',

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]    
    [scriptblock[]]$Action
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp =  $line -split $SplitRegex |
                        ForEach-Object {
                            # Process each item
                            foreach($scriptblock in $Action) {
                                . $scriptblock
                            }
                            # Output current object to pipeline
                            $_
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }

来源：https://stackoverflow.com/questions/30399462/more-efficient-way-to-modify-a-csv-files-content

标签

powershell

csv

powershell-3.0

ssms-2012