问题
I'm attempting to remove some of the detritus that SSMS 2012 generates when a query's results are exported as CSV.
For example, it includes the word 'NULL' for null
values and adds milliseconds to datetime
values:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00.000,LOREM IPSUM,10.3456
NULL,NULL,NULL,0
Unfortunately, Excel doesn't automatically format datetime
values with fractional seconds correctly, which lead to confusion amongst the customers ('What happened to the date field that I requested?') and more work for me (having to convert the CSV to XLSX and format the columns correctly prior to distribution).
The goal is to strip the CSV file of NULL
and .000
values:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
Excel will open this file and format it properly without further technical assistance.
To that end, I wrote:
Function Invoke-CsvCleanser {
[CmdletBinding()]
Param(
[parameter(Mandatory=$true)]
[String]
$Path,
[switch]
$Nulls,
[switch]
$Milliseconds
)
PROCESS {
# open the file
$data = Import-Csv $path
# process each row
$data | Foreach-Object {
# process each column
Foreach ($property in $_.PSObject.Properties) {
# if column contains 'NULL', replace it with ''
if ($Nulls -and ($property.Value -eq 'NULL')) {
$property.Value = $property.Value -replace 'NULL', ''
}
# if column contains a date/time value, remove milliseconds
elseif ( $Milliseconds -and (isDate($property.Value)) ) {
$property.Value = $property.Value -replace '.000', ''
}
}
}
# save file
$data | Export-Csv -Path $Path -NoTypeInformation
}
}
function IsDate($object) {
[Boolean]($object -as [DateTime])
}
PS> Invoke-CsvCleanser 'C:\Users\Foobar\Desktop\0000.csv' -Nulls -Milliseconds
This works fine when the file size is small, but is quite inefficient for large files. Ideally, Invoke-CsvCleanser
would make use of the pipeline.
Is there a better way to do this?
回答1:
Import-CSV
always loads entire file in memory, so it's slow. Here is modified script from my answer to this question: CSV formatting - strip qualifier from specific fields.
It uses raw file processing, so it should be significantly faster. NULL
s and milliseconds are matched\replaced using regex. Script is able to mass-convert CSV's.
Regex to split CSV is from this question: How to split a string by comma ignoring comma in double quotes
Save this script as Invoke-CsvCleanser.ps1
. It accepts following arguments:
- InPath: folder to read CSVs from. If not specified, the current directory is used.
- OutPath: folder to save processed CSVs to. Will be created, if not exist.
- Encoding: If not specified, script will use system's current ANSI code page to read the files. You can get other valid encodings for your system in PowerShell console like this:
[System.Text.Encoding]::GetEncodings()
- DoubleQuotes: switch, if specified, surrounding double quotes will be stripped from values
- Nulls: switch, if specified,
NULL
strings will be stripped from values - Milliseconds: switch, if specified,
.000
strings will be stripped from values - Verbose: script will tell you what's going on via
Write-Verbose
messages.
Example:
Process all CSVs in the folder C:\CSVs_are_here
, strip NULLs and milliseconds, save processed CSVs to the folder C:\Processed_CSVs
, be verbose:
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Nulls -Milliseconds -Verbose
Invoke-CsvCleanser.ps1
script:
Param
(
[Parameter(ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
throw "Input folder doesn't exist: $_"
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$InPath = (Get-Location -PSProvider FileSystem).Path,
[Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
try
{
New-Item -ItemType Directory -Path $_ -Force
}
catch
{
throw "Can't create output folder: $_"
}
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$OutPath,
[Parameter(ValueFromPipelineByPropertyName = $true)]
[string]$Encoding = 'Default',
[switch]$Nulls,
[switch]$Milliseconds,
[switch]$DoubleQuotes
)
if($Encoding -eq 'Default')
{
# Set default encoding
$FileEncoding = [System.Text.Encoding]::Default
}
else
{
# Try to set user-specified encoding
try
{
$FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
}
catch
{
throw "Not valid encoding: $Encoding"
}
}
$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Regef to match NULL
$NullRegex = '^NULL$'
# Regex to match milliseconds: 23:00:00.000
$MillisecondsRegex = '(\d{2}:\d{2}:\d{2})(\.\d{3})'
Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"
# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
ForEach-Object {
Write-Verbose "Current file: $($_.FullName)"
$InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
$_.FullName,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamReader'
$OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
(Join-Path -Path $OutPath -ChildPath $_.Name),
$false,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamWriter'
Write-Verbose 'Processing file...'
while(($line = $InFile.ReadLine()) -ne $null)
{
$tmp = $line -split $SplitRegex |
ForEach-Object {
# Strip surrounding quotes
if($DoubleQuotes)
{
$_ = $_.Trim($DQuotes)
}
# Strip NULL strings
if($Nulls)
{
$_ = $_ -replace $NullRegex, ''
}
# Strip milliseconds
if($Milliseconds)
{
$_ = $_ -replace $MillisecondsRegex, '$1'
}
# Output current object to pipeline
$_
}
# Write line to the new CSV file
$OutFile.WriteLine($tmp -join $Separator)
}
Write-Verbose "Finished processing file: $($_.FullName)"
Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
# Close open files and cleanup objects
$OutFile.Flush()
$OutFile.Close()
$OutFile.Dispose()
$InFile.Close()
$InFile.Dispose()
}
Result:
DATE_COLUMN,DATETIME_COLUMN,TEXT_COLUMN,NUMBER_COLUMN
2015-05-01,2015-05-01 23:00:00,LOREM IPSUM,10.3456
,,,0
It would be interesting to see if one could pass lambdas as a way to make the file processing more flexible. Each lambda would perform a specific activity (removing NULLs, upper-casing, normalizing text, etc.)
This version gives full control over CSV processing. Just pass a scriptblock(s) to the Action
parameter in the order you want them to execute.
Example: strip NULL
s, strip milliseconds and then strip double quotes.
.\Invoke-CsvCleanser.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Action {$_ = $_ -replace '^NULL$', '' }, {$_ = $_ -replace '(\d{2}:\d{2}:\d{2})(\.\d{3})', '$1'}, {$_ = $_.Trim('"')}
Invoke-CsvCleanser.ps1
with "lambdas":
Param
(
[Parameter(ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
throw "Input folder doesn't exist: $_"
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$InPath = (Get-Location -PSProvider FileSystem).Path,
[Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
[ValidateScript({
if(!(Test-Path -LiteralPath $_ -PathType Container))
{
try
{
New-Item -ItemType Directory -Path $_ -Force
}
catch
{
throw "Can't create output folder: $_"
}
}
$true
})]
[ValidateNotNullOrEmpty()]
[string]$OutPath,
[Parameter(ValueFromPipelineByPropertyName = $true)]
[string]$Encoding = 'Default',
[Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
[scriptblock[]]$Action
)
if($Encoding -eq 'Default')
{
# Set default encoding
$FileEncoding = [System.Text.Encoding]::Default
}
else
{
# Try to set user-specified encoding
try
{
$FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
}
catch
{
throw "Not valid encoding: $Encoding"
}
}
$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"
# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
ForEach-Object {
Write-Verbose "Current file: $($_.FullName)"
$InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
$_.FullName,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamReader'
$OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
(Join-Path -Path $OutPath -ChildPath $_.Name),
$false,
$FileEncoding
) -ErrorAction Stop
Write-Verbose 'Created new StreamWriter'
Write-Verbose 'Processing file...'
while(($line = $InFile.ReadLine()) -ne $null)
{
$tmp = $line -split $SplitRegex |
ForEach-Object {
# Process each item
foreach($scriptblock in $Action) {
. $scriptblock
}
# Output current object to pipeline
$_
}
# Write line to the new CSV file
$OutFile.WriteLine($tmp -join $Separator)
}
Write-Verbose "Finished processing file: $($_.FullName)"
Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"
# Close open files and cleanup objects
$OutFile.Flush()
$OutFile.Close()
$OutFile.Dispose()
$InFile.Close()
$InFile.Dispose()
}
来源:https://stackoverflow.com/questions/30399462/more-efficient-way-to-modify-a-csv-files-content