Question
Background
I'm hoping to write code which uses Microsoft.VisualBasic.FileIO.TextFieldParser to parse some CSV data.
The system I'm generating this data for doesn't understand quotes, so I can't escape the delimiter; instead I have to replace it.
I've found a solution using the above text parser, but I've only seen people use it with input from files. Rather than writing my data to a file only to import it again, I'd rather keep things in memory and make use of this class's constructor, which accepts a stream as input.
Ideally it would take a feed directly from whatever stream is used for the pipeline, but I couldn't work out how to access that. In my current code I create my own memory stream, feed it data from the pipeline, and then attempt to read from that. Unfortunately I'm missing something.
Questions
- How do you read from / write to memory streams in PowerShell?
- Is it possible to read directly from the stream which is being fed into the function's pipeline?
Code
clear-host
[Reflection.Assembly]::LoadWithPartialName("System.IO") | out-null
#[Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic") | out-null
function Clean-CsvStream {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline=$true)]
[string]$Line
,
[Parameter(Mandatory = $false)]
[char]$Delimiter = ','
)
begin {
[System.IO.MemoryStream]$memStream = New-Object System.IO.MemoryStream
[System.IO.StreamWriter]$writeStream = New-Object System.IO.StreamWriter($memStream)
[System.IO.StreamReader]$readStream = New-Object System.IO.StreamReader($memStream)
#[Microsoft.VisualBasic.FileIO.TextFieldParser]$Parser = new-object Microsoft.VisualBasic.FileIO.TextFieldParser($memStream)
#$Parser.SetDelimiters($Delimiter)
#$Parser.HasFieldsEnclosedInQuotes = $true
#$writeStream.AutoFlush = $true
}
process {
$writeStream.WriteLine($_)
#$writeStream.Flush() #maybe we need to flush it before the reader will see it?
write-output $readStream.ReadLine()
#("Line: {0:000}" -f $Parser.LineNumber)
#write-output $Parser.ReadFields()
}
end {
#close streams and dispose (dodgy catch-alls in case an object is disposed before we call Dispose)
#try {$Parser.Close(); $Parser.Dispose()} catch{}
try {$readStream.Close(); $readStream.Dispose()} catch{}
try {$writeStream.Close(); $writeStream.Dispose()} catch{}
try {$memStream.Close(); $memStream.Dispose()} catch{}
}
}
1,2,3,4 | Clean-CsvStream -Delimiter ';' #nothing like the real data, but I'm not interested in actual CSV cleansing at this point
Workaround
In the meantime my solution is just to do the replace on the objects' properties rather than on the CSV rows.
$cols = $objectArray | Get-Member | ?{$_.MemberType -eq 'NoteProperty'} | select -ExpandProperty name
$objectArray | %{$csvRow = $_; ($cols | %{($csvRow.$_ -replace "[`n,]",':')}) -join ',' }
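For example (a hypothetical illustration with a made-up $objectArray; note that Get-Member returns the NoteProperty names in alphabetical order, so the output column order may not match the original objects):
$objectArray = @(
(new-object -TypeName PSCustomObject -Property @{A="plain text";B="has a,comma";C="has a`nline break"})
)
#feeding this through the two lines above emits: plain text,has a:comma,has a:line break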
Update
I realised the missing code was $memStream.Seek(0, [System.IO.SeekOrigin]::Begin) | out-null;
However this isn't behaving entirely as expected; i.e. the first row of my CSV is showing twice, and other output is in the wrong order, so presumably I've misunderstood how to use Seek.
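My best guess at what goes wrong here (an assumption on my part, based on the writer, the reader and the parser all sharing the MemoryStream's single position):
#row 1: write "r1" at position 0, seek to 0, read -> "r1" comes out, and the position ends up after "r1"
#row 2: write "r2" after "r1", seek to 0, read -> the read starts from the top again, so "r1" comes out a second time
#from then on the writer and the parser are fighting over the same position, so later rows come out in the wrong order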
clear-host
[Reflection.Assembly]::LoadWithPartialName("System.IO") | out-null
[Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic") | out-null
function Clean-CsvStream {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline=$true)]
[string]$CsvRow
,
[Parameter(Mandatory = $false)]
[char]$Delimiter = ','
,
[Parameter(Mandatory = $false)]
[regex]$InvalidCharRegex
,
[Parameter(Mandatory = $false)]
[string]$ReplacementString
)
begin {
[System.IO.MemoryStream]$memStream = New-Object System.IO.MemoryStream
[System.IO.StreamWriter]$writeStream = New-Object System.IO.StreamWriter($memStream)
[Microsoft.VisualBasic.FileIO.TextFieldParser]$Parser = new-object Microsoft.VisualBasic.FileIO.TextFieldParser($memStream)
$Parser.SetDelimiters($Delimiter)
$Parser.HasFieldsEnclosedInQuotes = $true
$writeStream.AutoFlush = $true
}
process {
if ($InvalidCharRegex) {
$writeStream.WriteLine($CsvRow)
#flush here if not auto
$memStream.Seek(0, [System.IO.SeekOrigin]::Begin) | out-null;
write-output (($Parser.ReadFields() | %{$_ -replace $InvalidCharRegex,$ReplacementString }) -join $Delimiter)
} else { #if we're not replacing anything, keep it simple
$CsvRow
}
}
end {
"end {"
try {$Parser.Close(); $Parser.Dispose()} catch{}
try {$writeStream.Close(); $writeStream.Dispose()} catch{}
try {$memStream.Close(); $memStream.Dispose()} catch{}
"} #end"
}
}
$csv = @(
(new-object -TypeName PSCustomObject -Property @{A="this is regular text";B="nothing to see here";C="all should be good"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text2";B="what the`nLine break!";C="all should be good2"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text3";B="ooh`r`nwindows line break!";C="all should be good3"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text4";B="I've got;a semi";C="all should be good4"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text5";B="""You're Joking!"" said the Developer`r`n""No honestly; it's all about the secret VB library"" responded the Google search result";C="all should be good5"})
) | convertto-csv -Delimiter ';' -NoTypeInformation
$csv | Clean-CsvStream -Delimiter ';' -InvalidCharRegex "[`r`n;]" -ReplacementString ':'
Answer 1:
After a lot of playing around it seems this works:
clear-host
[Reflection.Assembly]::LoadWithPartialName("System.IO") | out-null
[Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic") | out-null
function Clean-CsvStream {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline=$true)]
[string]$CsvRow
,
[Parameter(Mandatory = $false)]
[char]$Delimiter = ','
,
[Parameter(Mandatory = $false)]
[regex]$InvalidCharRegex
,
[Parameter(Mandatory = $false)]
[string]$ReplacementString
)
begin {
[bool]$IsSimple = [string]::IsNullOrEmpty($InvalidCharRegex)
if(-not $IsSimple) {
[System.IO.MemoryStream]$memStream = New-Object System.IO.MemoryStream
[System.IO.StreamWriter]$writeStream = New-Object System.IO.StreamWriter($memStream)
[Microsoft.VisualBasic.FileIO.TextFieldParser]$Parser = new-object Microsoft.VisualBasic.FileIO.TextFieldParser($memStream)
$Parser.SetDelimiters($Delimiter)
$Parser.HasFieldsEnclosedInQuotes = $true
}
}
process {
if ($IsSimple) { #if we're not replacing anything, keep it simple
$CsvRow
} else {
[long]$seekStart = $memStream.Seek(0, [System.IO.SeekOrigin]::Current) #note where this row will start
$writeStream.WriteLine($CsvRow)
$writeStream.Flush() #push the buffered text into the underlying MemoryStream
$memStream.Seek($seekStart, [System.IO.SeekOrigin]::Begin) | out-null #seek back to the start of the row just written
write-output (($Parser.ReadFields() | %{$_ -replace $InvalidCharRegex,$ReplacementString }) -join $Delimiter)
}
}
end {
if(-not $IsSimple) {
try {$Parser.Close(); $Parser.Dispose()} catch{}
try {$writeStream.Close(); $writeStream.Dispose()} catch{}
try {$memStream.Close(); $memStream.Dispose()} catch{}
}
}
}
$csv = @(
(new-object -TypeName PSCustomObject -Property @{A="this is regular text";B="nothing to see here";C="all should be good"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text2";B="what the`nLine break!";C="all should be good2"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text3";B="ooh`r`nwindows line break!";C="all should be good3"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text4";B="I've got;a semi";C="all should be good4"})
,(new-object -TypeName PSCustomObject -Property @{A="this is regular text5";B="""You're Joking!"" said the Developer`r`n""No honestly; it's all about the secret VB library"" responded the Google search result";C="all should be good5"})
) | convertto-csv -Delimiter ';' -NoTypeInformation
$csv | Clean-CsvStream -Delimiter ';' -InvalidCharRegex "[`r`n;]" -ReplacementString ':'
i.e.
- seek the current position prior to writing
- then write
- then flush (if not auto)
- then seek the start of the data
- then read
- repeat
I'm not certain this is correct though, as I can't find any good examples or docs explaining it; I just played about until something worked that vaguely made sense (a stripped-down sketch of the cycle follows below).
I'm also still interested if anyone knows how to read directly from the pipeline stream, i.e. to remove the additional overhead of the bonus streams.
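To illustrate the seek / write / flush / seek / read cycle from the list above in isolation, here's a stripped-down sketch (my own simplification: it uses a plain StreamReader in place of the TextFieldParser, and hard-coded sample rows):
$mem    = New-Object System.IO.MemoryStream
$writer = New-Object System.IO.StreamWriter($mem)
$reader = New-Object System.IO.StreamReader($mem)
foreach ($row in 'a;b','c;d','e;f') {
    [long]$start = $mem.Seek(0, [System.IO.SeekOrigin]::Current) #1. note where this row will start
    $writer.WriteLine($row)                                      #2. write it
    $writer.Flush()                                              #3. flush, so the text actually reaches the MemoryStream
    $mem.Seek($start, [System.IO.SeekOrigin]::Begin) | out-null  #4. seek back to the start of that row
    $reader.DiscardBufferedData()                                #   make the reader respect the new position
    $reader.ReadLine()                                           #5. read the row back (emits a;b, c;d, e;f to the pipeline)
}
$writer.Dispose(); $reader.Dispose(); $mem.Dispose()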
For @M.R.'s comment
Sorry this is so late, but in case it's of use to others:
If the end of line delimiter is CrLf (\r\n) rather than just Cr (\r) then it's easy to disambiguate between the end of record/line and the line breaks within a field:
Get-Content -LiteralPath 'D:\test\file to clean.csv' -Delimiter "`r`n" |
%{$_.ToString().TrimEnd("`r`n")} | #the delimiter is left on the end of the string; remove it
%{('"{0}"' -f $_) -replace '\|','"|"'} | #insert quotes at the start and end of the line, as well as around delimiters
ConvertFrom-Csv -Delimiter '|' #treat the pipeline content as a valid pipe-delimited csv
However, if not, you'll have no way of telling which Cr is the end of a record and which is just a break in the text. You can partially get around this by counting the number of pipes: i.e. if you have 5 columns, any CRs before the fourth delimiter must be line breaks rather than the end of the record (a rough sketch of this idea follows after this paragraph). However, if there's a further line break after that, you can't be sure whether it's a break in the last column's data or the end of that row. If you know that either the first or the last column does not contain line breaks (or both), you can work around that. For all the more complex scenarios I suspect a regex would be the best option, applied with something like select-string. If this is required, post it as a question on here giving your exact requirements and info on what you've attempted already, and others can help you out.
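For example, a rough sketch of that delimiter-counting idea (my own assumptions: 5 pipe-delimited columns, fields never contain a literal |, the final column contains no line breaks, and the path is just a placeholder):
$buffer = ''
Get-Content -LiteralPath 'D:\test\file to clean.csv' | %{
    #re-join fragments of a record that were split at an embedded line break, putting ':' where the break was
    $buffer = if ($buffer) { "{0}:{1}" -f $buffer, $_ } else { $_ }
    #a complete record has 4 pipes, i.e. it splits into 5 fields
    if (($buffer -split '\|').Count -ge 5) {
        $buffer
        $buffer = ''
    }
} | ConvertFrom-Csv -Delimiter '|'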
Source: https://stackoverflow.com/questions/32016054/reading-from-the-pipeline-stream-in-powershell