How to redirect input in PowerShell without a BOM?

Question

I am trying to redirect input in PowerShell with:

Get-Content input.txt | my-program args

The problem is that the piped UTF-8 text is preceded by a BOM (0xefbbbf), and my program cannot handle that correctly.

A minimal working example:

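// File: Hex.java
// Prints every byte read from standard input as two uppercase hex digits.
import java.io.IOException;

public class Hex {
    public static void main(String[] dummy) {
        int ch;
        try {
            while ((ch = System.in.read()) != -1) {
                System.out.printf("%02X ", ch);
            }
        } catch (IOException e) {
            // Report read failures instead of silently swallowing them.
            System.err.println("Failed to read stdin: " + e.getMessage());
        }
    }
}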

Then in PowerShell:

javac Hex.java
Set-Content textfile "ABC" -Encoding Ascii
# Now the content of textfile is 0x41 42 43 0D 0A
Get-Content textfile | java Hex

or simply

javac Hex.java
Write-Output "ABC" | java Hex

In either case the output is EF BB BF 41 42 43 0D 0A

How can I pipe the text into the program without 0xefbbbf?

Answer 1:

Note: The following contains general information that, in a normally functioning PowerShell environment, would explain the OP's symptom. That the solution doesn't work in the OP's case is owed to machine-specific causes that are unknown at this point.

To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding to a System.Text.UTF8Encoding instance that does not emit a BOM:

# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)

Caveat: Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false, because, due to the bug described in this GitHub issue, it won't work if you assign to $OutputEncoding in a non-global scope, such as in a script.

If, by contrast, you use [Text.Encoding]::Utf8 (System.Text.Encoding.UTF8), you will get a BOM - which is what I suspect happened in your case.

Note that this problem is unrelated to the source encoding of any file read by Get-Content, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects, which in the case of Get-Content means that .NET strings are sent (System.String, internally a sequence of UTF-16 code units).

Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on preference variable $OutputEncoding, and the resulting encoding is what the external program receives.

Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell respects the BOM setting of the encoding assigned to $OutputEncoding also in the pipeline, prepending it to the first line sent (only).
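To see where the three extra bytes come from: the UTF-8 BOM is simply the character U+FEFF encoded as UTF-8. The following Java sketch only mimics the encoding step and is not how PowerShell is implemented; the class name BomBytes and its helper methods are made up for illustration. It encodes the same text with and without a leading U+FEFF and prints the bytes in the same format as Hex:

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    // Formats bytes like Hex.java does: two uppercase hex digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encodes text as UTF-8, optionally preceded by the BOM character U+FEFF.
    static byte[] encode(String text, boolean withBom) {
        String payload = withBom ? "\uFEFF" + text : text;
        return payload.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String line = "h\u00F6\r\n"; // "hö" plus the CRLF that PowerShell appends
        System.out.println(hex(encode(line, true)));  // EF BB BF 68 C3 B6 0D 0A
        System.out.println(hex(encode(line, false))); // 68 C3 B6 0D 0A
    }
}
```

The first line corresponds to an encoding that emits a BOM, the second to a BOM-less one.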

See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding that matters when PowerShell interprets data received from external programs.

To illustrate the difference using your sample program (note how using a PowerShell string literal as input is sufficient; no need to read from a file):

# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A

# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A

In Windows PowerShell, where $OutputEncoding defaults to ASCII(!), you'd see the following with the default in place:

# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex 
68 3F 0D 0A

Note that 3F represents the literal ? character, which is what the non-ASCII ö character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
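The same lossy substitution can be reproduced outside PowerShell. As a rough illustration (the class name AsciiLoss and its helpers are hypothetical), Java's String.getBytes replaces characters that the target charset cannot represent with that charset's replacement byte, which for US-ASCII is ?:

```java
import java.nio.charset.StandardCharsets;

class AsciiLoss {
    // Formats bytes like Hex.java does: two uppercase hex digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encodes text as US-ASCII; unmappable characters become '?' (0x3F).
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] ascii = encodeAscii("h\u00F6\r\n"); // "hö" followed by CRLF
        System.out.println(hex(ascii)); // 68 3F 0D 0A
        // Decoding cannot bring the original character back.
        System.out.print(new String(ascii, StandardCharsets.US_ASCII)); // h?
    }
}
```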

PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, also for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding still reflects the active OEM code page by default as of v7.0, so to correctly capture output from UTF-8-emitting external programs, it must be set to [Text.UTF8Encoding]::new($false) as well - see this GitHub issue.

Answer 2:

You could try setting the OutputEncoding to UTF8 without BOM:

# keep the current output encoding in a variable
$oldEncoding = [console]::OutputEncoding
# set the output encoding to use UTF8 without BOM
[console]::OutputEncoding = New-Object System.Text.UTF8Encoding $false

Get-Content input.txt | my-program args

# reset the output encoding to the previous
[console]::OutputEncoding = $oldEncoding

If the above has no effect and your program does understand UTF8, but only expects it to be without the 3-byte BOM, then you can try removing the BOM from the content and piping the result to your program:

(Get-Content 'input.txt' -Raw -Encoding UTF8) -replace '^\uFEFF' | my-program args
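If changing the PowerShell side is not an option and you control the receiving program, another possibility is to make the program itself tolerate a leading BOM. The following is a minimal sketch, assuming a Java program that reads raw bytes from stdin as in the question and Java 9+ (for readNBytes); the class name BomSkipper and its methods are hypothetical helpers, not part of any library:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class BomSkipper {
    // Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if present.
    static InputStream skipUtf8Bom(InputStream in) {
        try {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pin.readNBytes(head, 0, 3); // may be fewer than 3 at end of input
            boolean isBom = n == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!isBom && n > 0) {
                pin.unread(head, 0, n); // not a BOM: put the bytes back
            }
            return pin;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole stream and formats it like Hex.java does.
    static String hexDump(InputStream in) {
        try {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append(String.format("%02X ", ch));
            }
            return sb.toString().trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hexDump(skipUtf8Bom(System.in)));
    }
}
```

With this in place, piping input that starts with EF BB BF into java BomSkipper should print only the bytes that follow the BOM, while input without a BOM passes through unchanged.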

Edit

If you have ever 'hacked' the code page with chcp 65001, I recommend changing it back to chcp 5129 for English - New Zealand. See here

Source: https://stackoverflow.com/questions/60124466/how-to-redirect-input-in-powershell-without-bom

Tags

powershell

encoding

pipe

byte-order-mark