Different behaviour and output when piping through CMD and PowerShell

不羁的心 posted on 2020-02-02 13:16:43

Question


I am trying to pipe the content of a file to a simple symmetric ASCII encryption program I wrote. It reads its input from STDIN and adds a fixed value (224) to, or subtracts it from, each byte of the input. For example, if the first byte is 4 and we want to encrypt, it becomes 228. If the result exceeds 255, the program wraps it around with a modulo operation.
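As a rough illustration of the scheme described above (not the actual source of Crypt.exe, which is not shown), the byte shift can be sketched in Python. The function name `shift` and the modulus 256 are assumptions based on the description of single-byte values.

```python
KEY = 224  # shift value named in the question


def shift(data: bytes, encrypt: bool) -> bytes:
    """Add (encrypt) or subtract (decrypt) KEY from every byte, wrapping modulo 256."""
    delta = KEY if encrypt else -KEY
    return bytes((b + delta) % 256 for b in data)


plain = b"this is a test"
encrypt_first = shift(shift(plain, True), False)   # encrypt, then decrypt
decrypt_first = shift(shift(plain, False), True)   # decrypt, then encrypt
print(encrypt_first, decrypt_first)
```

Both orders return the original bytes because adding and subtracting the same value modulo 256 cancel out.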

This is the output I get with cmd (test.txt contains "this is a test"):

    type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
    this is a test

It also works the other way around, which is what makes the algorithm symmetrical:

    type .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt
    this is a test

But the behaviour in PowerShell is different. When encrypting first, I get:

    type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
    this is a test_*

And that is what I get when decrypting first:

Maybe it is an encoding problem. Thanks in advance.


Answer 1:


tl;dr:

If you need raw byte handling and/or need to prevent PowerShell from situationally adding a trailing newline to your text data, avoid the PowerShell pipeline altogether.
Instead, shell out to cmd with /c:

    cmd /c 'type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt'

Note that if you want to capture the output in a PowerShell variable, you need to make sure that [Console]::OutputEncoding matches your .\Crypt.exe program's (effective) output encoding (the active OEM code page), which should be true by default in this case; see the next section for details.

Generally, however, byte manipulation of text data is best avoided.


There are two separate problems, only one of which has a simple solution:


Problem 1: There is indeed an encoding problem, as you suspected:

PowerShell invisibly inserts itself as an intermediary in pipelines, even when sending data to and receiving data from external programs: It converts data from and to .NET strings (System.String), which are sequences of UTF-16 code units.

In order to send to and receive data from external programs, you need to match their character encoding; in your case, with a Windows console application that uses raw byte handling, the implied encoding is the system's active OEM code page.

  • On sending data, PowerShell uses the encoding of the $OutputEncoding preference variable to encode (what is invariably treated as text) data, which defaults to ASCII(!) in Windows PowerShell, and UTF-8 in PowerShell [Core].

  • The receiving end is covered by default: PowerShell uses [Console]::OutputEncoding (which itself reflects the code page reported by chcp) for decoding data received, and on Windows this by default reflects the active OEM code page, both in Windows PowerShell and PowerShell [Core][1].

To fix your primary problem, you therefore need to set $OutputEncoding to the active OEM code page:

    # Make sure that PowerShell uses the OEM code page when sending
    # data to `.\Crypt.exe`
    $OutputEncoding = [Console]::OutputEncoding
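To see why the sending encoding matters, here is a minimal Python model of the mismatch. It assumes the active OEM code page is CP437 and that an ASCII encoder replaces unrepresentable characters with `?`; both are assumptions of this sketch, not something PowerShell-specific that the code verifies.

```python
# CR (13) and LF (10), each shifted by +224, as an encrypting stage would emit them.
encrypted_newline = bytes([237, 234])

# Decoding with the OEM code page (CP437 assumed) yields printable characters.
text = encrypted_newline.decode("cp437")

# Re-encoding as ASCII cannot represent them, so the byte values are lost.
as_ascii = text.encode("ascii", errors="replace")

# Re-encoding with the same OEM code page restores the original bytes.
as_oem = text.encode("cp437")

print(text, as_ascii, as_oem)
```

Only the matching encoding round-trips the bytes, which is the effect that setting `$OutputEncoding` to `[Console]::OutputEncoding` is meant to achieve.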

Problem 2: PowerShell invariably appends a trailing newline to data that doesn't already have one when piping data to external programs:

That is, "foo" | .\Crypt.exe doesn't send (the $OutputEncoding-encoded bytes representing) "foo" to .\Crypt.exe's stdin, it sends "foo`r`n" on Windows; i.e., a (platform-appropriate) newline sequence (CRLF on Windows) is automatically and invariably appended (unless the string already happens to have a trailing newline).

This problematic behavior is discussed in this GitHub issue and also in this answer.

In your specific case, the implicitly appended "`r`n" is also subject to the byte-value shifting, which means that the 1st Crypt.exe call transforms it to -*, causing another "`r`n" to be appended when the data is sent to the 2nd Crypt.exe call.

The net result is an extra newline that is round-tripped (the intermediate -*), plus an encrypted newline that shows up as φΩ.
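The arithmetic behind those four extra characters can be traced with a small Python model of the decrypt-then-encrypt pipeline. The helper `send_via_powershell` is hypothetical and only models the newline-appending behavior described above; the character-encoding layer is deliberately left out, and CP437 is assumed for the final display.

```python
KEY = 224
NEWLINE = b"\r\n"


def shift(data: bytes, delta: int) -> bytes:
    """Shift every byte by delta, wrapping modulo 256."""
    return bytes((b + delta) % 256 for b in data)


def send_via_powershell(data: bytes) -> bytes:
    """Model: append CRLF unless the data already ends with one."""
    return data if data.endswith(NEWLINE) else data + NEWLINE


plain = b"this is a test"                      # file content, no trailing newline
stage1_out = shift(send_via_powershell(plain), -KEY)       # 1st call: --decrypt
stage2_out = shift(send_via_powershell(stage1_out), +KEY)  # 2nd call: --encrypt

extra = stage2_out[len(plain):]
print(stage1_out[-2:], extra, extra.decode("cp437").encode("unicode_escape"))
```

In this model the first stage ends in `-*` (13 - 224 and 10 - 224, modulo 256, give 45 and 42), and the final output carries the round-tripped CRLF followed by bytes 237 and 234, which CP437 renders as φΩ.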


In short: If your input data had no trailing newline, you'll have to cut off the last 4 characters from the result (representing the round-tripped and the inadvertently encrypted newline sequences):

    # Ensure that data sent to .\Crypt.exe is encoded with the OEM code page.
    $OutputEncoding = [Console]::OutputEncoding

    # Invoke the command and capture its output in variable $result.
    # Note the use of the `Get-Content` cmdlet; in PowerShell, `type`
    # is simply a built-in *alias* for it.
    $result = Get-Content .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt

    # Remove the last 4 chars. and print the result.
    $result.Substring(0, $result.Length - 4)

Given that calling cmd /c as shown at the top of the answer works too, that hardly seems worth it.


How PowerShell handles pipeline data with external programs:

Unlike cmd (or POSIX-like shells such as bash):

  • PowerShell doesn't support raw byte data in pipelines.[2]
  • When talking to external programs, it only knows text (whereas it passes .NET objects when talking to PowerShell's own commands, which is where much of its power comes from).

Specifically, this works as follows:

  • When you send data to an external program via the pipeline (to its stdin stream):

    • It is converted to text (strings) using the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, and (BOM-less) UTF-8 in PowerShell [Core].

      • If the data is not captured or redirected by PowerShell, encoding problems may not always become apparent, namely if an external program is implemented in a way that uses the Windows Unicode console API to print to the display.
    • Something that isn't already text (a string) is stringified using PowerShell's default output formatting (the same format you see when you print to the console), with an important caveat:

      • If the (last) input object already is a string that doesn't itself have a trailing newline, one is invariably appended (and even an existing trailing newline is replaced with the platform-native one, if different).
      • This behavior can cause problems, as discussed in this GitHub issue and also in this answer.
  • When you capture / redirect data from an external program (from its stdout stream), it is invariably decoded as lines of text (strings), based on the encoding specified in [Console]::OutputEncoding, which defaults to the active OEM code page on Windows (surprisingly, in both PowerShell editions, as of v7.0-preview6[1]).

  • PowerShell-internally text is represented using the .NET System.String type, which is based on UTF-16 code units (often loosely, but incorrectly called "Unicode"[3]).

The above also applies:

  • when piping data between external programs,

  • when data is redirected to a file; that is, irrespective of the source of the data and its original character encoding, PowerShell uses its default encoding(s) when sending data to files; in Windows PowerShell, > produces UTF-16LE-encoded files (with BOM), whereas PowerShell [Core] sensibly defaults to BOM-less UTF-8 (consistently, across file-writing cmdlets).
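The difference between the two default file encodings is easy to see at the byte level. The following Python sketch only models the encodings named above (it does not invoke PowerShell); the sample text is an arbitrary choice.

```python
import codecs

text = "foo\r\n"

# Windows PowerShell `>`: UTF-16LE preceded by a byte-order mark.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# PowerShell [Core] `>`: UTF-8 without a byte-order mark.
utf8_bytes = text.encode("utf-8")

print(utf16_bytes.hex(" "))
print(utf8_bytes.hex(" "))
```

A program that expects single-byte input would see the BOM and the interleaved zero bytes of the UTF-16LE variant as data, which is why such redirections can break byte-oriented tools.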

Adding support for raw data passing between external programs and to-file redirections is the subject of this GitHub issue.


[1] In PowerShell [Core], given that $OutputEncoding commendably already defaults to UTF-8, it would make sense to have [Console]::OutputEncoding be the same - i.e., for the active code page to be effectively 65001 on Windows, as suggested in this GitHub issue.

[2] With input from a file, the closest you can get to raw byte handling is to read the file as a .NET System.Byte array with Get-Content -AsByteStream (PowerShell [Core]) / Get-Content -Encoding Byte (Windows PowerShell), but the only way you can further process such an array is to pipe it to a PowerShell command that is designed to handle a byte array, or to pass it to a .NET type's method that expects a byte array. If you tried to send such an array to an external program via the pipeline, each byte would be sent as its decimal string representation on its own line.

[3] Unicode is the name of the abstract standard describing a "global alphabet". In concrete use, it has various standard encodings, UTF-8 and UTF-16 being the most widely used.
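The last point of footnote [2] can be pictured with a short Python model. It only reproduces the described stringification (one decimal number per line, CRLF assumed as the line ending on Windows); the sample bytes are arbitrary.

```python
data = bytes([116, 104, 105])  # the raw bytes of "thi"

# Model of what the external program would receive on stdin:
# each byte as decimal text on its own line, not the byte itself.
sent_to_stdin = "".join(f"{b}\r\n" for b in data)

print(repr(sent_to_stdin))
```

Three input bytes turn into fifteen characters of text, so the receiving program never sees the original byte values.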




Answer 2:


cmd uses an 8-bit OEM code page. PowerShell uses Unicode.

Standard (and automatic) conversion would be from locale specific OEM to locale specific ANSI, then ANSI to Unicode.

See https://docs.microsoft.com/en-us/windows/console/console-code-pages

In Unicode, characters 0 - 31 and 128 - 159 are control characters and have no glyphs; 160 is the no-break space.

I got tired of there not being a Unicode character table (only ANSI) so I wrote one.

Dec Hex Name    OEM Type    Range   (OEM column: Unicode conversion of the OEM character)
0   0x0 ␀       Control     Control Codes
1   0x1 ␁   ☺   Control     Control Codes
2   0x2 ␂   ☻   Control     Control Codes
3   0x3 ␃   ♥   Control     Control Codes
4   0x4 ␄   ♦   Control     Control Codes
5   0x5 ␅   ♣   Control     Control Codes
6   0x6 ␆   ♠   Control     Control Codes
7   0x7 ␇   •   Control     Control Codes
8   0x8 ␈   ◘   Control     Control Codes
9   0x9 ␉   ○   Blank Control Space     Control Codes
10  0xA ␊   ◙   Control Space   Control Codes
11  0xB ␋   ♂   Control Space   Control Codes
12  0xC ␌   ♀   Control Space   Control Codes
13  0xD ␍   ♪   Control Space   Control Codes
14  0xE ␎   ♫   Control     Control Codes
15  0xF ␏   ☼   Control     Control Codes
16  0x10    ␐   ►   Control     Control Codes
17  0x11    ␑   ◄   Control     Control Codes
18  0x12    ␒   ↕   Control     Control Codes
19  0x13    ␓   ‼   Control     Control Codes
20  0x14    ␔   ¶   Control     Control Codes
21  0x15    ␕   §   Control     Control Codes
22  0x16    ␖   ▬   Control     Control Codes
23  0x17    ␗   ↨   Control     Control Codes
24  0x18    ␘   ↑   Control     Control Codes
25  0x19    ␙   ↓   Control     Control Codes
26  0x1A    ␚   →   Control     Control Codes
27  0x1B    ␛   ←   Control     Control Codes
28  0x1C    ␜   ∟   Control     Control Codes
29  0x1D    ␝   ↔   Control     Control Codes
30  0x1E    ␞   ▲   Control     Control Codes
31  0x1F    ␟   ▼   Control     Control Codes

Dec Hex     Char    Type    Range
32  0x20        Blank Space     Basic Latin
33  0x21    !   Punct   Basic Latin
34  0x22    "   Punct   Basic Latin
35  0x23    #   Punct   Basic Latin
36  0x24    $   Punct   Basic Latin
37  0x25    %   Punct   Basic Latin
38  0x26    &   Punct   Basic Latin
39  0x27    '   Punct   Basic Latin
40  0x28    (   Punct   Basic Latin
41  0x29    )   Punct   Basic Latin
42  0x2A    *   Punct   Basic Latin
43  0x2B    +   Punct   Basic Latin
44  0x2C    ,   Punct   Basic Latin
45  0x2D    -   Punct   Basic Latin
46  0x2E    .   Punct   Basic Latin
47  0x2F    /   Punct   Basic Latin
48  0x30    0   Number Hex  Basic Latin
49  0x31    1   Number Hex  Basic Latin
50  0x32    2   Number Hex  Basic Latin
51  0x33    3   Number Hex  Basic Latin
52  0x34    4   Number Hex  Basic Latin
53  0x35    5   Number Hex  Basic Latin
54  0x36    6   Number Hex  Basic Latin
55  0x37    7   Number Hex  Basic Latin
56  0x38    8   Number Hex  Basic Latin
57  0x39    9   Number Hex  Basic Latin
58  0x3A    :   Punct   Basic Latin
59  0x3B    ;   Punct   Basic Latin
60  0x3C    <   Punct   Basic Latin
61  0x3D    =   Punct   Basic Latin
62  0x3E    >   Punct   Basic Latin
63  0x3F    ?   Punct   Basic Latin
64  0x40    @   Punct   Basic Latin
65  0x41    A   Alpha Upper Hex     Basic Latin
66  0x42    B   Alpha Upper Hex     Basic Latin
67  0x43    C   Alpha Upper Hex     Basic Latin
68  0x44    D   Alpha Upper Hex     Basic Latin
69  0x45    E   Alpha Upper Hex     Basic Latin
70  0x46    F   Alpha Upper Hex     Basic Latin
71  0x47    G   Alpha Upper     Basic Latin
72  0x48    H   Alpha Upper     Basic Latin
73  0x49    I   Alpha Upper     Basic Latin
74  0x4A    J   Alpha Upper     Basic Latin
75  0x4B    K   Alpha Upper     Basic Latin
76  0x4C    L   Alpha Upper     Basic Latin
77  0x4D    M   Alpha Upper     Basic Latin
78  0x4E    N   Alpha Upper     Basic Latin
79  0x4F    O   Alpha Upper     Basic Latin
80  0x50    P   Alpha Upper     Basic Latin
81  0x51    Q   Alpha Upper     Basic Latin
82  0x52    R   Alpha Upper     Basic Latin
83  0x53    S   Alpha Upper     Basic Latin
84  0x54    T   Alpha Upper     Basic Latin
85  0x55    U   Alpha Upper     Basic Latin
86  0x56    V   Alpha Upper     Basic Latin
87  0x57    W   Alpha Upper     Basic Latin
88  0x58    X   Alpha Upper     Basic Latin
89  0x59    Y   Alpha Upper     Basic Latin
90  0x5A    Z   Alpha Upper     Basic Latin
91  0x5B    [   Punct   Basic Latin
92  0x5C    \   Punct   Basic Latin
93  0x5D    ]   Punct   Basic Latin
94  0x5E    ^   Punct   Basic Latin
95  0x5F    _   Punct   Basic Latin
96  0x60    `   Punct   Basic Latin
97  0x61    a   Alpha Lower Hex     Basic Latin
98  0x62    b   Alpha Lower Hex     Basic Latin
99  0x63    c   Alpha Lower Hex     Basic Latin
100 0x64    d   Alpha Lower Hex     Basic Latin
101 0x65    e   Alpha Lower Hex     Basic Latin
102 0x66    f   Alpha Lower Hex     Basic Latin
103 0x67    g   Alpha Lower     Basic Latin
104 0x68    h   Alpha Lower     Basic Latin
105 0x69    i   Alpha Lower     Basic Latin
106 0x6A    j   Alpha Lower     Basic Latin
107 0x6B    k   Alpha Lower     Basic Latin
108 0x6C    l   Alpha Lower     Basic Latin
109 0x6D    m   Alpha Lower     Basic Latin
110 0x6E    n   Alpha Lower     Basic Latin
111 0x6F    o   Alpha Lower     Basic Latin
112 0x70    p   Alpha Lower     Basic Latin
113 0x71    q   Alpha Lower     Basic Latin
114 0x72    r   Alpha Lower     Basic Latin
115 0x73    s   Alpha Lower     Basic Latin
116 0x74    t   Alpha Lower     Basic Latin
117 0x75    u   Alpha Lower     Basic Latin
118 0x76    v   Alpha Lower     Basic Latin
119 0x77    w   Alpha Lower     Basic Latin
120 0x78    x   Alpha Lower     Basic Latin
121 0x79    y   Alpha Lower     Basic Latin
122 0x7A    z   Alpha Lower     Basic Latin
123 0x7B    {   Punct   Basic Latin
124 0x7C    |   Punct   Basic Latin
125 0x7D    }   Punct   Basic Latin
126 0x7E    ~   Punct   Basic Latin
127 0x7F        Control     Basic Latin

Dec Hex     UTF ANSI    OEM Type    Range   (OEM column: ANSI conversion of the OEM character, e.g. ® replaced by R)
128 0x80        €   ¼   Control     Control Codes
129 0x81            ü   Control     Control Codes
130 0x82        ‚   →   Control     Control Codes
131 0x83        ƒ   Æ   Control     Control Codes
132 0x84        „   ▲   Control     Control Codes
133 0x85        …   &   Control Space   Control Codes
134 0x86        †       Control     Control Codes
135 0x87        ‡   !   Control     Control Codes
136 0x88        ˆ   ã   Control     Control Codes
137 0x89        ‰   0   Control     Control Codes
138 0x8A        Š   `   Control     Control Codes
139 0x8B        ‹   9   Control     Control Codes
140 0x8C        Œ   R   Control     Control Codes
141 0x8D            ì   Control     Control Codes
142 0x8E        Ž   }   Control     Control Codes
143 0x8F            Å   Control     Control Codes
144 0x90            É   Control     Control Codes
145 0x91        ‘   ↑   Control     Control Codes
146 0x92        ’   ↓   Control     Control Codes
147 0x93        “   ∟   Control     Control Codes
148 0x94        ”   ↔   Control     Control Codes
149 0x95        •   "   Control     Control Codes
150 0x96        –   ‼   Control     Control Codes
151 0x97        —   ¶   Control     Control Codes
152 0x98        ˜   ▄   Control     Control Codes
153 0x99        ™   "   Control     Control Codes
154 0x9A        š   a   Control     Control Codes
155 0x9B        ›   :   Control     Control Codes
156 0x9C        œ   S   Control     Control Codes
157 0x9D            Ø   Control     Control Codes
158 0x9E        ž   ~   Control     Control Codes
159 0x9F        Ÿ   x   Control     Control Codes
160 0xA0            á   Blank Space     Latin-1 Supplement
161 0xA1    ¡   ¡   í   Punct   Latin-1 Supplement
162 0xA2    ¢   ¢   ó   Punct   Latin-1 Supplement
163 0xA3    £   £   ú   Punct   Latin-1 Supplement
164 0xA4    ¤   ¤   ñ   Punct   Latin-1 Supplement
165 0xA5    ¥   ¥   Ñ   Punct   Latin-1 Supplement
166 0xA6    ¦   ¦   ª   Punct   Latin-1 Supplement
167 0xA7    §   §   º   Punct   Latin-1 Supplement
168 0xA8    ¨   ¨   ¿   Punct   Latin-1 Supplement
169 0xA9    ©   ©   ®   Punct   Latin-1 Supplement
170 0xAA    ª   ª   ¬   Alpha Lower Punct   Latin-1 Supplement
171 0xAB    «   «   ½   Punct   Latin-1 Supplement
172 0xAC    ¬   ¬   ¼   Punct   Latin-1 Supplement
173 0xAD    ­   ­   ¡   Control Punct   Latin-1 Supplement
174 0xAE    ®   ®   «   Punct   Latin-1 Supplement
175 0xAF    ¯   ¯   »   Punct   Latin-1 Supplement
176 0xB0    °   °   ░   Punct   Latin-1 Supplement
177 0xB1    ±   ±   ▒   Punct   Latin-1 Supplement
178 0xB2    ²   ²   ▓   Number Punct    Latin-1 Supplement
179 0xB3    ³   ³   │   Number Punct    Latin-1 Supplement
180 0xB4    ´   ´   ┤   Punct   Latin-1 Supplement
181 0xB5    µ   µ   Á   Alpha Lower Punct   Latin-1 Supplement
182 0xB6    ¶   ¶   Â   Punct   Latin-1 Supplement
183 0xB7    ·   ·   À   Punct   Latin-1 Supplement
184 0xB8    ¸   ¸   ©   Punct   Latin-1 Supplement
185 0xB9    ¹   ¹   ╣   Number Punct    Latin-1 Supplement
186 0xBA    º   º   ║   Alpha Lower Punct   Latin-1 Supplement
187 0xBB    »   »   ╗   Punct   Latin-1 Supplement
188 0xBC    ¼   ¼   ╝   Punct   Latin-1 Supplement
189 0xBD    ½   ½   ¢   Punct   Latin-1 Supplement
190 0xBE    ¾   ¾   ¥   Punct   Latin-1 Supplement
191 0xBF    ¿   ¿   ┐   Punct   Latin-1 Supplement
192 0xC0    À   À   └   Alpha Upper     Latin-1 Supplement
193 0xC1    Á   Á   ┴   Alpha Upper     Latin-1 Supplement
194 0xC2    Â   Â   ┬   Alpha Upper     Latin-1 Supplement
195 0xC3    Ã   Ã   ├   Alpha Upper     Latin-1 Supplement
196 0xC4    Ä   Ä   ─   Alpha Upper     Latin-1 Supplement
197 0xC5    Å   Å   ┼   Alpha Upper     Latin-1 Supplement
198 0xC6    Æ   Æ   ã   Alpha Upper     Latin-1 Supplement
199 0xC7    Ç   Ç   Ã   Alpha Upper     Latin-1 Supplement
200 0xC8    È   È   ╚   Alpha Upper     Latin-1 Supplement
201 0xC9    É   É   ╔   Alpha Upper     Latin-1 Supplement
202 0xCA    Ê   Ê   ╩   Alpha Upper     Latin-1 Supplement
203 0xCB    Ë   Ë   ╦   Alpha Upper     Latin-1 Supplement
204 0xCC    Ì   Ì   ╠   Alpha Upper     Latin-1 Supplement
205 0xCD    Í   Í   ═   Alpha Upper     Latin-1 Supplement
206 0xCE    Î   Î   ╬   Alpha Upper     Latin-1 Supplement
207 0xCF    Ï   Ï   ¤   Alpha Upper     Latin-1 Supplement
208 0xD0    Ð   Ð   ð   Alpha Upper     Latin-1 Supplement
209 0xD1    Ñ   Ñ   Ð   Alpha Upper     Latin-1 Supplement
210 0xD2    Ò   Ò   Ê   Alpha Upper     Latin-1 Supplement
211 0xD3    Ó   Ó   Ë   Alpha Upper     Latin-1 Supplement
212 0xD4    Ô   Ô   È   Alpha Upper     Latin-1 Supplement
213 0xD5    Õ   Õ   ı   Alpha Upper     Latin-1 Supplement
214 0xD6    Ö   Ö   Í   Alpha Upper     Latin-1 Supplement
215 0xD7    ×   ×   Î   Punct   Latin-1 Supplement
216 0xD8    Ø   Ø   Ï   Alpha Upper     Latin-1 Supplement
217 0xD9    Ù   Ù   ┘   Alpha Upper     Latin-1 Supplement
218 0xDA    Ú   Ú   ┌   Alpha Upper     Latin-1 Supplement
219 0xDB    Û   Û   █   Alpha Upper     Latin-1 Supplement
220 0xDC    Ü   Ü   ▄   Alpha Upper     Latin-1 Supplement
221 0xDD    Ý   Ý   ¦   Alpha Upper     Latin-1 Supplement
222 0xDE    Þ   Þ   Ì   Alpha Upper     Latin-1 Supplement
223 0xDF    ß   ß   ▀   Alpha Lower     Latin-1 Supplement
224 0xE0    à   à   Ó   Alpha Lower     Latin-1 Supplement
225 0xE1    á   á   ß   Alpha Lower     Latin-1 Supplement
226 0xE2    â   â   Ô   Alpha Lower     Latin-1 Supplement
227 0xE3    ã   ã   Ò   Alpha Lower     Latin-1 Supplement
228 0xE4    ä   ä   õ   Alpha Lower     Latin-1 Supplement
229 0xE5    å   å   Õ   Alpha Lower     Latin-1 Supplement
230 0xE6    æ   æ   µ   Alpha Lower     Latin-1 Supplement
231 0xE7    ç   ç   þ   Alpha Lower     Latin-1 Supplement
232 0xE8    è   è   Þ   Alpha Lower     Latin-1 Supplement
233 0xE9    é   é   Ú   Alpha Lower     Latin-1 Supplement
234 0xEA    ê   ê   Û   Alpha Lower     Latin-1 Supplement
235 0xEB    ë   ë   Ù   Alpha Lower     Latin-1 Supplement
236 0xEC    ì   ì   ý   Alpha Lower     Latin-1 Supplement
237 0xED    í   í   Ý   Alpha Lower     Latin-1 Supplement
238 0xEE    î   î   ¯   Alpha Lower     Latin-1 Supplement
239 0xEF    ï   ï   ´   Alpha Lower     Latin-1 Supplement
240 0xF0    ð   ð   ­   Alpha Lower     Latin-1 Supplement
241 0xF1    ñ   ñ   ±   Alpha Lower     Latin-1 Supplement
242 0xF2    ò   ò   ‗   Alpha Lower     Latin-1 Supplement
243 0xF3    ó   ó   ¾   Alpha Lower     Latin-1 Supplement
244 0xF4    ô   ô   ¶   Alpha Lower     Latin-1 Supplement
245 0xF5    õ   õ   §   Alpha Lower     Latin-1 Supplement
246 0xF6    ö   ö   ÷   Alpha Lower     Latin-1 Supplement
247 0xF7    ÷   ÷   ¸   Punct   Latin-1 Supplement
248 0xF8    ø   ø   °   Alpha Lower     Latin-1 Supplement
249 0xF9    ù   ù   ¨   Alpha Lower     Latin-1 Supplement
250 0xFA    ú   ú   ·   Alpha Lower     Latin-1 Supplement
251 0xFB    û   û   ¹   Alpha Lower     Latin-1 Supplement
252 0xFC    ü   ü   ³   Alpha Lower     Latin-1 Supplement
253 0xFD    ý   ý   ²   Alpha Lower     Latin-1 Supplement
254 0xFE    þ   þ   ■   Alpha Lower     Latin-1 Supplement
255 0xFF    ÿ   ÿ       Alpha Lower     Latin-1 Supplement
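For readers who want to regenerate rows like those for 161 - 255 above, here is a hedged Python sketch. It assumes that the OEM column in that range corresponds to code page 850 and the ANSI column to Windows-1252, which matches the sampled rows but is not stated by the author; the function name `table_row` is made up for illustration.

```python
import unicodedata


def table_row(code: int):
    """Return (dec, hex, Unicode char, ANSI char, OEM char, Unicode category)."""
    raw = bytes([code])
    as_unicode = chr(code)                            # code point with the same number
    as_ansi = raw.decode("cp1252", errors="replace")  # Windows-1252 assumed for ANSI
    as_oem = raw.decode("cp850")                      # code page 850 assumed for OEM
    category = unicodedata.category(as_unicode)       # e.g. "Lu", "Ll", "Po"
    return (code, f"0x{code:X}", as_unicode, as_ansi, as_oem, category)


for code in (0xA1, 0xB5, 0xD5):
    print(table_row(code))
```

The two-letter category codes come from the Unicode standard and are not the same labels as the Type column in the table, so they are only a rough guide.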



Source: https://stackoverflow.com/questions/59110563/different-behaviour-and-output-when-piping-through-cmd-and-powershell
