Text Encoding between Linux and Windows

Posted by 拥有回忆 on 2021-02-20 00:14:54

Question


The main question I have is how can I get a textfile that I have in Linux to display properly in PowerShell.

In Linux, I have text files with some special characters, and in fact Notepad displays the text file exactly as it is displayed in Linux (screenshot omitted).

Unfortunately, my program prints to my Linux Terminal, and thus I need the same output in my Windows terminal. I have seen through other answers that

  1. I need to use a TrueType font, so I am using Lucida Console
  2. on my Linux device, the encoding is UTF-8. According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8
  3. Windows PowerShell is better equipped to display content, so while I have tried using the Command Prompt, I am now working in PowerShell.

Using CHCP 65001 and then typing

more my_file.txt

displays this (screenshot omitted):

while using

Get-Content -Encoding UTF8 my_file.txt

outputs this (screenshot omitted):

Neither of these results is good enough, and I am concerned that Get-Content behaves differently at all here. The code that I am transferring to Windows is written in Free Pascal, and in Free Pascal I can specify a UTF-8 code page, but that's it. So while Get-Content is a good command for checking whether PowerShell is capable of producing the desired output, it is not practical for me to use. In Pascal, the output (which is written to the PowerShell display) appears as follows (screenshot omitted):

This is bad as well: those lines should connect, as they do in Linux (and some characters are obviously rendered as ?). However, this might be a problem with the code page chosen in Pascal, which would be a next step.

My question right now is: how can I get Windows PowerShell to display a text file, by default, as it is shown in the Notepad version? It is not practical for me to run Get-Content everywhere in my code, so although that result appears more promising, I cannot pursue it.

As a follow-up question, because I could not find it anywhere online: what are the main players here when it comes to displaying content? It is clearly a bigger story than just the encoding. Why do 'more' and 'Get-Content' display different outputs? And why can 'Get-Content' not read all of the content? I had assumed UTF-8 was a universal standard, and that programs that can read UTF-8 could at least read all of the characters, but they all read it differently.

The input, as text, is:

    ╭─────╮
    │     │
  ╭─│───╮ │
  │ │   │ │
  │ │ ╭─│───╮
  │ │ │ │ │ │
╭─│───│─╯ │ │
│ │ │ │   │ │
│ │ ╰─╯   │ │
│ │       │ │
│ ╰───────│─╯
│         │
╰─────────╯

In response to an answer posted below, I can see that

more my_file.txt

produces the following (screenshot omitted)

when using

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = 
  New-Object System.Text.UTF8Encoding 

Answer 1:


  • Make sure that your UTF-8-encoded text file has a BOM - otherwise, your file will be misinterpreted by Windows PowerShell as being encoded based on the system's active ANSI code page (whereas PowerShell [Core] 6+ now thankfully consistently defaults to UTF-8 in the absence of a BOM).

    • Alternatively, use Get-Content -Encoding Utf8 my_file.txt to explicitly specify the file's encoding.

    • For a comprehensive discussion of character encoding in Windows PowerShell vs. PowerShell [Core], see this answer.

  • For output from external programs to be correctly captured in a variable or correctly redirected to a file, you need to set [Console]::OutputEncoding to the character encoding that the given program uses on output (for merely printing to the display this may not be necessary, however):

    • If code page 65001 (UTF-8) is in effect and your program honors that, you'll need to set [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding; see below for how to ensure that 65001 is truly in effect, given that running chcp 65001 from inside PowerShell is not effective.

    • You mention FreePascal, whose Unicode support is described here.
      However, your screen shot implies that your FreePascal program's output is not UTF-8, because the rounded-corner characters were transcoded to ? characters (which suggests a lossy transcoding to the system's OEM code page, where these characters aren't present).

    • Therefore, to solve your problem you must ensure that your FreePascal program either unconditionally outputs UTF-8 or honors the active code page (as reported by chcp), assuming you've first set it to 65001 (the UTF-8 code page; see below).

  • Choose a font that can render the rounded-corner Unicode characters (such as ╭, U+256D) in your console window; the Windows PowerShell default font, Lucida Console, cannot (as shown in your question), but Consolas, for instance (which PowerShell [Core] 6+ uses by default), can.
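To make the BOM point above concrete, here is a minimal sketch (in Python, used purely for illustration; the file itself can come from any tool) of what the UTF-8 BOM actually is on disk. Python's "utf-8-sig" codec prepends it on write, which is exactly the signature Windows PowerShell looks for:

```python
# The UTF-8 BOM is the three bytes EF BB BF. Without it, Windows
# PowerShell falls back to interpreting the file per the system's
# active ANSI code page; with it, the file is recognized as UTF-8.
text = "╭─╮\n"

with_bom = text.encode("utf-8-sig")   # BOM + UTF-8 bytes
without_bom = text.encode("utf-8")    # UTF-8 bytes only

assert with_bom == b"\xef\xbb\xbf" + without_bom
print(with_bom[:3].hex())  # → efbbbf
```

A file written with `open(path, "w", encoding="utf-8-sig")` will therefore display correctly in Windows PowerShell without any extra `-Encoding` argument.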


Using UTF-8 encoding with external programs consistently:

Note:

  • The command below is neither necessary for nor does it have any effect on PowerShell commands such as the Get-Content cmdlet.

  • Some legacy console applications - notably more.com (which Windows PowerShell wraps in a more function) - fundamentally do not support Unicode, only the legacy OEM code pages.[*]

As for "According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8":

chcp 65001 does not work if run from within PowerShell, because .NET caches the [Console]::OutputEncoding value at PowerShell session startup, with the code page that was in effect at that time.

Instead, you can use the following to fully make a console window UTF-8 aware (which implicitly also makes chcp report 65001 afterwards):

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
                    New-Object System.Text.UTF8Encoding

This makes PowerShell interpret an external program's output as UTF-8, and also encode the data it sends to external programs as UTF-8 (thanks to the preference variable $OutputEncoding).
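On the program side, "unconditionally outputting UTF-8" (as recommended for the FreePascal program above) simply means writing the UTF-8 byte sequence for each character to stdout, regardless of the console's code page. Sketched in Python rather than FreePascal:

```python
import sys

# Encode the box-drawing characters as UTF-8 and write the raw bytes:
# "╭" (U+256D) becomes E2 95 AD, "─" (U+2500) becomes E2 94 80,
# "╮" (U+256E) becomes E2 95 AE.
line = "╭─╮\n"
data = line.encode("utf-8")
assert data == b"\xe2\x95\xad\xe2\x94\x80\xe2\x95\xae\n"

# Writing to the raw byte stream bypasses the runtime's own
# code-page-based transcoding of text output.
sys.stdout.buffer.write(data)
```

Once [Console]::OutputEncoding is set to UTF-8 as shown above, PowerShell decodes these bytes back into the intended characters.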

See this answer for more information.


[*] With the UTF-8 code page 65001 in effect, more quietly skips lines that contain at least one Unicode character that cannot be mapped onto the system's OEM code page (any character not present in that single-byte code page, which can only represent 256 characters); in this case that applies to the lines containing the rounded-corner characters such as ╭ (BOX DRAWINGS LIGHT ARC DOWN AND RIGHT, U+256D).
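The lossy mapping described in this footnote can be reproduced directly (sketched in Python; CP437 is one common OEM code page, yours may be CP850 or similar). The straight box-drawing characters exist in the OEM set, but the rounded corners do not:

```python
# Straight box-drawing characters have code points in OEM CP437:
assert "─".encode("cp437") == b"\xc4"
assert "│".encode("cp437") == b"\xb3"

# ...but the rounded corner U+256D does not exist in CP437, so a
# strict conversion fails; this is why tools limited to the OEM
# code page (like more.com) mangle or drop lines containing it.
try:
    "╭".encode("cp437")
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed
```

This is the same distinction visible in the question's screenshots: the straight segments survive the round-trip through the OEM code page, while the arcs come out as ? or disappear.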



Source: https://stackoverflow.com/questions/60727039/text-encoding-between-linux-and-windows
