Unable to change encoding of text files in Windows

Submitted by 旧巷老猫 on 2021-02-08 08:19:10

Question


I have some text files with different encodings. Some of them are UTF-8 and some others are Windows-1251 encoded. I tried to execute the following recursive script to encode them all to UTF-8.

Get-ChildItem *.nfo -Recurse | ForEach-Object {
  $content = $_ | Get-Content
  Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}

After that I am unable to use the files in my Java program: the files that were already UTF-8 encoded now have the wrong encoding and I can't get the original text back, and for the Windows-1251-encoded files I get empty output, just as with the original files. So the script corrupts the files that were already UTF-8 encoded.

I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter.

$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile 
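For a single file whose encoding is already known, a call would look something like this (hypothetical file names, with CP1251 as the source encoding):

$ iconv -f CP1251 -t UTF-8 input.nfo -o output.nfo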

Differently encoded files are mixed within the folder structure, so the files need to stay at their current paths.

The system uses code page 852. The existing UTF-8 files have no BOM.


Answer 1:


In Windows PowerShell you won't be able to use the built-in cmdlets for two reasons:

  • From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.

  • Using Set-Content (and similar) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.
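If you want to double-check both of these points on the machine in question, a quick test in Windows PowerShell could look like this (test.txt is just a throwaway file name):

# Windows PowerShell runs on the .NET Framework, where [Text.Encoding]::Default
# reflects the legacy "ANSI" code page.
[Console]::OutputEncoding.CodePage   # OEM code page of the console, e.g. 852
[Text.Encoding]::Default.CodePage    # "ANSI" code page, e.g. 1250

Set-Content -Encoding UTF8 test.txt 'hi'
(Get-Content -Encoding Byte test.txt)[0..2] | ForEach-Object { '{0:X2}' -f $_ }  # EF BB BF: a UTF-8 BOM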

Note: PowerShell Core actually defaults to BOM-less UTF-8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so there you could solve your problem with the built-in cmdlets, needing direct use of the .NET framework only to construct an encoding instance.
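As a sketch of that PowerShell Core route, for a single file that is already known to be Windows-1251-encoded (sample.nfo is a placeholder name):

# PowerShell Core only: -Encoding accepts a [System.Text.Encoding] instance,
# and 'utf8' means BOM-less UTF-8 here.
$enc1251 = [System.Text.Encoding]::GetEncoding(1251)
$text = Get-Content -Raw -Encoding $enc1251 -LiteralPath sample.nfo
Set-Content -NoNewline -Encoding utf8 -LiteralPath sample.nfo -Value $text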

In Windows PowerShell you must therefore use the .NET framework directly:

Get-ChildItem *.nfo -Recurse | ForEach-Object {

  $file = $_.FullName

  $mustReWrite = $false
  # Try to read as UTF-8 first and throw an exception if
  # invalid-as-UTF-8 bytes are encountered.
  try {
    # Assign to $null: only success or failure of the decoding matters here.
    $null = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
  } catch [System.Text.DecoderFallbackException] {
    # Fall back to Windows-1251
    $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
    $mustReWrite = $true
  }

  # Rewrite as UTF-8 without BOM (the .NET Framework's default)
  if ($mustReWrite) {
    Write-Verbose "Converting from 1251 to UTF-8: $file"
    [IO.File]::WriteAllText($file, $content)
  } else {
    Write-Verbose "Already UTF-8-encoded: $file"
  }

}

Note: As in your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
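Note: The Write-Verbose messages only show up if verbose output is enabled in the session, e.g. by setting:

$VerbosePreference = 'Continue'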

Note:

  • If an input file comprises only bytes in the ASCII range (7-bit values), it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII.

  • It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules around what bit patterns are allowed where).
    Such a file would not contain meaningful Windows-1251 content, however.

  • There is no need to implement a fallback strategy for decoding with Windows-1251, because, as a single-byte encoding, it places no technical restrictions on what byte values can occur where, so decoding with it can never fail (see the small demonstration below).
    Generally, in the absence of external information (or a BOM), there is no simple and robust way to infer a file's encoding just from its content (though heuristics can be employed).
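A minimal demonstration of that asymmetry (the byte values are arbitrary examples):

$bytes = [byte[]] (0xC0, 0x20)                              # not a valid UTF-8 sequence
[Text.Encoding]::GetEncoding(1251).GetString($bytes)        # succeeds: every byte maps to a character
[Text.UTF8Encoding]::new($false, $true).GetString($bytes)   # throws a DecoderFallbackException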



Source: https://stackoverflow.com/questions/53282171/unable-to-change-encoding-of-text-files-in-windows
