Question:
I have some text files with different encodings. Some of them are UTF-8 and the others are Windows-1251 encoded. I tried to execute the following recursive script to convert them all to UTF-8.
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $content = $_ | Get-Content
    Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}
After that I am unable to use the files in my Java program: the UTF-8-encoded files now have the wrong encoding as well, and I couldn't get the original text back. For the Windows-1251-encoded files I get empty output, just as with the original files. So the script corrupts even the files that were already UTF-8-encoded.
I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter:
$ iconv [options] -f from-encoding -t to-encoding inputfile(s) -o outputfile
The differently encoded files are mixed throughout a folder structure, so the files should stay at their current paths.
The system uses code page 852. The existing UTF-8 files are without a BOM.
Answer 1:
In Windows PowerShell you won't be able to use the built-in cmdlets, for two reasons:

- From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.
- Using Set-Content (and similar) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.
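To see the second point for yourself, here is a quick check you can run in Windows PowerShell (the file name test.txt is just an illustration); the first three bytes written are the UTF-8 BOM, EF BB BF:

# Windows PowerShell: Set-Content -Encoding UTF8 always writes a BOM.
Set-Content -Encoding UTF8 test.txt 'hello'
# Inspect the first three bytes; "$PWD\test.txt" avoids relying on the
# process working directory, which can differ from PowerShell's location.
[IO.File]::ReadAllBytes("$PWD\test.txt")[0..2] | ForEach-Object { '{0:X2}' -f $_ }  # -> EF BB BF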
Note: PowerShell Core actually defaults to BOM-less UTF-8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so you could solve your problem with the built-in cmdlets there, while needing direct use of the .NET framework only to construct an encoding instance.
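For illustration, here is a minimal sketch of what that could look like in PowerShell Core, mirroring the detection logic of the solution below; note that -ErrorAction Stop is an assumption here, used to turn the decoding failure into a catchable error:

Get-ChildItem *.nfo -Recurse | ForEach-Object {
    # Strict UTF-8: throw on invalid bytes instead of substituting U+FFFD.
    $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)
    try {
        # Succeeds only if the file is valid UTF-8; discard the content.
        $null = Get-Content -Raw -Encoding $strictUtf8 -ErrorAction Stop $_.FullName
    } catch {
        # Not valid UTF-8 -> read as Windows-1251, rewrite as BOM-less UTF-8.
        $content = Get-Content -Raw -Encoding ([Text.Encoding]::GetEncoding(1251)) $_.FullName
        Set-Content -NoNewline -Encoding utf8NoBOM $_.FullName $content
    }
}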
In Windows PowerShell, you must therefore use the .NET Framework directly:
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first; a strict UTF-8 encoding instance throws
    # an exception if invalid-as-UTF-8 bytes are encountered.
    try {
        # We only care whether decoding succeeds, so discard the result
        # ($null = ...) rather than letting it leak into the pipeline.
        $null = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
    } catch [System.Text.DecoderFallbackException] {
        # Fall back to Windows-1251.
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without a BOM (the .NET Framework's default).
    if ($mustReWrite) {
        Write-Verbose "Converting from 1251 to UTF-8: $file"
        [IO.File]::WriteAllText($file, $content)
    } else {
        Write-Verbose "Already UTF-8-encoded: $file"
    }
}
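A usage note (assuming the snippet is run directly at the prompt): Write-Verbose is silent by default, so to see which files get converted, enable verbose output for the session first:

$VerbosePreference = 'Continue'          # show Write-Verbose messages
# ... run the snippet above ...
$VerbosePreference = 'SilentlyContinue'  # restore the default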
Note: As in your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
Note:

- If an input file comprises only bytes with ASCII-range characters (7-bit), it is by definition also UTF-8-encoded, because UTF-8 is a superset of the ASCII encoding.
- It is highly unlikely with real-world input, but, purely technically, a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules about what bit patterns are allowed where; see the small demonstration after this list). Such a file would not contain meaningful Windows-1251 content, however.
- Conversely, there is no reason to implement a fallback strategy for decoding with Windows-1251, because there are no technical restrictions on what bit patterns can occur where.
- Generally, in the absence of external information (or a BOM), there is no simple and robust way to infer a file's encoding from its content alone (though heuristics can be employed).
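To make the strict-UTF-8 probe concrete, here is a small self-contained demonstration; the two byte values are merely illustrative (they spell "За" in Windows-1251 but form an invalid UTF-8 sequence):

$strictUtf8 = [Text.UTF8Encoding]::new($false, $true)    # throw on invalid bytes
$bytes = [byte[]] (0xC7, 0xE0)                           # "За" in Windows-1251
try {
    $strictUtf8.GetString($bytes)                        # throws: 0xE0 is not a valid continuation byte
} catch [System.Text.DecoderFallbackException] {
    [Text.Encoding]::GetEncoding(1251).GetString($bytes) # -> За
}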
Source: https://stackoverflow.com/questions/53282171/unable-to-change-encoding-of-text-files-in-windows