Question
I am trying to convert an XML file from Latin-1 to UTF-8 and back again. I have run some tests, but I can't get it to work. I'm using
Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml
But this does not convert anything. So I tried to specify the encoding in Get-Content, but Latin1 is not recognized by PowerShell (or at least that's what the error message tells me).
What's the best way to do this?
Answer 1:
The fastest method, especially with large XML files, is to use the .NET System.IO.File class.
Use ReadAllText with an explicitly provided Latin-1 encoding:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | Set-Content r:\2.txt -Encoding UTF8
If your XML file starts with a declaration such as
<?xml version="1.0" encoding="iso-8859-1" ?>
the declared encoding needs to be changed too:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
    Replace('<?xml version="1.0" encoding="iso-8859-1"',
            '<?xml version="1.0" encoding="UTF-8"') |
    Set-Content r:\2.txt -Encoding UTF8
To write Latin-1, use WriteAllText with an explicitly provided Latin-1 encoding:
[IO.File]::WriteAllText(
    'r:\2.txt',
    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
        Replace('<?xml version="1.0" encoding="UTF-8"',
                '<?xml version="1.0" encoding="iso-8859-1"'),
    [Text.Encoding]::GetEncoding('iso-8859-1')
)
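To check that a conversion really changed the bytes on disk, it can help to look at how a single non-ASCII character is stored under each encoding. The following is only an illustrative sketch, not part of the original answer; the sample character and the variable names are assumptions chosen for the demonstration.

```powershell
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8   = [Text.UTF8Encoding]::new($false)   # $false = do not emit a byte order mark

# U+00E9 (e with acute accent), built from its code point so the script
# does not depend on the encoding of the script file itself.
$sample = [string][char]0xE9

# Latin-1 stores U+00E9 as one byte; UTF-8 needs two bytes for it.
$latin1Bytes = $latin1.GetBytes($sample)
$utf8Bytes   = $utf8.GetBytes($sample)

# Transcoding = decode with the source encoding, re-encode with the target one.
$backToLatin1 = [Text.Encoding]::Convert($utf8, $latin1, $utf8Bytes)
$roundTrip    = $latin1.GetString($backToLatin1)

'{0:X2}' -f $latin1Bytes[0]                                   # E9
($utf8Bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' '    # C3 A9
```

If the output file still contains a lone E9 byte after a supposed Latin-1 to UTF-8 conversion, the file was most likely read with the wrong source encoding.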
Memory-efficient transcoding that can process files of any size (1TB? no problem!):
function transcodeXML(
    [ValidateScript({Test-Path -Literal $_})]
    [string]$source,

    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$sourceEncoding,

    [ValidateScript({Test-Path -Literal $_ -IsValid})]
    [string]$target,

    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$targetEncoding
) {
    $reader = [IO.StreamReader]::new(
        $source,
        [Text.Encoding]::GetEncoding($sourceEncoding)
    )
    $writer = [IO.StreamWriter]::new(
        $target,
        $false, # don't append = overwrite
        [Text.Encoding]::GetEncoding($targetEncoding)
    )
    $buf = [char[]]::new(16MB)
    $nRead = $reader.Read($buf, 0, $buf.Length)
    $writer.Write(
        ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
            [string]::new($buf, 0, $nRead),
            '$1' + $targetEncoding,
            1 # speedup: one replacement only
        )
    )
    while (!$reader.EndOfStream) {
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write($buf, 0, $nRead)
    }
    $reader.Close()
    $writer.Close()
}
Usage:
transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
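The declaration rewrite inside the function can be tried on its own. The snippet below is a small sketch that applies the same regular expression to a sample string; the sample declaration and the variable names $declaration, $pattern and $rewritten are assumptions made up for illustration.

```powershell
$sourceEncoding = 'iso-8859-1'
$targetEncoding = 'utf-8'

# Hypothetical input; the declared name is upper case on purpose to show
# that (?i) makes the encoding name match case-insensitively.
$declaration = '<?xml version="1.0" encoding="ISO-8859-1" ?><root/>'

# Group 1 captures everything up to and including the opening quote,
# the lookahead (?="") requires the closing quote without consuming it.
$pattern = [regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")"

# Keep group 1, swap in the target encoding name, replace at most once.
$rewritten = $pattern.Replace($declaration, '$1' + $targetEncoding, 1)
$rewritten   # <?xml version="1.0" encoding="utf-8" ?><root/>
```

Because only the first buffer is searched, this approach relies on the declaration sitting at the very start of the file, which is where XML requires it to be.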
Answer 2:
I would suggest loading the XML into a System.Xml.Linq.XDocument with the Load method, then changing the Encoding property of its Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx) as needed, or adding a declaration if Declaration is null. Finally, use the Save method to save the changed document.
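A minimal sketch of this approach in PowerShell might look like the following. The function name Convert-XmlEncoding and its parameter names are hypothetical, chosen only for this example. It assumes the source file declares its real encoding, so that Load decodes it correctly.

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper: load an XML file, set the declared encoding, save it.
function Convert-XmlEncoding {
    param(
        [string]$Source,        # full path of the existing XML file
        [string]$Target,        # full path of the file to write
        [string]$EncodingName   # e.g. 'utf-8' or 'iso-8859-1'
    )

    # Load honors the encoding named in the file's own XML declaration.
    $doc = [System.Xml.Linq.XDocument]::Load($Source)

    if ($null -eq $doc.Declaration) {
        # No declaration yet: add one (version, encoding, standalone).
        $doc.Declaration = [System.Xml.Linq.XDeclaration]::new('1.0', $EncodingName, $null)
    } else {
        $doc.Declaration.Encoding = $EncodingName
    }

    # Save(string) writes the file in the encoding named by the declaration.
    $doc.Save($Target)
}
```

Usage, with the same example paths as above:

```powershell
Convert-XmlEncoding 'r:\1.xml' 'r:\2.xml' 'utf-8'
```

Full paths are the safer choice here, because .NET resolves relative paths against the process working directory, which can differ from the current PowerShell location. Unlike the text-based methods, this parses the whole document into memory, so it suits small to medium files better than very large ones.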
Source: https://stackoverflow.com/questions/39869353/convert-xml-latin1-to-utf-8-and-other-way-around