I am trying to convert an XML file from Latin1 to UTF-8 and the other way around. I have been doing some tests, but I have not managed to get it working. I'm using:
Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml
But this is not converting anything. So I tried to specify the encoding in Get-Content, but Latin1 is not recognized in PowerShell (or that's what the error message is telling me).
What's the best way to do this?
The fastest method, especially with large XML files, is to use the .NET System.IO.File class.
Use ReadAllText with an explicitly specified Latin-1 encoding:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | Set-Content r:\2.txt -Encoding UTF8
If your XML file has
<?xml version="1.0" encoding="iso-8859-1" ?>
it needs to be changed too:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).Replace('<?xml version="1.0" encoding="iso-8859-1"', '<?xml version="1.0" encoding="UTF-8"') | Set-Content r:\2.txt -Encoding UTF8
To write Latin-1 encoding, use WriteAllText with an explicitly specified Latin-1 encoding:
[IO.File]::WriteAllText('r:\2.txt', [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).Replace('<?xml version="1.0" encoding="UTF-8"', '<?xml version="1.0" encoding="iso-8859-1"'), [Text.Encoding]::GetEncoding('iso-8859-1'))
Memory-efficient transcoding that can process files of any size (1TB? no problem!):
function transcodeXML(
    [ValidateScript({Test-Path -Literal $_})]
    [string]$source,

    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850',
        'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863',
        'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis',
        'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141',
        'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148',
        'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252',
        'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257',
        'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad',
        'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic',
        'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce',
        'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE',
        'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005',
        'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261',
        'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290',
        'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r',
        'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025',
        'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5',
        'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15',
        'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr',
        'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de',
        'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka',
        'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$sourceEncoding,

    [ValidateScript({Test-Path -Literal $_ -IsValid})]
    [string]$target,

    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850',
        'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863',
        'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis',
        'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141',
        'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148',
        'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252',
        'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257',
        'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad',
        'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic',
        'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce',
        'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE',
        'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005',
        'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261',
        'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290',
        'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r',
        'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025',
        'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5',
        'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15',
        'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr',
        'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de',
        'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka',
        'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$targetEncoding
) {
    $reader = [IO.StreamReader]::new(
        $source,
        [Text.Encoding]::GetEncoding($sourceEncoding)
    )
    $writer = [IO.StreamWriter]::new(
        $target,
        $false, # don't append = overwrite
        [Text.Encoding]::GetEncoding($targetEncoding)
    )
    $buf = [char[]]::new(16MB)
    $nRead = $reader.Read($buf, 0, $buf.Length)
    $writer.Write(
        ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
            [string]::new($buf, 0, $nRead),
            '$1' + $targetEncoding,
            1 # speedup: one replacement only
        )
    )
    while (!$reader.EndOfStream) {
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write($buf, 0, $nRead)
    }
    $reader.Close()
    $writer.Close()
}
Usage:
transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
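As a quick illustration of what the conversions above change at the byte level, and of why the source encoding has to be stated explicitly when reading, here is a minimal in-memory sketch. The variable names are made up for this example and are not part of the answer's code.

```powershell
# Illustrative only: the variable names below are assumptions for this sketch.
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8   = [Text.Encoding]::UTF8

$text = [string][char]0xE9               # U+00E9, "e" with acute accent

$latin1Bytes = $latin1.GetBytes($text)   # one byte for this character in Latin-1
$utf8Bytes   = $utf8.GetBytes($text)     # two bytes for this character in UTF-8

# Render the bytes as hex so the difference is easy to see.
$latin1Hex = ($latin1Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$utf8Hex   = ($utf8Bytes   | ForEach-Object { $_.ToString('X2') }) -join ' '

"Latin-1: $latin1Hex"
"UTF-8:   $utf8Hex"

# Decoding with the wrong encoding does not fail; it silently produces
# different characters (here two characters instead of one).
$misread = $latin1.GetString($utf8Bytes)
"Misread length: $($misread.Length)"
```

The same character is a single byte (E9) in Latin-1 and two bytes (C3 A9) in UTF-8, so a file read with the wrong encoding comes out garbled rather than raising an error.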
I would suggest loading the XML into a System.Xml.Linq.XDocument with the Load method, then changing the Encoding property of its Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx) as needed, or adding a declaration if Declaration is null. Finally, use the Save method to save the changed document.
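The XDocument approach could look roughly like the following sketch. The function name Convert-XmlEncoding and its parameter names are hypothetical, chosen only for this example. It assumes full paths are passed, since .NET resolves relative paths against its own current directory, which can differ from PowerShell's location.

```powershell
# Make sure the LINQ to XML types are loaded (needed in Windows PowerShell).
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper: name and parameters are assumptions for illustration.
function Convert-XmlEncoding {
    param(
        [string]$Source,          # full path of the existing XML file
        [string]$Target,          # full path of the file to write
        [string]$TargetEncoding   # e.g. 'utf-8' or 'iso-8859-1'
    )

    # Load detects the source encoding from the BOM or the XML declaration.
    $doc = [System.Xml.Linq.XDocument]::Load($Source)

    if ($null -eq $doc.Declaration) {
        # No declaration present: add one carrying the target encoding.
        $doc.Declaration = [System.Xml.Linq.XDeclaration]::new(
            '1.0', $TargetEncoding, [NullString]::Value)
    } else {
        $doc.Declaration.Encoding = $TargetEncoding
    }

    # Save writes the file using the encoding named in the declaration.
    $doc.Save($Target)
}
```

Usage, with the same example paths as above:

```powershell
Convert-XmlEncoding 'r:\1.xml' 'r:\2.xml' 'utf-8'
```

Because the whole document is parsed into memory, this suits small to medium files; it also fails on malformed XML, which the plain text-based approaches would copy through unchanged.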
Source: https://stackoverflow.com/questions/39869353/convert-xml-latin1-to-utf-8-and-other-way-around