I am trying to convert an XML file from Latin1 to UTF-8 and back again. I have run some tests, but so far without success. I'm using
Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml
But this does not convert anything. So I tried to specify the encoding in Get-Content, but Latin1 is not recognized in PowerShell (or that's what the error message is telling me).
What's the best way to do this?
The fastest method, especially with large XML files, is to use the .NET System.IO.File class.
Use ReadAllText with an explicitly provided Latin-1 encoding:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | Set-Content r:\2.txt -Encoding UTF8
If your XML file has a declaration such as
<?xml version="1.0" encoding="iso-8859-1" ?>
it needs to be changed too:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).Replace('<?xml version="1.0" encoding="iso-8859-1"', '<?xml version="1.0" encoding="UTF-8"') | Set-Content r:\2.txt -Encoding UTF8
To write Latin-1, use WriteAllText with an explicitly provided Latin-1 encoding:
[IO.File]::WriteAllText(
    'r:\2.txt',
    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).Replace('<?xml version="1.0" encoding="UTF-8"', '<?xml version="1.0" encoding="iso-8859-1"'),
    [Text.Encoding]::GetEncoding('iso-8859-1')
)
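To sanity-check either direction, it can help to look at the raw bytes. The following is a minimal sketch rather than part of the answer above; the scratch file names and variable names are made up for illustration. It writes a one-character Latin-1 file, transcodes it with the same ReadAllText/WriteAllText calls, and prints both byte sequences.

```powershell
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
# UTF8Encoding($false) writes no byte order mark, which keeps the comparison simple
$utf8NoBom = [Text.UTF8Encoding]::new($false)

# Hypothetical scratch files in the temp directory
$src = Join-Path ([IO.Path]::GetTempPath()) 'latin1-demo.txt'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'utf8-demo.txt'

# U+00E9 (e with acute accent) is one byte in Latin-1 and two bytes in UTF-8
[IO.File]::WriteAllText($src, [string][char]0xE9, $latin1)
[IO.File]::WriteAllText($dst, [IO.File]::ReadAllText($src, $latin1), $utf8NoBom)

$srcHex = [BitConverter]::ToString([IO.File]::ReadAllBytes($src))
$dstHex = [BitConverter]::ToString([IO.File]::ReadAllBytes($dst))
"Latin-1 bytes: $srcHex"   # E9
"UTF-8 bytes:   $dstHex"   # C3-A9
```

If the second line shows the same single byte as the first, nothing was actually transcoded.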
Memory-efficient transcoding that can process files of any size (1TB? no problem!):
function transcodeXML(
    [ValidateScript({Test-Path -Literal $_})]
    [string]$source,
    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850',
        'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863',
        'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis',
        'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142',
        'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16',
        'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256',
        'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic',
        'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce',
        'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten',
        'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii',
        'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285',
        'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r',
        'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025',
        'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7',
        'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP',
        'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030',
        'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma',
        'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$sourceEncoding,
    [ValidateScript({Test-Path -Literal $_ -IsValid})]
    [string]$target,
    [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850',
        'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863',
        'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis',
        'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142',
        'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16',
        'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256',
        'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic',
        'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce',
        'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten',
        'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii',
        'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285',
        'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r',
        'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025',
        'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7',
        'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP',
        'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030',
        'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma',
        'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
    [string]$targetEncoding
) {
    $reader = [IO.StreamReader]::new(
        $source,
        [Text.Encoding]::GetEncoding($sourceEncoding)
    )
    $writer = [IO.StreamWriter]::new(
        $target,
        $false, # don't append = overwrite
        [Text.Encoding]::GetEncoding($targetEncoding)
    )
    $buf = [char[]]::new(16MB)
    $nRead = $reader.Read($buf, 0, $buf.Length)
    $writer.Write(
        ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
            [string]::new($buf, 0, $nRead),
            '$1' + $targetEncoding,
            1 # speedup: one replacement only
        )
    )
    while (!$reader.EndOfStream) {
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write($buf, 0, $nRead)
    }
    $reader.Close()
    $writer.Close()
}
Usage:
transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
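The declaration rewrite that transcodeXML applies to the first chunk can be tried on a plain string. This is a small sketch using the same pattern; the sample XML string and the variable names `$sample`, `$pattern`, and `$result` are invented for illustration.

```powershell
$sourceEncoding = 'iso-8859-1'
$targetEncoding = 'utf-8'

# Invented sample; note the upper-case encoding name in the declaration
$sample = '<?xml version="1.0" encoding="ISO-8859-1" ?><root>caf&#233;</root>'

# Same pattern as in transcodeXML: group 1 captures everything up to the opening
# quote of the encoding value, (?i) makes the encoding name case-insensitive,
# and the lookahead requires the closing quote without consuming it
$pattern = [regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")"

# Third argument limits the operation to a single replacement
$result = $pattern.Replace($sample, '$1' + $targetEncoding, 1)
$result
```

Only the encoding name in the declaration changes; the rest of the string, including the character reference, passes through untouched.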
I would suggest loading the XML into a System.Xml.Linq.XDocument with the Load method, then changing the Encoding property of its Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx) as needed, or adding a declaration if Declaration is null. Finally, you can use the Save method to save the changed document.
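A minimal sketch of the XDocument approach might look like the following. The function name `Convert-XmlEncoding` and its parameters are hypothetical, and the sketch assumes the whole document fits in memory, since XDocument loads it entirely.

```powershell
# System.Xml.Linq is not necessarily loaded in a fresh Windows PowerShell session
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper. Pass full paths: .NET methods resolve relative paths
# against the process working directory, not PowerShell's current location.
function Convert-XmlEncoding([string]$source, [string]$target, [string]$targetEncoding) {
    # Load decodes the file according to its own XML declaration
    $doc = [System.Xml.Linq.XDocument]::Load($source)

    if ($null -eq $doc.Declaration) {
        # No declaration present: add one (version, encoding, standalone)
        $doc.Declaration = [System.Xml.Linq.XDeclaration]::new(
            '1.0', $targetEncoding, [NullString]::Value)
    } else {
        $doc.Declaration.Encoding = $targetEncoding
    }

    # Save(string) writes the file in the encoding named by Declaration.Encoding
    $doc.Save($target)
}
```

Usage, with the same example paths as above:

```powershell
Convert-XmlEncoding 'r:\1.xml' 'r:\2.xml' 'utf-8'
```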
Source: https://stackoverflow.com/questions/39869353/convert-xml-latin1-to-utf-8-and-other-way-around