I'm working on a project where I need to convert text from an encoding (for example Windows-1256 Arabic) to UTF-8.
How do I do this in Go?
I checked the docs and came up with a way to convert a byte slice to (or from) UTF-8.
What I'm having a hard time with is that, so far, I haven't found an interface that lets me use a locale. Instead, the options seem to be limited to a predefined set of encodings.
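In my case, I needed to convert UTF-16 (really I have UCS-2 data, but it should still work) to UTF-8. To do that, I needed to check for the BOM and then do the conversion:
// imports: "errors", "golang.org/x/text/encoding", "golang.org/x/text/encoding/unicode"
if len(buf) < 2 {
	return errors.New("BOM missing")
}
// Read the first two bytes as a little-endian uint16. Widen to uint16
// first: buf[1] * 256 on plain bytes would overflow to 0.
bom := uint16(buf[0]) | uint16(buf[1])<<8
var enc encoding.Encoding
if bom == 0xFEFF { // bytes FF FE
	enc = unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)
} else if bom == 0xFFFE { // bytes FE FF
	enc = unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
} else {
	return errors.New("BOM missing")
}
// convert UCS-2 (LE or BE) to UTF-8; Bytes returns the result and an error
out, err := enc.NewDecoder().Bytes(buf[2:])
if err != nil {
	return err
}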
It's unfortunate that I have to use IgnoreBOM, since in my case a BOM should be forbidden past the first character. But that's close enough for my situation. These functions were mentioned in a couple of places, but not shown in practice.
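For comparison, the same BOM check can be sketched with only the standard library (encoding/binary and unicode/utf16) instead of golang.org/x/text. This is a swapped-in technique, not what the answer above uses, and the names decodeUTF16WithBOM and errMissingBOM are made up for this sketch. It assumes the whole input is already in memory.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// errMissingBOM is a hypothetical sentinel error used only in this sketch.
var errMissingBOM = errors.New("utf16: missing byte order mark")

// decodeUTF16WithBOM converts BOM-prefixed UTF-16 bytes to a UTF-8 string.
// The BOM selects the byte order and is not included in the result.
func decodeUTF16WithBOM(buf []byte) (string, error) {
	if len(buf) < 2 {
		return "", errMissingBOM
	}
	var order binary.ByteOrder
	switch {
	case buf[0] == 0xFF && buf[1] == 0xFE:
		order = binary.LittleEndian
	case buf[0] == 0xFE && buf[1] == 0xFF:
		order = binary.BigEndian
	default:
		return "", errMissingBOM
	}
	body := buf[2:]
	if len(body)%2 != 0 {
		return "", errors.New("utf16: odd number of bytes")
	}
	// Assemble 16-bit code units in the byte order the BOM announced.
	units := make([]uint16, len(body)/2)
	for i := range units {
		units[i] = order.Uint16(body[2*i:])
	}
	// utf16.Decode pairs up surrogates and substitutes U+FFFD for invalid ones.
	return string(utf16.Decode(units)), nil
}

func main() {
	// Little-endian BOM followed by U+0633 and U+0644.
	s, err := decodeUTF16WithBOM([]byte{0xFF, 0xFE, 0x33, 0x06, 0x44, 0x06})
	fmt.Println(s, err)
}
```

Because the code units are decoded by hand here, this would also be the place to reject a U+FEFF that shows up after the first character, if that matters for your data.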
Use the packages from golang.org/x/text. In your case this would be something like:
b := /* Win1256 bytes here. */
dec := charmap.Windows1256.NewDecoder()
// Take more space just in case some characters need
// more bytes in UTF-8 than in Win1256.
bUTF := make([]byte, len(b)*3)
// atEOF is true because b holds the complete input.
n, _, err := dec.Transform(bUTF, b, true)
if err != nil {
	panic(err)
}
bUTF = bUTF[:n]
You can use the encoding package, which includes support for Windows-1256 via the package golang.org/x/text/encoding/charmap (in the example below, import this package and use charmap.Windows1256 instead of japanese.ShiftJIS).
Here's a short example which encodes a Japanese UTF-8 string to ShiftJIS and then decodes the ShiftJIS string back to UTF-8. Unfortunately it doesn't work on the playground since the playground doesn't have the "x" packages.
package main
import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
	"golang.org/x/text/encoding/japanese"
	"golang.org/x/text/transform"
)
func main() {
	// the string we want to transform
	s := "今日は"
	fmt.Println(s)
	// --- Encoding: convert s from UTF-8 to ShiftJIS
	// declare a bytes.Buffer b and an encoder which will write into this buffer
	var b bytes.Buffer
	wInUTF8 := transform.NewWriter(&b, japanese.ShiftJIS.NewEncoder())
	// encode our string
	if _, err := wInUTF8.Write([]byte(s)); err != nil {
		log.Fatal(err)
	}
	// Close flushes any bytes still buffered by the writer into b
	if err := wInUTF8.Close(); err != nil {
		log.Fatal(err)
	}
	// print the encoded bytes
	fmt.Printf("%#v\n", b.Bytes())
	encS := b.String()
	fmt.Println(encS)
	// --- Decoding: convert encS from ShiftJIS to UTF8
	// declare a decoder which reads from the string we have just encoded
	rInUTF8 := transform.NewReader(strings.NewReader(encS), japanese.ShiftJIS.NewDecoder())
	// decode our string (io.ReadAll replaces the deprecated ioutil.ReadAll since Go 1.16)
	decBytes, err := io.ReadAll(rInUTF8)
	if err != nil {
		log.Fatal(err)
	}
	decS := string(decBytes)
	fmt.Println(decS)
}
There's a more complete example on the Japanese StackOverflow site. The text is Japanese, but the code should be self-explanatory: https://ja.stackoverflow.com/questions/6120
I made a tool for myself; maybe you could borrow some ideas from it :)
https://github.com/gonejack/transcode
This is the key code:
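w := transform.NewWriter(output, targetEncoding.NewEncoder())
_, err = io.Copy(w, transform.NewReader(input, sourceEncoding.NewDecoder()))
if err == nil {
	// Close flushes anything the transform writer is still buffering.
	err = w.Close()
}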