How to convert from an encoding to UTF-8 in Go?

臣服心动 2020-12-30 06:23

I'm working on a project where I need to convert text from an encoding (for example Windows-1256 Arabic) to UTF-8.

How do I do this in Go?

4 Answers
  • 2020-12-30 07:03

    I checked out the docs and came up with a way to convert an array of bytes to (or from) UTF-8.

    What I have a hard time with is that, so far, I have not found an interface that lets me select an encoding by locale. Instead, the choice seems limited to predefined sets of encodings.

    In my case, I needed to convert UTF-16 to UTF-8 (really I have UCS-2 data, but it should still work, since UCS-2 is the subset of UTF-16 without surrogate pairs). To do that, I needed to check for the BOM and then do the conversion:

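    // Assumes len(buf) >= 2, that enc is declared elsewhere as an
    // encoding.Encoding, and that the enclosing function returns an error.
    // Read the first two bytes as a little-endian uint16; the conversions
    // matter, because plain byte arithmetic would overflow.
    bom := uint16(buf[0]) | uint16(buf[1])<<8
    if bom == 0xFEFF {
        enc = unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)
    } else if bom == 0xFFFE {
        enc = unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    } else {
        return errors.New("BOM missing")
    }

    e := enc.NewDecoder()

    // convert UCS-2 (LE or BE) to UTF-8; Bytes also returns an error
    utf8Bytes, err := e.Bytes(buf[2:])
    if err != nil {
        return err
    }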
    

    It's unfortunate that I have to use the "ignore" BOM policy, since in my case a BOM should instead be forbidden past the first character. But that's close enough for my situation. These functions were mentioned in a couple of places, but not shown in practice.
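
    For comparison, the same BOM check can be sketched with the standard library alone. This is only an illustration, not the golang.org/x/text approach described above: decodeUTF16WithBOM is a hypothetical helper name, and the sketch assumes the whole input is already in memory.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16WithBOM is a hypothetical helper: it inspects the BOM,
// picks the matching byte order, and returns the text as a UTF-8 string.
func decodeUTF16WithBOM(buf []byte) (string, error) {
	if len(buf) < 2 {
		return "", errors.New("input too short for a BOM")
	}
	var order binary.ByteOrder
	switch {
	case buf[0] == 0xFF && buf[1] == 0xFE:
		order = binary.LittleEndian
	case buf[0] == 0xFE && buf[1] == 0xFF:
		order = binary.BigEndian
	default:
		return "", errors.New("BOM missing")
	}
	body := buf[2:]
	if len(body)%2 != 0 {
		return "", errors.New("odd number of bytes after the BOM")
	}
	// Assemble the 16-bit code units in the detected byte order.
	units := make([]uint16, len(body)/2)
	for i := range units {
		units[i] = order.Uint16(body[2*i:])
	}
	// utf16.Decode yields runes; converting []rune to string encodes UTF-8.
	return string(utf16.Decode(units)), nil
}

func main() {
	le := []byte{0xFF, 0xFE, 'H', 0x00, 'i', 0x00}
	s, err := decodeUTF16WithBOM(le)
	fmt.Println(s, err)
}
```

    Unlike the decoder above, this sketch has no streaming support, so it only suits small inputs.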

  • 2020-12-30 07:17

    Use the packages from golang.org/x/text (here, golang.org/x/text/encoding/charmap). In your case this would be something like:

    b := /* Win1256 bytes here. */
    dec := charmap.Windows1256.NewDecoder()
    // Allocate extra space: a single Win1256 byte can need
    // up to 3 bytes in UTF-8.
    bUTF := make([]byte, len(b)*3)
    // atEOF is true because b holds the complete input.
    n, _, err := dec.Transform(bUTF, b, true)
    if err != nil {
        panic(err)
    }
    bUTF = bUTF[:n]
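
    To see where the factor of 3 comes from, here is a standard-library-only sketch. It is not how charmap works internally: partialWin1256 and decodeSketch are hypothetical names, and the table is deliberately incomplete, covering just the bytes used in the demo. Real code should keep using charmap.Windows1256.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// partialWin1256 is a deliberately incomplete, hypothetical table covering
// only the bytes used below; real code should rely on charmap.Windows1256.
var partialWin1256 = map[byte]rune{
	0x80: '\u20AC', // EURO SIGN: 3 bytes in UTF-8
	0xC7: '\u0627', // ARABIC LETTER ALEF: 2 bytes in UTF-8
	0xD3: '\u0633', // ARABIC LETTER SEEN
	0xE1: '\u0644', // ARABIC LETTER LAM
	0xE3: '\u0645', // ARABIC LETTER MEEM
}

// decodeSketch maps each input byte to one rune and appends its UTF-8 form.
func decodeSketch(b []byte) []byte {
	out := make([]byte, 0, len(b)*3) // worst case: 3 output bytes per input byte
	for _, c := range b {
		r := rune(c) // bytes below 0x80 are plain ASCII
		if c >= 0x80 {
			var ok bool
			if r, ok = partialWin1256[c]; !ok {
				r = utf8.RuneError // byte not covered by this partial table
			}
		}
		out = utf8.AppendRune(out, r) // requires Go 1.18 or later
	}
	return out
}

func main() {
	in := []byte{0xD3, 0xE1, 0xC7, 0xE3} // "سلام" in Windows-1256
	out := decodeSketch(in)
	fmt.Printf("%s: %d bytes in, %d bytes out\n", out, len(in), len(out))
}
```

    Each Arabic letter takes 2 bytes in UTF-8, while a few characters such as the euro sign take 3, which is why len(b)*3 is a safe upper bound.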
    
  • 2020-12-30 07:24

    You can use the encoding package, which includes support for Windows-1256 via the package golang.org/x/text/encoding/charmap (in the example below, import this package and use charmap.Windows1256 instead of japanese.ShiftJIS).

    Here's a short example which encodes a Japanese UTF-8 string to ShiftJIS encoding and then decodes the ShiftJIS string back to UTF-8. When this was written it didn't work on the playground, since the playground didn't have the "x" packages.

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "log"
        "strings"

        "golang.org/x/text/encoding/japanese"
        "golang.org/x/text/transform"
    )

    func main() {
        // the string we want to transform
        s := "今日は"
        fmt.Println(s)

        // --- Encoding: convert s from UTF-8 to ShiftJIS
        // declare a bytes.Buffer b and an encoder which will write into this buffer
        var b bytes.Buffer
        wInUTF8 := transform.NewWriter(&b, japanese.ShiftJIS.NewEncoder())
        // encode our string
        if _, err := wInUTF8.Write([]byte(s)); err != nil {
            log.Fatal(err)
        }
        // Close flushes any bytes still buffered by the transformer
        if err := wInUTF8.Close(); err != nil {
            log.Fatal(err)
        }
        // print the encoded bytes
        fmt.Printf("%#v\n", b.Bytes())
        encS := b.String()
        fmt.Println(encS)

        // --- Decoding: convert encS from ShiftJIS to UTF8
        // declare a decoder which reads from the string we have just encoded
        rInUTF8 := transform.NewReader(strings.NewReader(encS), japanese.ShiftJIS.NewDecoder())
        // decode our string (io.ReadAll needs Go 1.16+; older versions use ioutil.ReadAll)
        decBytes, err := io.ReadAll(rInUTF8)
        if err != nil {
            log.Fatal(err)
        }
        decS := string(decBytes)
        fmt.Println(decS)
    }
    

    There's a more complete example on the Japanese StackOverflow site. The text is Japanese, but the code should be self-explanatory: https://ja.stackoverflow.com/questions/6120

  • 2020-12-30 07:26

    I made a tool for myself; maybe you could borrow some ideas from it :)

    https://github.com/gonejack/transcode

    This is the key code:

    _, err = io.Copy(
        transform.NewWriter(output, targetEncoding.NewEncoder()),
        transform.NewReader(input, sourceEncoding.NewDecoder()),
    )
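
    The shape of that pipeline can be tried out with the standard library alone. The sketch below is an illustration under stated assumptions, not the tool's code: latin1ToUTF8Writer and transcodeString are hypothetical names, and the writer stands in for transform.NewWriter by handling only ISO-8859-1, where every byte equals its Unicode code point.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// latin1ToUTF8Writer is a hypothetical stand-in for transform.NewWriter:
// it re-encodes ISO-8859-1 bytes as UTF-8 before passing them on to dst.
type latin1ToUTF8Writer struct{ dst io.Writer }

func (w latin1ToUTF8Writer) Write(p []byte) (int, error) {
	// Each Latin-1 byte becomes at most 2 bytes in UTF-8.
	out := make([]byte, 0, len(p)*2)
	for _, c := range p {
		out = utf8.AppendRune(out, rune(c)) // requires Go 1.18 or later
	}
	if _, err := w.dst.Write(out); err != nil {
		return 0, err
	}
	// Report the number of input bytes consumed, as io.Writer requires.
	return len(p), nil
}

// transcodeString streams s through the converting writer with io.Copy.
func transcodeString(s string) (string, error) {
	var out bytes.Buffer
	_, err := io.Copy(latin1ToUTF8Writer{&out}, strings.NewReader(s))
	return out.String(), err
}

func main() {
	got, err := transcodeString("caf\xe9") // 0xE9 is "é" in ISO-8859-1
	fmt.Println(got, err)
}
```

    This stand-in keeps no state between writes, so it needs no Close. The writer returned by transform.NewWriter does buffer, so it should be closed after io.Copy returns to flush any remaining bytes.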
    