Does accessing elements of string as byte perform conversion?

囚心锁ツ 2020-12-21 16:58

In Go, to access elements of a string, we can write:

str := "text"
for i, c := range str {
  // str[i] is of type byte
  // c is of type rune
}

2 Answers
  • 2020-12-21 17:46

    Which one of the following methods are better performance-wise?

    Definitely not this.

    str := "large text"
    str2 := []byte(str)
    for _, s := range str2 {
      // use s
    }
    

    Strings are immutable. []byte is mutable. That means []byte(str) makes a copy. So the above will copy the entire string. I've found being unaware of when strings are copied to be a major source of performance problems for large strings.
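
    As a minimal sketch of why the conversion has to copy, the following program modifies the converted slice and shows that the original string is unaffected. The helper name mutateCopy is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// mutateCopy (hypothetical helper) converts s to a byte slice, overwrites
// the first byte of that slice, and returns the untouched original string
// together with the modified copy converted back to a string.
func mutateCopy(s string) (original, modified string) {
	b := []byte(s) // the conversion yields a slice with its own backing array
	if len(b) > 0 {
		b[0] = 'X' // writes to the copy only; s stays immutable
	}
	return s, string(b)
}

func main() {
	original, modified := mutateCopy("large text")
	fmt.Println(original) // large text
	fmt.Println(modified) // Xarge text
}
```

    Because the slice can be written to, the language has to guarantee that such writes never reach the string, which is why a copy is the default behaviour.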

    If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to write the above like so, which ensures the byte slice can never be altered:

    str := "large text"
    for _, s := range []byte(str) {
      // use s
    }
    

    That way there's no str2 to possibly be modified later and ruin the optimization.

    But this is a bad idea because it will corrupt any multi-byte characters. See below.


    As for the byte/rune conversion, performance is not a consideration as they are not equivalent. c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.

    For example...

    package main
    
    import (
        "fmt"
    )
    
    func main() {
        str := "snow ☃ man"
        for i, c := range str {
            fmt.Printf("c:%c str[i]:%c\n", c, str[i])
        }
    }
    
    $ go run ~/tmp/test.go
    c:s str[i]:s
    c:n str[i]:n
    c:o str[i]:o
    c:w str[i]:w
    c:  str[i]: 
    c:☃ str[i]:â
    c:  str[i]: 
    c:m str[i]:m
    c:a str[i]:a
    c:n str[i]:n
    

    Note that using str[i] corrupts the multi-byte Unicode snowman: it holds only the first byte of the multi-byte character.
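
    If you only have a byte index and need the whole character that starts there, one option is utf8.DecodeRuneInString from the standard library. This is a small sketch; the helper name runeAt is hypothetical, and it assumes i is the index of the first byte of a character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt (hypothetical helper) decodes the UTF-8 sequence that starts at
// byte index i of s and returns the rune and its width in bytes.
// It assumes i points at the first byte of a character.
func runeAt(s string, i int) (rune, int) {
	return utf8.DecodeRuneInString(s[i:])
}

func main() {
	str := "snow ☃ man"
	r, size := runeAt(str, 5)
	// str[5] is only the first of the snowman's three bytes.
	fmt.Printf("str[5]=%#x rune=%c size=%d\n", str[5], r, size)
	// Output: str[5]=0xe2 rune=☃ size=3
}
```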

    There's no performance difference anyway, since range str already has to do the work of going character by character rather than byte by byte.

  • 2020-12-21 17:51

    string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.

    Indexing a string indexes its bytes: str[i] is of type byte (an alias for uint8). A string is in effect a read-only slice of bytes (with some syntactic sugar), so indexing it does not require converting it to a slice.
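
    A short sketch of the difference between bytes and runes: len counts bytes, while utf8.RuneCountInString counts code points. The helper name byteAndRuneCounts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (hypothetical helper) returns the number of bytes and
// the number of runes (code points) in s.
func byteAndRuneCounts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hi 世界"
	nBytes, nRunes := byteAndRuneCounts(s)
	fmt.Println(nBytes, nRunes) // 9 5

	// Indexing works directly on the string and yields a byte;
	// no conversion to []byte is involved.
	var b byte = s[3]
	fmt.Printf("%#x\n", b) // 0xe4, the first byte of 世
}
```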

    When you use for ... range on a string, that iterates over the runes of the string, not its bytes!

    So if you want to iterate over the runes (characters), you must use a for ... range on the string itself, without a conversion to []byte, as the first form will not work with string values containing multi-byte UTF-8 characters. The spec allows you to for ... range on a string value: the first iteration value is the byte index of the current character, and the second is the current character value, of type rune (which is an alias for int32):

    For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

    Simple example:

    s := "Hi 世界"
    for i, c := range s {
        fmt.Printf("Char pos: %d, Char: %c\n", i, c)
    }
    

    Output (try it on the Go Playground):

    Char pos: 0, Char: H
    Char pos: 1, Char: i
    Char pos: 2, Char:  
    Char pos: 3, Char: 世
    Char pos: 6, Char: 界
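
    The invalid-UTF-8 case from the spec excerpt above can be observed the same way. This is a minimal sketch; describeRange is a hypothetical helper, and the input "a\xffb" is just an arbitrary string with one invalid byte in the middle.

```go
package main

import (
	"fmt"
	"strings"
)

// describeRange (hypothetical helper) ranges over s and records each
// byte index together with the rune reported at that index.
func describeRange(s string) string {
	var parts []string
	for i, r := range s {
		parts = append(parts, fmt.Sprintf("%d:%U", i, r))
	}
	return strings.Join(parts, " ")
}

func main() {
	// "\xff" is a single byte that never occurs in valid UTF-8.
	fmt.Println(describeRange("a\xffb"))
	// Output: 0:U+0061 1:U+FFFD 2:U+0062
}
```

    The invalid byte is reported as U+FFFD and the index advances by exactly one byte, so iteration resumes at 'b'.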
    

    A must-read blog post for you:

    The Go Blog: Strings, bytes, runes and characters in Go


    Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range with a converted string, as in your second example, does not make a copy; the conversion is optimized away. For details, see golang: []byte(string) vs []byte(*string).
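
    If you would rather not depend on that optimization, a plain index loop visits every byte without any conversion at all. A small sketch follows; the helper name countByte is hypothetical.

```go
package main

import "fmt"

// countByte (hypothetical helper) counts how many bytes of s equal target.
// It indexes the string directly, so no []byte conversion takes place.
func countByte(s string, target byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == target { // s[i] is of type byte
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte("snow ☃ man", 'n')) // 2
}
```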
