Does accessing elements of string as byte perform conversion?

风流意气都作罢 submitted on 2019-11-28 09:48:21

Question


In Go, to access elements of a string, we can write:

str := "text"
for i, c := range str {
  // str[i] is of type byte
  // c is of type rune
}

When accessing str[i], does Go perform a conversion from rune to byte? I would guess the answer is yes, but I am not sure. If so, which of the following methods performs better? Is one preferred over the other (in terms of best practice, for example)?

str := "large text"
for i := range str {
  // use str[i]
}

or

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}

Answer 1:


Which of the following methods performs better?

Definitely not this.

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}

Strings are immutable. []byte is mutable. That means []byte(str) makes a copy. So the above will copy the entire string. I've found being unaware of when strings are copied to be a major source of performance problems for large strings.
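To make the copy semantics described above concrete, here is a minimal sketch. The helper upperFirstByte is hypothetical and exists only for illustration; it shows that changing the converted byte slice leaves the original string untouched, which is why the conversion has to copy in the general case.

```go
package main

import "fmt"

// upperFirstByte is a hypothetical helper for illustration only.
// It converts s to a byte slice and upper-cases the first byte if it is
// an ASCII lowercase letter.
func upperFirstByte(s string) []byte {
	b := []byte(s) // new backing array holding a copy of the bytes of s
	if len(b) > 0 && b[0] >= 'a' && b[0] <= 'z' {
		b[0] -= 'a' - 'A' // mutates the copy, not the string
	}
	return b
}

func main() {
	str := "large text"
	b := upperFirstByte(str)
	fmt.Println(str)       // large text
	fmt.Println(string(b)) // Large text
}
```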

If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to write the loop without the intermediate variable, like so, to ensure the byte slice is never altered.

str := "large text"
for _, s := range []byte(str) {
  // use s
}

That way there's no str2 to possibly be modified later and ruin the optimization.

But this is a bad idea because it will corrupt any multi-byte characters. See below.


As for the byte/rune conversion, performance is not a consideration as they are not equivalent. c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.

For example...

package main

import (
    "fmt"
)

func main() {
    str := "snow ☃ man"
    for i, c := range str {
        fmt.Printf("c:%c str[i]:%c\n", c, str[i])
    }
}

$ go run ~/tmp/test.go
c:s str[i]:s
c:n str[i]:n
c:o str[i]:o
c:w str[i]:w
c:  str[i]: 
c:☃ str[i]:â
c:  str[i]: 
c:m str[i]:m
c:a str[i]:a
c:n str[i]:n

Note that using str[i] corrupts the multi-byte Unicode snowman: it only contains the first byte of the multi-byte character.

There's no performance difference anyway, as range str must already do the work of going character by character, not byte by byte.
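As a rough illustration of why the two views differ, the sketch below compares the byte length of the snowman string with its rune count and dumps the bytes that make up the snowman. The helper name counts is an assumption made for this example; len and utf8.RuneCountInString are standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and the number of runes in s.
// The name is hypothetical; it only wraps two standard calls.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	str := "snow ☃ man"
	nb, nr := counts(str)
	fmt.Println(nb, nr) // 12 10

	// The snowman (U+2603) occupies three bytes in UTF-8.
	snowman := "☃"
	for i := 0; i < len(snowman); i++ {
		fmt.Printf("%x ", snowman[i]) // e2 98 83
	}
	fmt.Println()
}
```

The first of those three bytes, 0xE2, is what str[i] returns at the snowman's position, which is why %c printed â in the output above.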




Answer 2:


string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.

Indexing a string indexes its bytes: str[i] is of type byte (or uint8; byte is an alias for it). Also, a string is in effect a read-only slice of bytes (with some syntactic sugar). Indexing a string does not require converting it to a slice.
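A small sketch can confirm the element types involved. The helper describe is hypothetical and assumes a non-empty string; it uses the %T verb, which prints the underlying type names uint8 and int32 because byte and rune are aliases.

```go
package main

import "fmt"

// describe returns the Go type names of s[0] and of the first value
// yielded by ranging over s. It is a hypothetical helper and assumes
// that s is not empty.
func describe(s string) (byteType, runeType string) {
	byteType = fmt.Sprintf("%T", s[0]) // plain indexing, no conversion
	for _, c := range s {
		runeType = fmt.Sprintf("%T", c)
		break // the first rune is enough
	}
	return byteType, runeType
}

func main() {
	b, r := describe("text")
	fmt.Println(b, r) // uint8 int32
}
```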

When you use for ... range on a string, that iterates over the runes of the string, not its bytes!

So if you want to iterate over the runes (characters), you must use for ... range on the string itself, without a conversion to []byte, as byte-wise access will not work with string values containing multi-byte (UTF-8 encoded) characters. The spec allows for ... range on a string value: the first iteration value is the byte index of the current character, and the second is the current character itself, of type rune (which is an alias for int32):

For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
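The invalid-sequence rule quoted above can be observed with a short sketch. The helper decode and the input "a\xffb" are assumptions chosen for illustration; 0xFF can never appear in valid UTF-8, so the middle byte is reported as U+FFFD and the loop advances by exactly one byte.

```go
package main

import "fmt"

// decode collects the byte indexes and runes produced by ranging over s.
// It is a hypothetical helper written for this example.
func decode(s string) (idx []int, runes []rune) {
	for i, c := range s {
		idx = append(idx, i)
		runes = append(runes, c)
	}
	return idx, runes
}

func main() {
	// "\xff" is a single byte that is not valid UTF-8.
	idx, runes := decode("a\xffb")
	for k := range idx {
		fmt.Printf("%d %U\n", idx[k], runes[k])
	}
	// Prints:
	// 0 U+0061
	// 1 U+FFFD
	// 2 U+0062
}
```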

Simple example:

s := "Hi 世界"
for i, c := range s {
    fmt.Printf("Char pos: %d, Char: %c\n", i, c)
}

Output (try it on the Go Playground):

Char pos: 0, Char: H
Char pos: 1, Char: i
Char pos: 2, Char:  
Char pos: 3, Char: 世
Char pos: 6, Char: 界

A must-read blog post on this topic:

The Go Blog: Strings, bytes, runes and characters in Go


Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range with a converted string like your second example does not make a copy; the compiler optimizes it away. For details, see golang: []byte(string) vs []byte(*string).
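If you would rather not depend on that optimization, one alternative is a plain index loop, which reads the bytes directly from the string with no conversion at all. The sketch below uses a hypothetical helper, countByte, purely as an example of byte-wise iteration.

```go
package main

import "fmt"

// countByte counts how many bytes of s equal target. It is a
// hypothetical helper that iterates over bytes, not runes.
func countByte(s string, target byte) int {
	n := 0
	for i := 0; i < len(s); i++ { // len(s) is the length in bytes
		if s[i] == target {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte("snow ☃ man", ' ')) // 2
}
```

This is only appropriate when the logic really is byte-oriented, for example when searching for an ASCII delimiter; for anything character-oriented, for ... range over the string remains the right tool.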



Source: https://stackoverflow.com/questions/44487910/does-accessing-elements-of-string-as-byte-perform-conversion
