In Go, to access the elements of a string, we can write:
str := "text"
for i, c := range str {
	// str[i] is of type byte
	// c is of type rune
}
Which one of the following methods is better performance-wise?
Definitely not this.
str := "large text"
str2 := []byte(str)
for _, s := range str2 {
	// use s
}
Strings are immutable; []byte is mutable. That means []byte(str) makes a copy, so the code above copies the entire string. In my experience, being unaware of when strings are copied is a major source of performance problems with large strings.
If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to write the above like so, to ensure the byte slice is never altered.
str := "large text"
for _, s := range []byte(str) {
	// use s
}
That way there's no str2 to possibly be modified later and ruin the optimization.
But this is a bad idea because it will corrupt any multi-byte characters. See below.
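To make the copy semantics described above concrete, here is a minimal sketch. The helper name mutateCopy is hypothetical and only exists for illustration; it assumes a non-empty string. Because the slice is modified, the conversion has to produce an independent copy, and the original string stays untouched.

```go
package main

import "fmt"

// mutateCopy is a hypothetical helper: it converts s to a byte slice,
// changes the first byte of the slice, and returns the original string
// alongside the modified copy so the two can be compared.
// It assumes s is not empty.
func mutateCopy(s string) (string, string) {
	b := []byte(s) // a copy is required here, because b is modified below
	b[0] = 'L'
	return s, string(b)
}

func main() {
	orig, changed := mutateCopy("large text")
	fmt.Println(orig)    // large text
	fmt.Println(changed) // Large text
}
```

The string keeps its original contents, which is only possible if the slice has its own backing storage.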
As for the byte/rune question, performance is not a consideration because they are not equivalent: c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.
For example...
package main
import (
	"fmt"
)
func main() {
	str := "snow ☃ man"
	for i, c := range str {
		fmt.Printf("c:%c str[i]:%c\n", c, str[i])
	}
}
$ go run ~/tmp/test.go
c:s str[i]:s
c:n str[i]:n
c:o str[i]:o
c:w str[i]:w
c: str[i]:
c:☃ str[i]:â
c: str[i]:
c:m str[i]:m
c:a str[i]:a
c:n str[i]:n
Note that using str[i] corrupts the multi-byte Unicode snowman: it only contains the first byte of the multi-byte character.
There's no performance difference anyway, as range str must already do the work of going character by character, not byte by byte.
string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.
Indexing a string indexes its bytes: str[i] is of type byte (or uint8, it's an alias). Also, a string is in effect a read-only slice of bytes (with some syntactic sugar). Indexing a string does not require converting it to a slice.
When you use for ... range on a string, it iterates over the runes of the string, not its bytes!
So if you want to iterate over the runes (characters), you must use a for ... range but without a conversion to []byte, as the first form will not work with string values containing multi-byte (UTF-8) characters.
The spec allows you to for ... range on a string value; the 1st iteration value will be the byte index of the current character, and the 2nd value will be the current character value of type rune (which is an alias for int32):
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
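The invalid-sequence rule in the quoted passage can be checked with a short sketch. The helper name runeStarts is hypothetical; the input deliberately contains a stray 0xff byte, which is not valid UTF-8.

```go
package main

import "fmt"

// runeStarts is a hypothetical helper that records the byte index and the
// rune produced by each iteration of for ... range over s.
func runeStarts(s string) ([]int, []rune) {
	var idx []int
	var rs []rune
	for i, c := range s {
		idx = append(idx, i)
		rs = append(rs, c)
	}
	return idx, rs
}

func main() {
	// "a", then a stray 0xff byte (invalid UTF-8), then "b".
	idx, rs := runeStarts("a\xffb")
	fmt.Println(idx)               // [0 1 2]
	fmt.Println(rs[1] == '\uFFFD') // true
}
```

The invalid byte yields the replacement character and the loop advances by exactly one byte, so the following "b" is still found at index 2.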
Simple example:
s := "Hi 世界"
for i, c := range s {
	fmt.Printf("Char pos: %d, Char: %c\n", i, c)
}
Output (try it on the Go Playground):
Char pos: 0, Char: H
Char pos: 1, Char: i
Char pos: 2, Char:
Char pos: 3, Char: 世
Char pos: 6, Char: 界
A must-read blog post for you:
The Go Blog: Strings, bytes, runes and characters in Go
Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range with a converted string like your second example does not make a copy; it's optimized away. For details, see golang: []byte(string) vs []byte(*string).
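If you would rather not depend on that optimization, one alternative is to index the string directly, since indexing needs no conversion at all. This is a minimal sketch; the helper name countByte is hypothetical.

```go
package main

import "fmt"

// countByte is a hypothetical helper: it counts occurrences of b in s by
// indexing the string directly, with no []byte conversion involved.
func countByte(s string, b byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == b {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countByte("large text", 'e')) // 2
}
```

This form only makes sense when you really want bytes; for characters, range over the string itself.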