I come from a Python/Ruby/JavaScript background. I understand how pointers work; however, I'm not completely sure how to leverage them in the following situation.
Will the [...] be garbage collected correctly?
Yes.
You never need to worry that something will be collected which is still in use and you can rely on everything being collected once it is no longer used.
So the question about GC is never "Will it be collected correctly?" but "Do I generate unnecessary garbage?". That question depends less on the data structure than on the number of new objects created on the heap, so it is about how the data structures are used and much less about the structure itself. Use benchmarks and run go test with -benchmem.
(High end performance might also consider how much work the GC has to do: Scanning pointers might take time. Forget that for now.)
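As a rough illustration of measuring garbage instead of guessing, here is a minimal sketch. The functions buildValues and buildPointers are hypothetical stand-ins, not code from the question. In a real project the benchmarks would live in a _test.go file and be run with go test -bench=. -benchmem; testing.Benchmark is used here only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"testing"
)

// buildValues and buildPointers are hypothetical stand-ins for two ways
// of collecting the same data: by string value or by *string.
func buildValues(urls []string) []string {
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		out = append(out, u) // copies only the string header
	}
	return out
}

func buildPointers(urls []string) []*string {
	out := make([]*string, 0, len(urls))
	for _, u := range urls {
		u := u // one variable per element; it escapes to the heap
		out = append(out, &u)
	}
	return out
}

// Package-level sinks keep the compiler from discarding the results.
var (
	sinkValues   []string
	sinkPointers []*string
)

func main() {
	urls := make([]string, 1000)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://example.com/%d.jpg", i)
	}

	rv := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkValues = buildValues(urls)
		}
	})
	rp := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkPointers = buildPointers(urls)
		}
	})

	fmt.Printf("values:   %d allocs/op, %d B/op\n", rv.AllocsPerOp(), rv.AllocedBytesPerOp())
	fmt.Printf("pointers: %d allocs/op, %d B/op\n", rp.AllocsPerOp(), rp.AllocedBytesPerOp())
}
```

The exact numbers depend on the Go version and platform, which is the point: measure on your own data before deciding.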
The other relevant question is about memory consumption. Copying a string copies just two words (a data pointer and a length) while copying a *string copies one word. So there is not much to save here by using *string.
So unfortunately there are no clear answers to the relevant questions (amount of garbage generated and total memory consumption). Don't overthink the problem, use what fits your purpose, measure and refactor.
Foreword: I released the presented string pool in my github.com/icza/gox library, see stringsx.Pool.
First some background. string values in Go are represented by a small struct-like data structure, reflect.StringHeader:
type StringHeader struct {
	Data uintptr
	Len  int
}
So passing or copying a string value passes or copies this small struct value, which is only 2 words regardless of the length of the string. On 64-bit architectures that is only 16 bytes, even if the string has a thousand characters.
So string values already act like pointers. Introducing another pointer like *string just complicates usage, and you won't save any noticeable memory. For the sake of memory optimization, forget about using *string.
It works and my first question is what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow and the rest of the result will be garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?
If you have a pointer value pointing to a field of a struct value, then the whole struct will be kept in memory; it can't be garbage collected. Although it would be possible in principle to release the memory reserved for the other fields of the struct, the current Go runtime and garbage collector do not do so. So to achieve optimal memory usage, you should forget about storing addresses of struct fields (unless you also need the complete struct values; even then, storing field addresses and slice/array element addresses always requires care).
The reason is that memory for a struct value is allocated as one contiguous segment, so keeping only a single referenced field would strongly fragment the available free memory and would make optimal memory management even harder and less efficient. Defragmenting such areas would also require copying the referenced field's memory area, which would require "live-changing" pointer values (changing memory addresses).
So while using pointers to string values may save you a tiny amount of memory, the added complexity and additional indirections make it not worth it.
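The difference between the two approaches can be sketched as follows. The Image type and its Payload field are assumptions made for illustration; they only stand in for whatever else the decoded result carries.

```go
package main

import "fmt"

// Image is a hypothetical result type. Payload stands in for the rest
// of the decoded data that we would like the GC to reclaim.
type Image struct {
	URL     string
	Payload [1 << 16]byte
}

// urlPointers stores the addresses of struct fields. Each *string is an
// interior pointer, so every Image stays reachable for as long as the
// returned slice is reachable.
func urlPointers(images []*Image) []*string {
	out := make([]*string, len(images))
	for i, img := range images {
		out[i] = &img.URL
	}
	return out
}

// urlValues copies the string values (two words each). The string bytes
// are a separate allocation, so the Image structs can be collected once
// nothing else refers to them.
func urlValues(images []*Image) []string {
	out := make([]string, len(images))
	for i, img := range images {
		out[i] = img.URL
	}
	return out
}

func main() {
	images := []*Image{
		{URL: "https://example.com/a.jpg"},
		{URL: "https://example.com/b.jpg"},
	}
	ptrs := urlPointers(images)
	vals := urlValues(images)
	fmt.Println(*ptrs[0], vals[0])
}
```

Both functions give access to the same URLs; only the second lets the rest of each Image go away.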
So what to do then?
So the cleanest way is to keep using string values.
And there is one more optimization we didn't talk about earlier.
You get your results by unmarshaling a JSON API response. This means that if the same URL or tag value is included multiple times in the JSON response, a separate string value will be created for each occurrence.
What does this mean? If you have the same URL twice in the JSON response, after unmarshaling you will have 2 distinct string values containing 2 different pointers to 2 separately allocated byte sequences (whose content is otherwise identical). The encoding/json package does not do string interning.
Here's a little app that proves this:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
	panic(err)
}
for i := range s {
	hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
	fmt.Println(hdr.Data)
}
Output of the above (try it on the Go Playground):
273760312
273760315
273760320
We see 3 different pointers. They could be the same, as string values are immutable.
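As a side note, reflect.StringHeader is marked deprecated in recent Go releases. A minimal sketch of the same check, assuming Go 1.20 or newer where unsafe.StringData is available, could look like this (the helper name dataPtr is made up here):

```go
package main

import (
	"encoding/json"
	"fmt"
	"unsafe"
)

// dataPtr returns the address of the first content byte of s.
// The result is unspecified for the empty string.
func dataPtr(s string) *byte {
	return unsafe.StringData(s)
}

func main() {
	var s []string
	if err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s); err != nil {
		panic(err)
	}
	// Addresses vary from run to run, so no output is shown for these.
	for _, v := range s {
		fmt.Printf("%p %s\n", dataPtr(v), v)
	}

	// Copying a string value copies the header, not the content,
	// so the copy points at the same bytes.
	c := s[0]
	fmt.Println(dataPtr(c) == dataPtr(s[0])) // true
}
```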
The json package does not detect repeating string values because the detection adds memory and computational overhead, which is usually unwanted. But in our case we are aiming for optimal memory usage, so an additional up-front computation is worth the big memory gain.
So let's do our own string interning. How to do that?
After unmarshaling the JSON result, while building the tagToUrlMap map, let's keep track of the string values we have come across, and if a subsequent string value has been seen earlier, just use that earlier value (its string descriptor).
Here's a very simple string interner implementation:
var cache = map[string]string{}

func interned(s string) string {
	if s2, ok := cache[s]; ok {
		return s2
	}
	// New string, store it
	cache[s] = s
	return s
}
Let's test this "interner" in the example code above:
var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
	panic(err)
}
for i := range s {
	hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
	fmt.Println(hdr.Data, s[i])
}
for i := range s {
	s[i] = interned(s[i])
}
for i := range s {
	hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
	fmt.Println(hdr.Data, s[i])
}
Output of the above (try it on the Go Playground):
273760312 abc
273760315 abc
273760320 abc
273760312 abc
273760312 abc
273760312 abc
Wonderful! As we can see, after using our interned() function, only a single instance of the "abc" string is used in our data structure (which is actually the first occurrence). This means all other instances (given no one else uses them) can be, and will be, garbage collected at some point in the future.
One thing not to forget here: the string interner uses a cache map which stores all previously encountered string values. So to let those strings go, you should "clear" this cache map too, most simply by assigning a nil value to it.
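An alternative to a package-level cache that has to be cleared by hand is to scope the interner to the build step, so the map becomes garbage as soon as that step returns. The following is only a sketch under that assumption; the interner type and the dedupe function are hypothetical names, and the type is not safe for concurrent use.

```go
package main

import "fmt"

// interner is a hypothetical, non-concurrency-safe string pool.
type interner map[string]string

// intern returns the first string value seen with the same content as s.
func (in interner) intern(s string) string {
	if s2, ok := in[s]; ok {
		return s2
	}
	// New string, store it
	in[s] = s
	return s
}

// dedupe replaces each element with its interned value and returns the
// number of distinct strings. The pool is local, so it becomes
// unreachable (and collectable) when dedupe returns.
func dedupe(values []string) int {
	pool := interner{}
	for i, v := range values {
		values[i] = pool.intern(v)
	}
	return len(pool)
}

func main() {
	// Build the strings at run time so that equal contents start out
	// as separate allocations.
	values := []string{
		string([]byte("abc")),
		string([]byte("xyz")),
		string([]byte("abc")),
	}
	fmt.Println(dedupe(values)) // 2
	fmt.Println(values)         // [abc xyz abc]
}
```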
Without further ado, let's see our solution:
result := searchImages()
tagToUrlMap := make(map[string][]string)
for _, image := range result {
	imageURL := interned(image.URL)
	for _, tag := range image.Tags {
		tagName := interned(tag.Name)
		tagToUrlMap[tagName] = append(tagToUrlMap[tagName], imageURL)
	}
}
// Clear the interner cache:
cache = nil
To verify the results:
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
if err := enc.Encode(tagToUrlMap); err != nil {
	panic(err)
}
Output is (try it on the Go Playground):
{
"blue": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"bridge": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"forest": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"ocean": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
],
"river": [
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
],
"water": [
"https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
"https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
]
}
We used the builtin append() function to add new image URLs to tags. append() may (and usually does) allocate bigger slices than needed (thinking of future growth). After our "build" process, we may go through our tagToUrlMap map and "trim" those slices to the minimum needed.
This is how it could be done:
for tagName, urls := range tagToUrlMap {
	if cap(urls) > len(urls) {
		urls2 := make([]string, len(urls))
		copy(urls2, urls)
		tagToUrlMap[tagName] = urls2
	}
}
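To see whether trimming is worth it for your data, you can measure the unused capacity before and after. This is a small sketch; the helpers spare and trim are names introduced here, and trim simply wraps the loop above. Assigning to existing keys while ranging over a map is safe in Go.

```go
package main

import "fmt"

// spare returns the total unused capacity across all slices in m.
func spare(m map[string][]string) int {
	n := 0
	for _, urls := range m {
		n += cap(urls) - len(urls)
	}
	return n
}

// trim replaces every slice that has spare capacity with an exact-size copy.
func trim(m map[string][]string) {
	for tagName, urls := range m {
		if cap(urls) > len(urls) {
			urls2 := make([]string, len(urls))
			copy(urls2, urls)
			m[tagName] = urls2
		}
	}
}

func main() {
	m := map[string][]string{}
	for i := 0; i < 5; i++ {
		m["water"] = append(m["water"], fmt.Sprintf("https://example.com/%d.jpg", i))
	}
	// The growth strategy of append is an implementation detail,
	// so the first number may differ between Go versions.
	fmt.Println("spare before:", spare(m))
	trim(m)
	fmt.Println("spare after: ", spare(m)) // spare after:  0
}
```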