Efficient appending to a variable-length container of strings (Golang)

逝去的感伤 2021-01-18 13:27

The problem:

I need to apply multiple regexes to each line of a big log file (several GB long), gather the non-empty matches, and put them all in an array (for serialization).

2 Answers
  • 2021-01-18 13:36

    append()'s average (amortized) cost is already O(1) because it grows the array by a percentage each time. As the array gets larger, growing it gets costlier but proportionately rarer. A 10M-item slice will be 10x more expensive to grow than a 1M-item slice, but since the extra capacity we're allocating is proportional to the size, it'll also be 10x as many append(slice, item) calls until the next time it grows. The increasing cost and decreasing frequency of reallocations cancel out, leaving the average cost constant, i.e., O(1).
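
    A quick way to watch that growth pattern, purely for illustration (not part of the original benchmark):

    package main
    
    import "fmt"
    
    func main() {
        // Print len/cap every time append has to reallocate; capacity grows
        // geometrically, so reallocations get rarer as the slice gets bigger.
        var s []int
        prevCap := -1
        for i := 0; i < 1000000; i++ {
            s = append(s, i)
            if cap(s) != prevCap {
                fmt.Println("len:", len(s), "cap:", cap(s))
                prevCap = cap(s)
            }
        }
    }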

    The same idea applies to other languages' dynamically-sized arrays, too: Microsoft's std::vector implementation apparently grows the array by 50% each time, for example. Amortized O(1) doesn't mean you pay nothing for allocations, only that you continue paying at the same average rate as your array gets bigger.

    On my laptop, I could run a million slice = append(slice, someStaticString)s in 77ms. One reason it's quick, which siritinga noted below, is that "copying" the string to enlarge the array is really just copying the string header (a pointer/length pair), not copying the contents. 100,000 string headers is still under 2MB to copy, not a big deal compared to the other quantities of data you're working with.
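
    You can check the header size directly; on a 64-bit machine this prints 16 no matter how long the string's contents are:

    package main
    
    import (
        "fmt"
        "unsafe"
    )
    
    func main() {
        // A string header is just a pointer plus a length: 16 bytes on 64-bit,
        // regardless of how many bytes the string itself holds.
        s := "a very long log line ..."
        fmt.Println(unsafe.Sizeof(s)) // 16
    }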

    container/list turned out ~3x slower for me in a microbenchmark; linked-list append is also constant time, of course, but I imagine append has a lower constant because it can usually just write to a couple of words of memory and not allocate a list item, etc. The timing code won't work in the Playground, but you can copy this locally and run it to see for yourself: http://play.golang.org/p/uYyMScmOjX

    Sometimes you can usefully pre-allocate space to avoid reallocations/copies (in this example, using make([]string, 0, 1000000) takes the runtime from ~77ms to ~10ms), but often you just don't have enough info about the expected data size and so on to eke out worthwhile gains, and you're better off leaving it to the built-in algorithm.
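
    Here's a rough sketch of that kind of comparison (the string and the count are arbitrary; exact times will vary by machine):

    package main
    
    import (
        "fmt"
        "time"
    )
    
    func main() {
        const n = 1000000
        s := "some static string"
    
        // Let append grow the slice on its own.
        start := time.Now()
        var grown []string
        for i := 0; i < n; i++ {
            grown = append(grown, s)
        }
        fmt.Println("default growth:", time.Since(start), len(grown))
    
        // Reserve the capacity up front with make([]string, 0, n).
        start = time.Now()
        prealloc := make([]string, 0, n)
        for i := 0; i < n; i++ {
            prealloc = append(prealloc, s)
        }
        fmt.Println("pre-allocated: ", time.Since(start), len(prealloc))
    }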


    But you're asking a more specific question here about a grep-like application (and thanks for asking a detailed question with context). For that, the bottom-line recommendation is that if you're searching gigs of logs, it's probably best to avoid buffering the whole output in RAM at all.

    You could write something to stream the results as a single function: logparser.Grep(in io.Reader, out io.Writer, patterns []regexp.Regexp); you could alternatively make out a chan []byte or func(match []byte) (err error) if you don't want the code that sends results to be too enmeshed with the grep code.

    (On []byte vs. string: a []byte seems to do the job here and avoids []byte<=>string conversions when you do I/O, so I'd prefer that. I don't know what all you're doing, though, and if you need string it's fine.)
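
    A minimal sketch of what that streaming version could look like (the Grep signature follows the suggestion above, except it takes []*regexp.Regexp since that's what regexp.MustCompile returns; the main function and the "ERROR" pattern are just placeholders):

    package main
    
    import (
        "bufio"
        "fmt"
        "io"
        "os"
        "regexp"
    )
    
    // Grep streams matches to out instead of buffering them all in memory.
    func Grep(in io.Reader, out io.Writer, patterns []*regexp.Regexp) error {
        scanner := bufio.NewScanner(in)
        for scanner.Scan() {
            line := scanner.Bytes()
            for _, p := range patterns {
                for _, m := range p.FindAll(line, -1) {
                    if _, err := fmt.Fprintf(out, "%s\n", m); err != nil {
                        return err
                    }
                }
            }
        }
        return scanner.Err()
    }
    
    func main() {
        patterns := []*regexp.Regexp{regexp.MustCompile("ERROR")}
        if err := Grep(os.Stdin, os.Stdout, patterns); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }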

    If you do keep the whole match list in RAM, be aware that keeping around a reference to part of a big string or byte slice keeps the whole source string/slice from being garbage collected. So if you go that route, then counterintuitively you may actually want to copy matches to avoid keeping all of the source log data in RAM.
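
    To illustrate with a made-up line: the sub-slice keeps the whole line's backing array alive, while an explicit copy doesn't:

    package main
    
    import "fmt"
    
    func main() {
        // Pretend this is one scanned line from a huge log file.
        line := []byte("2021-01-18 13:27:00 ERROR something went wrong")
    
        sub := line[20:25]                    // shares line's backing array, keeping it alive
        copied := append([]byte(nil), sub...) // independent copy; line can be collected
    
        fmt.Println(string(sub), string(copied))
    }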

  • 2021-01-18 13:46

    I tried to distill your question into a very simple example.

    Since there can be "hundreds of thousands of regex matches", I did a large initial allocation of 1 M (1024 * 1024) entries for the matches slice capacity. A slice is a reference type. A slice header 'struct' has a length, a capacity, and a pointer, for a total of 24 (3 * 8) bytes on a 64-bit OS. The initial allocation for a slice of 1 M entries is therefore only 24 MB. If there are more than 1 M entries, a new slice with a capacity of 1.25 (1 + 1/4) M entries will be allocated, and the existing 1 M slice header entries (24 MB) will be copied to it.

    In summary, you can avoid much of the overhead of many appends by initially over-allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU-time problem is the time taken to perform the regexp.FindAll calls.

    package main
    
    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
    )
    
    var searches = []*regexp.Regexp{
        regexp.MustCompile("configure"),
        regexp.MustCompile("unknown"),
        regexp.MustCompile("PATH"),
    }
    
    // Pre-allocate capacity for 1 M match entries up front.
    var matches = make([][]byte, 0, 1024*1024)
    
    func main() {
        logName := "config.log"
        log, err := os.Open(logName)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer log.Close()
        scanner := bufio.NewScanner(log)
        for scanner.Scan() {
            line := scanner.Bytes()
            for _, s := range searches {
                for _, m := range s.FindAll(line, -1) {
                    // Copy each match: scanner.Bytes() reuses its buffer on the
                    // next Scan, so the match bytes must be copied before being kept.
                    matches = append(matches, append([]byte(nil), m...))
                }
            }
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
        // Output the match count and the first few matches.
        fmt.Println(len(matches))
        for i, m := range matches {
            fmt.Println(string(m))
            if i >= 16 {
                break
            }
        }
    }
    