How to transform HTML entities via io.Reader

Question

My Go program makes HTTP requests whose response bodies are large JSON documents whose strings encode the ampersand character & as &amp; (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with json.Decoder.

An example response might look like the following:

{"name":"A&amp;B","comment":"foo&amp;bar"}

The corresponding decoded object would be:

pkg.Object{Name:"A&B", Comment:"foo&bar"}

The documents come in various shapes so it's not feasible to convert the HTML entities after decoding. Ideally it would be done by wrapping the response body reader in another reader that performs the transformation.

Is there an easy way to wrap the http.Response.Body in some io.ReadCloser which replaces all instances of &amp; with & (or in the general case, replaces any string X with string Y)?

I suspect this is possible with x/text/transform but don't immediately see how. In particular, I'm concerned about edge cases wherein an entity spans batches of bytes. That is, one batch ends with &am and the next batch starts with p;, for example. Is there some library or idiom that gracefully handles that situation?

Answer 1:

If you don't want to rely on an external package like golang.org/x/text/transform, you can write a custom io.Reader wrapper.

The following handles the edge case where the find sequence spans two Read() calls:

type fixer struct {
    r        io.Reader // source reader
    fnd, rpl []byte    // find & replace sequences
    partial  int       // track partial find matches from previous Read()
}

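// Read satisfies the io.Reader interface.
// It assumes len(f.rpl) <= len(f.fnd), so the replaced data always fits in b,
// and a find sequence that does not overlap itself (true for `&amp;`).
func (f *fixer) Read(b []byte) (int, error) {
    off := f.partial
    if off > 0 && len(b) <= off {
        // b cannot hold the carried-over bytes plus at least one new byte
        return 0, io.ErrShortBuffer
    }
    copy(b, f.fnd[:off]) // restore any partial match held back by the previous `Read`
    f.partial = 0

    n, err := f.r.Read(b[off:])
    n += off

    if err != io.EOF {
        // no need to check for a partial match at EOF, as that is the last Read
        f.partial = partialFind(b[:n], f.fnd)
        n -= f.partial // hold back the partial bytes until the next `Read`
    }

    fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)

    return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}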

Along with this helper (which could probably use some optimization):

// returns number of matched bytes, if byte-slice ends in a partial-match
func partialFind(b, find []byte) int {
    for n := len(find) - 1; n > 0; n-- {
        if bytes.HasSuffix(b, find[:n]) {
            return n
        }
    }
    return 0 // no match
}
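To make the helper's contract concrete, here is a small sketch that prints what partialFind reports for a few inputs when searching for `&amp;`. The helper is repeated so the snippet compiles by itself; the sample inputs are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// partialFind returns how many trailing bytes of b form a proper prefix of
// find, or 0 if b does not end in a partial match.
func partialFind(b, find []byte) int {
	for n := len(find) - 1; n > 0; n-- {
		if bytes.HasSuffix(b, find[:n]) {
			return n
		}
	}
	return 0
}

func main() {
	find := []byte("&amp;")
	// "foo&am" ends in 3 bytes of the entity, "A&" in 1 byte; a complete
	// entity or no ampersand at all counts as no partial match (0).
	for _, s := range []string{"foo&am", "A&", "foo&amp;", "foo"} {
		fmt.Printf("%q -> %d\n", s, partialFind([]byte(s), find))
	}
}
```

A complete match at the end of the buffer yields 0 because only proper prefixes of the find sequence are tested; the full match is left for bytes.ReplaceAll to handle.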

Working playground example.

Note: to test the edge-case logic, one could use a narrowReader that forces short Reads, so that a match is split across Reads, like this: validation playground example
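Since the linked playground code is not reproduced here, the following is only a sketch of what such a check might look like. The narrowReader type and the fixChunked helper are hypothetical names; fixer and partialFind are copied from this answer, with a short-buffer guard added as an assumption of this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// fixer follows the answer's type. It assumes len(rpl) <= len(fnd) and a
// find sequence that does not overlap itself (true for "&amp;").
type fixer struct {
	r        io.Reader
	fnd, rpl []byte
	partial  int
}

func (f *fixer) Read(b []byte) (int, error) {
	off := f.partial
	if off > 0 && len(b) <= off {
		return 0, io.ErrShortBuffer // guard added for this sketch
	}
	copy(b, f.fnd[:off]) // restore bytes held back by the previous Read
	f.partial = 0

	n, err := f.r.Read(b[off:])
	n += off
	if err != io.EOF {
		f.partial = partialFind(b[:n], f.fnd)
		n -= f.partial
	}
	return copy(b, bytes.ReplaceAll(b[:n], f.fnd, f.rpl)), err
}

func partialFind(b, find []byte) int {
	for n := len(find) - 1; n > 0; n-- {
		if bytes.HasSuffix(b, find[:n]) {
			return n
		}
	}
	return 0
}

// narrowReader (hypothetical) hands out at most width bytes per Read, which
// forces entities to be split across Read calls.
type narrowReader struct {
	r     io.Reader
	width int
}

func (nr narrowReader) Read(p []byte) (int, error) {
	if len(p) > nr.width {
		p = p[:nr.width]
	}
	return nr.r.Read(p)
}

// fixChunked pushes in through the fixer in reads of at most width bytes.
func fixChunked(in string, width int) (string, error) {
	src := narrowReader{r: strings.NewReader(in), width: width}
	out, err := io.ReadAll(&fixer{r: src, fnd: []byte("&amp;"), rpl: []byte("&")})
	return string(out), err
}

func main() {
	const in = `{"name":"A&amp;B","comment":"foo&amp;bar"}`
	for width := 1; width <= 6; width++ {
		out, err := fixChunked(in, width)
		fmt.Println(width, out, err)
	}
}
```

With very narrow reads the fixer can return 0 bytes and a nil error while it is holding back a partial match. io.ReadAll tolerates that, but the io.Reader documentation discourages it, so a production version might loop internally until it has at least one byte to return.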

Answer 2:

You need to create a transform.Transformer that replaces your characters.

So we need one that transforms an old []byte to a new []byte while preserving all other data. An implementation could look like this:

type simpleTransformer struct {
    Old, New []byte
}

// Transform replaces each occurrence of `t.Old` with `t.New`.
// The current implementation assumes that len(t.Old) >= len(t.New); it also seems to work when len(t.Old) < len(t.New), but that has not been tested extensively.
func (t *simpleTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    // Get the position of the first occurrence of `t.Old` so we can replace it
    var ci = bytes.Index(src[nSrc:], t.Old)

    // Loop over the slice until we can't find any more occurrences of `t.Old`
    // also make sure we don't run into index out of range panics
    for ci != -1 && nSrc < len(src) {
        // Copy source data before `nSrc+ci` that doesn't need transformation
        copied := copy(dst[nDst:nDst+ci], src[nSrc:nSrc+ci])
        nDst += copied
        nSrc += copied

        // Copy new data with transformation to `dst`
        nDst += copy(dst[nDst:nDst+len(t.New)], t.New)

        // Skip the rest of old bytes in the next iteration
        nSrc += len(t.Old)

        // search for the next occurrence of `t.Old`
        ci = bytes.Index(src[nSrc:], t.Old)
    }

    // Report the rest of the data as not yet processed if it contains the first byte of `t.Old`
    // (e.g. if the end is `&amp` and we're looking for `&amp;`).
    // This data is not copied to `dst` yet, so the next call sees it again.
    // At the end of the input (`atEOF`) the check is skipped, as the data might simply end with `&amp`.
    // Note that the check matches that first byte anywhere in the remainder, not only a partial match at its end.
    if bytes.Contains(src[nSrc:], t.Old[0:1]) && !atEOF {
        err = transform.ErrShortSrc
        return
    }

    // Copy rest of data that doesn't need any transformations
    // The for loop processed everything except this last chunk
    copied := copy(dst[nDst:], src[nSrc:])
    nDst += copied
    nSrc += copied

    return nDst, nSrc, err
}

// To satisfy the transform.Transformer interface
func (t *simpleTransformer) Reset() {}

The implementation has to deal with sequences that are split between multiple calls of the Transform method, which is why it returns transform.ErrShortSrc to tell the transform.Reader that it needs to see the following bytes before it can decide.

This can now be used to replace characters in a stream:

var input = strings.NewReader(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
r := transform.NewReader(input, &simpleTransformer{[]byte(`&amp;`), []byte(`&`)})
io.Copy(os.Stdout, r) // Instead of io.Copy, use the JSON decoder to read from `r`

Output:

{"name":"A&B","comment":"foo&bar"}

You can also see this in action on the Go Playground.

Source: https://stackoverflow.com/questions/60366710/how-to-transform-html-entities-via-io-reader

Tags

streaming

transformation

html-entities