Question
My Go program makes HTTP requests whose response bodies are large JSON documents whose strings encode the ampersand character `&` as `&amp;` (presumably due to some Microsoft platform quirk?). My program needs to convert those entities back to the ampersand character in a way that is compatible with `json.Decoder`.
An example response might look like the following:
```
{"name":"A&amp;B","comment":"foo&amp;bar"}
```

Whose corresponding object would be as below:

```go
pkg.Object{Name:"A&B", Comment:"foo&bar"}
```
The documents come in various shapes so it's not feasible to convert the HTML entities after decoding. Ideally it would be done by wrapping the response body reader in another reader that performs the transformation.
Is there an easy way to wrap the `http.Response.Body` in some `io.ReadCloser` which replaces all instances of `&amp;` with `&` (or, in the general case, replaces any string X with string Y)?
I suspect this is possible with `x/text/transform` but don't immediately see how. In particular, I'm concerned about edge cases wherein an entity spans batches of bytes; that is, one batch ends with `&am` and the next batch starts with `p;`, for example. Is there some library or idiom that gracefully handles that situation?
Answer 1:
If you don't want to rely on an external package like `transform.Reader`, you can write a custom `io.Reader` wrapper. The following will handle the edge case where the `fnd` sequence may span two `Read()` calls:
```go
type fixer struct {
	r        io.Reader // source reader
	fnd, rpl []byte    // find & replace sequences
	partial  int       // track partial find matches from previous Read()
}

// Read satisfies the io.Reader interface.
func (f *fixer) Read(b []byte) (int, error) {
	off := f.partial
	if off > 0 {
		copy(b, f.fnd[:off]) // copy any partial match from previous `Read`
	}
	n, err := f.r.Read(b[off:])
	n += off
	if err != io.EOF {
		// no need to check for a partial match at EOF, as that is the last Read!
		f.partial = partialFind(b[:n], f.fnd)
		n -= f.partial // lop off any partial bytes
	}
	fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)
	return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}
```
Along with this helper (which could probably use some optimization):
```go
// partialFind returns the number of matched bytes if the byte slice ends in a partial match of find.
func partialFind(b, find []byte) int {
	for n := len(find) - 1; n > 0; n-- {
		if bytes.HasSuffix(b, find[:n]) {
			return n
		}
	}
	return 0 // no match
}
```
Working playground example.
Note: to test the edge-case logic, one could use a `narrowReader` to ensure short `Read`s and force a match to be split across `Read`s, like this: validation playground example
回答2:
You need to create a `transform.Transformer` that replaces your characters. So we need one that transforms an old `[]byte` to a new `[]byte` while preserving all other data. An implementation could look like this:
```go
type simpleTransformer struct {
	Old, New []byte
}

// Transform transforms `t.Old` bytes to `t.New` bytes.
// The current implementation assumes that len(t.Old) >= len(t.New), but it also
// seems to work when len(t.Old) < len(t.New) (this has not been tested extensively).
func (t *simpleTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	// Get the position of the first occurrence of `t.Old` so we can replace it
	var ci = bytes.Index(src[nSrc:], t.Old)

	// Loop over the slice until we can't find any occurrences of `t.Old`,
	// also making sure we don't run into index-out-of-range panics
	for ci != -1 && nSrc < len(src) {
		// Copy source data before `nSrc+ci` that doesn't need transformation
		copied := copy(dst[nDst:nDst+ci], src[nSrc:nSrc+ci])
		nDst += copied
		nSrc += copied

		// Copy new data with transformation to `dst`
		nDst += copy(dst[nDst:nDst+len(t.New)], t.New)

		// Skip the rest of the old bytes in the next iteration
		nSrc += len(t.Old)

		// Search for the next occurrence of `t.Old`
		ci = bytes.Index(src[nSrc:], t.Old)
	}

	// Mark the rest of the data as not completely processed if it contains a start element of `t.Old`
	// (e.g. if the end is `&` and we're looking for `&amp;`).
	// This data will not yet be copied to `dst`, so we can work with it again.
	// If we are at the end (`atEOF`), we don't need to do the check anymore, as the string might just end with `&`.
	if bytes.Contains(src[nSrc:], t.Old[0:1]) && !atEOF {
		err = transform.ErrShortSrc
		return
	}

	// Copy the rest of the data that doesn't need any transformations.
	// The for loop processed everything except this last chunk.
	copied := copy(dst[nDst:], src[nSrc:])
	nDst += copied
	nSrc += copied
	return nDst, nSrc, err
}

// Reset satisfies the transform.Transformer interface.
func (t *simpleTransformer) Reset() {}
```
The implementation has to make sure that it deals with characters that are split between multiple calls of the `Transform` method, which is why it returns `transform.ErrShortSrc` to tell the `transform.Reader` that it needs more information about the next bytes.
This can now be used to replace characters in a stream:
```go
var input = strings.NewReader(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
r := transform.NewReader(input, &simpleTransformer{[]byte(`&amp;`), []byte(`&`)})
io.Copy(os.Stdout, r) // Instead of io.Copy, use the JSON decoder to read from `r`
```
Output:
```
{"name":"A&B","comment":"foo&bar"}
```
You can also see this in action on the Go Playground.
Source: https://stackoverflow.com/questions/60366710/how-to-transform-html-entities-via-io-reader