Question
This code:
my $þor-blob = Blob.new("þor".ords);
$þor-blob.decode( "ascii", :replacement("0"), :strict(False) ).say
Fails with:
Will not decode invalid ASCII (code point > 127 found)
And this one:
my $euro = Blob.new("3€".ords);
$euro.decode( "latin1", :replacement("euro") ).say
Simply does not seem to work, replacing € by ¬.
It's true that those methods are not tested, but is the syntax right?
Answer 1:
TL;DR:
Only samcv or some other core dev can provide an authoritative answer. This is my understanding of the code, comments, and results I see.
If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1
Specifying the $replacement argument matches a different P6 core multi method than not doing so. Let's call it the "replacer" code path.
The "replacer" code path passes the $replacement and $strict arguments onto a code path in nqp that in turn passes them onto a code path in the backend that handles replacements.
On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2
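As a rough illustration of how passing a named argument can select a different multi candidate, here is a minimal sketch. The name decode-demo and both candidates are hypothetical stand-ins, not the Rakudo source; the only thing borrowed from the real method is the idea of a required named parameter (`:$replacement!`).

```raku
# Hypothetical stand-ins for illustration only; this is not the Rakudo source.
# The `!` marks :$replacement as a required named parameter, so this
# candidate can only bind when the caller passes it.
multi sub decode-demo(Str $encoding, Str :$replacement!) {
    "replacer path for $encoding"
}

# Candidate used when no :replacement is given.
multi sub decode-demo(Str $encoding) {
    "plain path for $encoding"
}

say decode-demo("ascii");                     # plain path for ascii
say decode-demo("ascii", :replacement("?"));  # replacer path for ascii
```

Whether the selected candidate then does anything useful with the replacement is a separate matter, which is what the rest of this answer is about.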
Following the relevant code path
Your code calls this code in Buf.pm6:
multi method decode(Blob:D: $encoding,
Str :$replacement!,
Bool:D :$strict = False) {
nqp::p6box_s(
nqp::decoderepconf(
self,
Rakudo::Internals.NORMALIZE_ENCODING($encoding),
$replacement.defined ?? $replacement !! nqp::null_s(),
$strict ?? 0 !! 1))
}
The nqp::decoderepconf function directly maps to a corresponding function in the backend.
On the MoarVM backend, it's MVM_string_decode_from_buf_config in ops.c.
This in turn calls MVM_string_decode_config in the same file.
From this latter function's comments, there are a couple key sentences that presumably explain the relevance of the replacement and strictness arguments:
"Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping."
"For now windows-1252 and windows-1251 are the only ones this makes a difference on."
Spelunking the code and commits in the repo suggests the latter comment is slightly out-of-date because it looks like it should make a difference on shiftjis too.
Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2
What happens with ascii and latin1 in particular
The current code for MVM_string_decode_config does not pass on the replacement/strictness arguments to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.
So, if you use the encoding "ascii" then the blob must only contain values between 0 and 127, and for "latin1" the values must be between 0 and 255.
say "þor".ords; # (254 111 114)
say "3€".ords; # (51 8364)
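To see in advance which values will trip a single-byte decoder, one can filter the code points against the decoder's upper limit. This is only a sketch: out-of-range is a hypothetical helper, not something Rakudo provides, and the limits 127 and 255 are the ones stated above.

```raku
# Hypothetical helper (not part of Rakudo): return the code points of $text
# that are larger than $max, the highest value the decoder accepts.
sub out-of-range(Str $text, Int $max) {
    $text.ords.grep(* > $max).List
}

say out-of-range("þor", 127);  # (254)
say out-of-range("þor", 255);  # ()
say out-of-range("3€", 255);   # (8364)
```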
The first string (as a Buf) fails to decode, and instead produces an error message, because 254 is more than 127 and the ascii decoder code in MoarVM reacts to an invalid value by throwing an exception with the "invalid ASCII" message.
The second replaces € with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte. For € (8364, or 0x20AC) the low byte is 0xAC (172), which is the code point for ¬ (in both latin1 and Unicode).3
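The truncation can be reproduced with plain integer arithmetic. The sketch below masks the code point of € down to its low 8 bits, which is assumed here to be equivalent to what storing it in an 8 bit Blob does, and then compares that with the Blob's actual contents.

```raku
my $code = "€".ord;        # code point of the euro sign
say $code;                 # 8364
say $code.base(16);        # 20AC

my $low = $code +& 0xFF;   # keep only the low 8 bits
say $low;                  # 172
say $low.chr;              # ¬

# The same values as they end up in an 8 bit Blob.
say Blob.new("3€".ords).list;  # (51 172)
```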
But it's no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu. I can see even if I can't C so it's clear to me that the MVM_string_latin1_decode function in MoarVM that decodes latin1 does not throw exceptions. So presumably when it encounters character values outside the range 0-255 it turns the higher bytes into tofu.
Footnotes
1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.
2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.
3 See timotimo++'s comment below.
Source: https://stackoverflow.com/questions/55353143/blob-decode-with-replacement-does-not-seem-to-work