-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tied separator and arguments in join() give unexpected results #21458
Comments
This feels like definitely a bug in 664 STRLEN delimlen;
665 const char * const delims = SvPV_const(delim, delimlen); This pointer is now just pointing into the I suspect a fix for this would be to copy the delim SV into a temporary buffer and use that. STRLEN delimlen;
const char * const delims = SvPV_const(delim, delimlen);
delims = savepvn(delims, delimlen);
SAVEFREEPV(delims); |
Attempted fix at: #21484 |
On Fri, Sep 15, 2023 at 01:50:11AM -0700, Paul Evans wrote:
Attempted fix at: https://github.com/leonerd/perl5/pull/new/gh21458
I think that's somewhat the wrong approach. It potentially penalises joins
performance-wise even when no magic or ties are present, and it doesn't
catch all possible side-effects of tying.
As a trivial example, what should happen if the FETCH() method for a tied
argument modifies the (non-tied) separator? E.g.:
{
package Foo;
my $sep = '-';
sub TIESCALAR { bless [] }
sub FETCH { $sep = ':'; return 'X' }
tie my $tied, 'Foo';
print join($sep, "A", "B", $tied, "C", "D"), "\n";
}
Should that print "A-B-X:C:D"? It currently prints "A-B-X-C-D".
But if you change it so that $sep is non-COW:
my $n = 1;
my $sep = '-' x $n;
then it does indeed print "A-B-X:C:D". But best of all, if you make $sep
initially longer, then shrink it:
my $n = 3;
my $sep = '-' x $n;
then the output includes the \0's now at pv[1] in $sep's string buffer:
$ perl5380 ~/tmp/p | od -c
0000000 A - - - B - - - X : \0 - C : \0 -
0000020 D \n
Personally I think we should:
* declare in the documentation that the behaviour of join is undefined if
the value of the separator changes during the course of the join. This
leaves us with flexibility in future for whatever optimisations or
overload adding or whatever we later choose to do.
* the code should be robust enough to handle the separator changing
without being potentially insecure or crashworthy.
So in particular, each time before the delimiter is used, it should be
SvPV'ed anew. It might not even be a valid string any more. Or as an
optimisation, only SvPV() it anew after each time you encounter a SMG
argument?
The alternative is to make a copy of the delimiter once at the beginning,
but that is relatively expensive- more expensive I suspect, than doing
SvPV each time.
A hybrid approach might to be observe that a delimiter is often only a
single byte long - if so, grab the char at the beginning, then use that
each time?
Or perhaps, after the delimiter is appended to the target string for the
first time, use that part of the target string as the source for any
future delimiter appends? The code would have to be careful to cope with
utf8 changes and the buffer being reallocated.
…--
"You may not work around any technical limitations in the software"
-- Windows Vista license
|
I think we should declare in the docs that ties and overloads on the separator will be triggered at most once, and that the value of the separator will not change for the lifetime of the join operation. Anything else means the user has no "fast" way to stringify and concatenate a set of items. Put alternatively I think join should be implemented such that the user can assume it is the fastest way to concatenate a set of items, and that every aspect of its implementation should be optimized for speed. Part of the issue here is that there are multiple valid interpretations of what join does in terms of the operators it represents. So for instance I can think of multiple different interpretations of
Personally I would vote for #7 or #8, as under those interpretations join() would NEVER return an overloaded object as it would be the result of joining the stringified version of its arguments (which IMO should never return anything but a string). The point being that I would prefer we leave it completely undefined what order the arguments are stringified in, and what order they are joined in so that we are allowed to implement join in whatever way would achieve maximum speed. I understand that there are those that want join() to be equivalent to some particular operator sequence, but we have never defined that sequence, and there are multiple valid optrees for this type of expression, which would be equally valid depending on the type of parser used to process the expression.
I strongly agree. This has come up in other places, like named user defined unicode sequences in the regex engine.
This makes sense to me personally and how i would expect it to be done. Yves |
I agree with this for tied variables. "At most once" is also backwards compatible with the current behaviour. (@demerphq: I'll reply separately to the rest of your points.) |
@demerphq I didn't list "overload" in my reply above, because I wanted to gather my thoughts and not reply too quickly. However, if we say "ties will be triggered at most once", then there's definitely a point to be made about also "applying the concat overload on the separator at most once", if only for consistency regarding side-effects. So I've come to agree with the full sentence I quoted. 🎉 My wording would be:
So, the separator should be stringified at most once, if needed to perform the Here's what I mean by "if needed": while looking at the code with @leonerd, we discovered that It's rather easy to check the size of the argument list, so I think we should actually modify the |
@book I think there are a couple of problems with this:
First there is existing behavior, join on a single item returns the stringified form of the item, possibly throwing warnings if that item is undefined.
Additionally, I would expect all of the following to return the same thing:
My point is I would expect that Also worth considering is what happens with an empty list. I would expect join to return the empty string when joining 0 items.
So I think that join() must continue to return a string. I dont think it should ever return an object or pass through an argument. I think there is room to debate how many times the separator should be stringified, but not the return from the operator. |
The conversation on what About this issue, if everyone agrees with:
(i.e. what I wrote above, without the pass-through part) then:
|
Isn't this a step away from ppc0013? ppc0013 states:
but the above means that if only It also means that unlike the reduce(...) the overloading on For the trivial common case where
If |
Yes, it is a step away from PPC 0013. It is also a step forward in a negotiation with @demerphq, who doesn't want PPC 0013 to happen at all. (Quoting an earlier message of his: "I think that join() must continue to return a string. I dont think it should ever return an object or pass through an argument.") However, I'd rather have this conversation outside of this issue, which is about an actual bug in the current implementation of |
@book, FWIW, its not that I /want/ to block PSC 0013, it is because I think it is sufficiently flawed that we are /required/ to block it until those flaws can be resolved. I agree this ticket isnt the forum for that discussion however. I will try to find time in the next day or two to post something which summarizes the problems I see in it. |
This code had a few problems: - changes to the content of delim from set or overload magic could result in the separator between elements changing during the process of the join. - changes to the content of delim which allocated a new PVX resulted in access to freed memory - changes to the flags of delim, the UTF-8 flag in particular, could result in an invalid joined string, either mojibake or an invalidly encoded upgraded string To avoid that, we copy the separator, either into a local buffer if it's large enough, or an allocated buffer, and save the flag we use, to prevent changes to the delim SV from changing or invalidating the delimpv value. Fixes Perl#21458 and some similar problems.
This code had a few problems: - changes to the content of delim from set or overload magic could result in the separator between elements changing during the process of the join. - changes to the content of delim which allocated a new PVX resulted in access to freed memory - changes to the flags of delim, the UTF-8 flag in particular, could result in an invalid joined string, either mojibake or an invalidly encoded upgraded string To avoid that, we copy the separator, either into a local buffer if it's large enough, or an allocated buffer, and save the flag we use, to prevent changes to the delim SV from changing or invalidating the delimpv value. Fixes #21458 and some similar problems.
This is a bug report for perl, generated with the help of perlbug 1.43 running under perl 5.39.3.
Using a self-modifying tied variable as the separator in
join
gives unexpected results.Description
While implementing PPC 0013, I've been exploring the behaviour of
join
with overloaded separator and arguments, and also with tied separator and arguments.This case is about using a self-modifying (in
FETCH
) tied variable both as a separator and multiple times in the argument list.Expected behavior
The following (passing) test case sets expectations:
FETCH
is called once on a tied separator, and the returned string is used to join the argumentsjoin
with a single argument detected at compilation is replaced with anOP_STRINGIFY
, soFETCH
is never calledFETCH
is called once to get the separating string, and once for each occurence of the variable in the argument listSteps to Reproduce
When the tied variable is self-modifying in
FETCH
, the result does not seem to follow the above expectations:In the last call to
join
,FETCH
is called the expected number of times, but the result is not what we expect, based on the behaviour described in the first case:versus:
Flags
Perl configuration
The text was updated successfully, but these errors were encountered: