PATCH: [perl #125990] panic: reg_node overrun
This is a result of a design flaw that I introduced in earlier releases when attempting to fix earlier design flaws in dealing with the outlier character ß, LATIN SMALL LETTER SHARP S. The uppercase of this letter is SS, so that when comparing case-insensitively, it should match 'ss', and hence, in Unicode terminology, it folds to 'ss'. This character is the only one representable without using UTF-8 whose fold is longer than 1 byte, and so has to have special treatment. Similarly, the sequence 'ss' can match caselessly the single byte ß, and this is the only such sequence that can match something shorter than it, unless UTF-8 is involved. The matter is complicated by the fact that under /di rules, the ß and 'ss' don't match each other, unless the target string is in UTF-8.

The solution I used earlier (and continue to use) was to create a special regnode EXACTFU_SS under /ui rules, in which any ß is folded to 'ss'. But under /di rules, a regular EXACTF regnode is used, and any ß is retained as-is.

The problem reported here arises when something during the sizing pass tells perl to use /ui rules rather than the /di rules that were in effect at the beginning. Recall that perl uses /d rules, for backward compatibility, unless something overrides them. This can be a 'use' declaration, an explicit character set pattern modifier, or something in the pattern. This bug happens only with the final case.

There are several Unicode-defined constructs that can occur in patterns; if one is found, the perl interpreter infers that Unicode is desired, and switches from /d to /u for the whole pattern. Two such constructs are a Unicode property, \p{}, and a Unicode named character, \N{}. The problem-reproducing code for this ticket uses the latter.

The problem was that the switch from /di to /ui was deferred until AFTER the sizing pass. (A flag was set when one of these constructs was encountered to tell the parser to later do the switch.)
During the second pass, the code realizes it is under /ui, so it creates an EXACTFU_SS node and folds the ß into 'ss'. But the first pass thought it was under /di, so it sized for just the ß, i.e., for 1 byte, so we exceed the allocated space and do a wild write. This may not cause a problem if the malloc'd space had been rounded up and there were only a few of these ß characters.

One solution I considered was just keeping a global count of the ß characters in EXACTF nodes. One could just add these to the space reserved if /ui rules ended up being used. The problem with this is that nodes that are near their maximum size without the extra space could exceed it with the extra space, and thus would have to be split into 2 nodes, and the extra node would have an unplanned-for header, taking up more unaccounted-for space. So that doesn't work.

One could also just reserve two bytes for every ß in an EXACTF node, thus wasting space unless /ui ends up being used. But the bigger problem is that the code that splits nodes would have to be made more complicated. It has to find a suitable splitting spot, by searching through the text of the node, and now it would have to deal with some of that space not being set.

Instead, I opted to change the code so that when it finds one of these Unicode-defined constructs, it switches to /u immediately during the sizing pass. That means that the parse afterwards knows that it is /u and allocates the correct space. (We now have to remain in /u for the remainder of the pass, so some code that reverted this had to change.) This fixes the test case in the ticket.

But there remains a problem if the sizing has happened earlier in the parse before the construct that changes from /d to /u is encountered. Like:

    qr/.....ß....\N{}/di

The incorrect sizing has already happened by the time the \N{} is encountered. One could solve this by restarting the parse whenever the /d goes to /u (under /i, as this issue isn't a problem except when folding ß). That slows things down.
Instead, I opted to set a global flag whenever a ß is found in an EXACTF node. If that flag isn't set at the time of the /d to /u switch, there's no need to restart the parse.

A 'use utf8' or 'use 5.012' or higher selects /u over /d, so the problem did not happen with them, nor if the pattern has to be converted to UTF-8, which restarts the sizing pass, and it only happens with the sharp s character. And unless there are several ß characters, the rounding-up of malloc space would probably keep this from being an issue. These explain why this hasn't been reported from the field.
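
The sizing mismatch and the restart-on-flag idea can be illustrated with a toy model. This is only a sketch in Python, not the regcomp.c code: the names (UNI, size_deferred, size_immediate, emitted_size) are hypothetical, a pattern is reduced to a list of single characters plus a marker standing in for \N{} or \p{}, and node headers, node splitting and UTF-8 are ignored.

```python
SHARP_S = "\u00df"   # ß, LATIN SMALL LETTER SHARP S
UNI = object()       # hypothetical stand-in for a \N{...} or \p{...} construct


def folded_len(ch, unicode_rules):
    """Bytes one literal needs after folding: ß becomes 'ss' only under /ui."""
    return 2 if (ch == SHARP_S and unicode_rules) else 1


def size_deferred(tokens):
    """Model of the old behavior: the whole sizing pass runs under /d rules."""
    return sum(folded_len(t, False) for t in tokens if t is not UNI)


def size_immediate(tokens):
    """Model of the new behavior: switch to /u at once, and restart the
    sizing pass only if a ß was already sized under /d rules."""
    unicode_rules = False
    restarts = 0
    while True:
        size = 0
        seen_sharp_s = False   # the "global flag" from the description
        restart = False
        for t in tokens:
            if t is UNI:
                if not unicode_rules:
                    unicode_rules = True
                    if seen_sharp_s:
                        restart = True
                        break
                continue
            if t == SHARP_S and not unicode_rules:
                seen_sharp_s = True
            size += folded_len(t, unicode_rules)
        if not restart:
            return size, restarts
        restarts += 1


def emitted_size(tokens):
    """Model of the second pass: /u applies if any Unicode construct appears."""
    unicode_rules = any(t is UNI for t in tokens)
    return sum(folded_len(t, unicode_rules) for t in tokens if t is not UNI)


if __name__ == "__main__":
    # Python's own full case folding agrees that ß folds to 'ss'.
    assert SHARP_S.casefold() == "ss"
    for name, toks in [("construct first", [UNI, "a", SHARP_S, SHARP_S]),
                       ("construct last", ["a", SHARP_S, SHARP_S, UNI])]:
        print(name, size_deferred(toks), size_immediate(toks), emitted_size(toks))
```

In this model the deferred sizing comes out smaller than what the second pass writes whenever a ß and a Unicode construct share a pattern, while the immediate switch matches it and restarts only when the ß precedes the construct.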