[PATCH] Save and restore captures for BRANCH/TRIE failures

[I posted this to RT, but I haven't seen it come through p5p, so
sending this one directly too.  Sorry if this is a duplicate!  Yves &
Dave -- would you be willing to take a look at this patch?  I'd love
to get your feedback if you wouldn't mind!]

This is a first pass at a patch to fix RT #133352, to save and restore
all captures during alternations (BRANCH and TRIE) to avoid incorrectly
returning a capture from a failed branch, as happens with the following
test case, which has been broken ever since Perl 5.000 in 1994:

     perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while
Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again),
and I am submitting it as a preliminary patch for discussion purposes;
I don't consider it ready to apply yet, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that
  I may have overlooked with this patch.  Since TRIE_next_fail included
  a test for ST.jump which would call UNWIND_PAREN(), I want to better
  understand how that was working and determine whether ST.jump being
  set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and
  the patch to fix it, and I know that both BRANCH and TRIE variations
  need to be tested.

* This patch is currently saving ALL captures, including captures that
  can't possibly change inside the alternation.  This is an easier fix,
  but I believe the performance impact should be mitigated, since this
  would impact many everyday regular expressions which aren't affected
  by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely
  mitigate most (but not all) of the performance impact of this fix,
  but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only
  for alternations containing capture groups, maybe only for ones which
  also have a quantifier applied, since I believe both are required to
  trigger this bug.  (This might have to be treated as an optimization
  in regcomp.c to use the old logic for alternations that can't trigger
  the bug, but it might be hard to tell if a quantifier is applied for
  qr// cases.)

* Regardless, I would like feedback from an expert on this part of the
  regular expression engine (maybe Yves?) about this patch and whether
  the approach I've taken is on the right track.
---
 regexec.c | 28 ++++++++++++++++------------
 regexp.h  |  3 +++
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/regexec.c b/regexec.c
index ba52ae9..1d42b0d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
                HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG(
scan ) + 1 ]);
                 U32 state = trie->startstate;

+               /* save state of captures at branch start */
+               ST.cp = regcppush(rex, 0, maxopenparen);
+               REGCP_SET(ST.lastcp);
+
                 if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {
                     _CHECK_AND_WARN_PROBLEMATIC_LOCALE;
                     if (utf8_target
@@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char
*startpos, regnode *prog)
        case TRIE_next_fail: /* we failed - try next alternative */
         {
             U8 *uc;
-            if ( ST.jump ) {
-                /* undo any captures done in the tail part of a branch,
-                 * e.g.
-                 *    /(?:X(.)(.)|Y(.)).../
-                 * where the trie just matches X then calls out to do the
-                 * rest of the branch */
-                REGCP_UNWIND(ST.cp);
-                UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
-           }
+
+           /* restore state of captures from branch start */
+           regcp_restore(rex, ST.lastcp, &maxopenparen);
+
            if (!--ST.accepted) {
                DEBUG_EXECUTE_r({
                     Perl_re_exec_indentf( aTHX_  "%sTRIE failed...%s\n",
@@ -8043,7 +8042,10 @@ NULL
            ST.lastparen = rex->lastparen;
            ST.lastcloseparen = rex->lastcloseparen;
            ST.next_branch = next;
-           REGCP_SET(ST.cp);
+
+           /* save state of captures at branch start */
+           ST.cp = regcppush(rex, 0, maxopenparen);
+           REGCP_SET(ST.lastcp);

            /* Now go into the branch */
            if (has_cutgroup) {
@@ -8077,8 +8079,10 @@ NULL
                do_cutgroup = 0;
                no_final = 0;
            }
-           REGCP_UNWIND(ST.cp);
-            UNWIND_PAREN(ST.lastparen, ST.lastcloseparen);
+
+           /* restore state of captures from branch start */
+           regcp_restore(rex, ST.lastcp, &maxopenparen);
+
            scan = ST.next_branch;
            /* no more branches? */
            if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {
diff --git a/regexp.h b/regexp.h
index 44409f0..aa61c97 100644
--- a/regexp.h
+++ b/regexp.h
@@ -754,6 +754,7 @@ typedef struct regmatch_state {
            U32 lastparen;
            U32 lastcloseparen;
            CHECKPOINT cp;
+           CHECKPOINT lastcp;

         } branchlike;

@@ -763,6 +764,7 @@ typedef struct regmatch_state {
            U32 lastparen;
            U32 lastcloseparen;
            CHECKPOINT cp;
+           CHECKPOINT lastcp;

            regnode *next_branch; /* next branch node */
        } branch;
@@ -773,6 +775,7 @@ typedef struct regmatch_state {
            U32 lastparen;
            U32 lastcloseparen;
            CHECKPOINT cp;
+           CHECKPOINT lastcp;

            U32         accepted; /* how many accepting states left */
            bool        longfold;/* saw a fold with a 1->n char mapping */
--
2.7.4
0
deven
7/11/2018 2:31:18 PM
perl.perl5.porters 47265 articles. 0 followers. Follow

0 Replies
9 Views

Similar Articles

[PageSpeed] 59

Reply: