Request for assistance: Unwinding stacks on exceptions


TL;DR: die()ing from inside call_sv()ed code doesn't pop the CXt_SUB
  frame with retop==NULL - why?


Background: Today I have a question related to my Future::AsyncAwait
  work, for which I now have a TPF grant :) I'm looking into a failure
  case in which the actual C-level stack (as shown by e.g. gdb's `bt`)
  disagrees with the Perl-level context ("CX") stack once a certain
  condition has occurred. Looking at this condition, I can't see that
  I'm doing anything different from any other normal (XS) perl code,
  so I thought I'd ask here in case anyone has any insight.

The scenario concerns the operation of the API function `call_sv()` (and
friends like `call_method()`, which are just wrappers around it).

Under normal, non-exceptional circumstances in which no code dies, the C
stack and the Perl-level context stack agree.

For example:

  /* some amount of context stack is here; cxstack_ix has a value */

  PUSHMARK(SP);
  XPUSHs(some arguments...);
  PUTBACK;

  call_sv(some CV ref..., G_SCALAR);

  /* at this point, cxstack_ix has returned to the same height */

During the operation of `call_sv()` more entries get pushed onto the
context stack, but they are popped again by the time it returns, so
the whole operation appears nice and neat. At the C level, the
`call_sv()` function appeared on the C stack and made a nested call to
the main runops loop, which eventually terminated and let `call_sv()`
return. If you consider the ranges of entries on the C and Perl CX
stacks, they form properly nested regions. All well and good.

Internally, `call_sv()`:
 1) creates a new temporary LOGOP with zeroed-out fields
 2) sets PL_op to point to it
 3) PUSHs() the SV to invoke onto the argument stack

Then, in the case where the G_EVAL flag isn't set (as here):
 4) it calls the CALL_BODY_SUB() macro to invoke it, which itself
    invokes OP_ENTERSUB.
 5) OP_ENTERSUB does a bunch of things, one key part of which is to
    call `cx_pushblock()` + `cx_pushsub()` to create the CXt_SUB
    context on the CX stack. Since PL_op is the new temporary op
    created at step 1, its op_next field is NULL, and thus this new
    CXt_SUB entry has NULL as its retop field. This part is critical.
 6) CALL_BODY_SUB() then calls CALLRUNOPS(), which invokes the main
    runops loop a nested time on the C stack.
 7) This nested runops loop runs the body of the called sub, which
    ends in an OP_LEAVESUB.
 8) This OP_LEAVESUB pops all the context frames that OP_ENTERSUB
    pushed in step 5, ultimately returning the retop pointer to the
    runops loop. Because of the zeroed op_next field from step 1, this
    is a NULL pointer.
 9) This NULL pointer is the signal for the (inner, nested) runops loop
    to stop its work. It thus returns to the call_sv() function that
    invoked it.
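To make steps 1 to 9 concrete, here is a toy model in C. It is not perl's real code: op_t, cur_op, cx_stack, cx_ix and the pp_* functions are hypothetical stand-ins for OP, PL_op, the context stack and the real PP functions, cut down to the one field (retop) that matters here. It shows how a zeroed op_next on the temporary op becomes a NULL retop, which then stops the nested runloop at exactly the moment the sub's frame is popped.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT perl's code. Hypothetical stand-ins:
 * op_t ~ OP, cur_op ~ PL_op, cx_stack/cx_ix ~ cxstack/cxstack_ix. */
typedef struct op op_t;
typedef op_t *(*pp_t)(void);
struct op { pp_t op_ppaddr; op_t *op_next; };
typedef struct { op_t *retop; } cx_t;     /* just the CXt_SUB retop */

static cx_t  cx_stack[16];
static int   cx_ix = -1;                  /* -1 means the stack is empty */
static op_t *cur_op;
static op_t *sub_target;                  /* stands in for the pushed CV */
static int   runops_depth, max_runops_depth, work_done;

static op_t *pp_entersub(void) {
    /* Step 5: the new frame's retop is cur_op->op_next, which is NULL
     * because cur_op is call_sv()'s zeroed temporary op. */
    assert(cx_ix + 1 < 16);
    cx_stack[++cx_ix].retop = cur_op->op_next;
    return sub_target;
}

static op_t *pp_leavesub(void) {
    /* Step 8: pop the frame and hand its retop back to the runloop. */
    return cx_stack[cx_ix--].retop;
}

static op_t *pp_work(void) { work_done++; return cur_op->op_next; }

/* The called sub's body: one op of "work", then leavesub. */
static op_t sub_leave = { pp_leavesub, NULL };
static op_t sub_body  = { pp_work, &sub_leave };

static void toy_runops(void) {
    /* Steps 6, 7 and 9: run ops until one of them returns NULL. */
    if (++runops_depth > max_runops_depth)
        max_runops_depth = runops_depth;
    while ((cur_op = cur_op->op_ppaddr()) != NULL) {}
    runops_depth--;
}

static void toy_call_sv(op_t *cv_start) {
    op_t  myop  = { pp_entersub, NULL };  /* step 1: op_next is NULL */
    op_t *oldop = cur_op;
    cur_op = &myop;                       /* step 2 */
    sub_target = cv_start;                /* step 3, roughly PUSHs(sv) */
    cur_op = pp_entersub();               /* step 4, like CALL_BODY_SUB() */
    if (cur_op)
        toy_runops();
    cur_op = oldop;
}
```

After `toy_call_sv(&sub_body)`, cx_ix is back where it started and exactly one nested runloop was used, mirroring the "nicely nested" picture of the non-exceptional case.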

At this point, the C stack has been wound back to how we started, as
has the Perl CX stack. The two are in agreement.

What I don't understand is what happens on an exception: what
happens if the code body inside the SV we're `call_sv()`'ing in fact
dies?

What I am discovering in my scenario in Future::AsyncAwait is that the
OP_DIE causes a large amount of C stack unwinding, as a result of the
setjmp/longjmp magic behind the JMPENV_* macros, such that the
`call_sv()` frame is unwound completely. At this point the interpreter
has returned to the _outer_ toplevel runops loop, the one invoked by
Perl itself to run the toplevel of my actual script. But it hasn't
unwound the CX stack, so that inner CXt_SUB frame (the one with the
NULL retop) still exists on there. The C and Perl CX stacks have become
misaligned.
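The core of this mismatch can be reproduced in miniature with plain setjmp()/longjmp(). In this hypothetical sketch, outer_env stands in for the toplevel JMPENV and cx_ix for cxstack_ix; none of it is perl's actual code. longjmp() discards the C frame of called_code(), but it does nothing to a separately maintained stack, so the frame that function pushed is left behind.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-ins, NOT perl's code:
 * outer_env ~ the toplevel JMPENV, cx_ix ~ cxstack_ix. */
static jmp_buf outer_env;
static int     cx_ix = -1;

static void called_code(void) {
    cx_ix++;                    /* push a CXt_SUB-like frame */
    longjmp(outer_env, 3);      /* the code dies: like JMPENV_JUMP(3) */
    /* a matching cx_ix-- would go here, but it is never reached */
}

/* Returns how many frames remain on the stand-in context stack after
 * the longjmp has already removed called_code()'s C frame. */
static int call_and_die(void) {
    int start_ix = cx_ix;
    if (setjmp(outer_env) == 0) {
        called_code();
        return 0;               /* not reached: called_code() always dies */
    }
    /* Back in this C frame: the C stack has unwound, the context stack has not. */
    return cx_ix - start_ix;
}
```

Calling `call_and_die()` returns 1: control is back in the outer C frame, yet one extra context entry is still there.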

As a result, the outer runops loop continues to run code until the
next time something tries to OP_LEAVESUB back past the frame that
should have been tidied away. That OP_LEAVESUB returns the frame's
NULL retop, which makes the main runops loop stop, signalling to Perl
that the main program is finished and it's time to run END blocks and
global destruction.

This results in the failure seen at

  https://rt.cpan.org/Ticket/Display.html?id=126037

What puzzles me here is that I'm not really doing anything odd with
my `call_sv()` (well, `call_method()`) calls. Comparing my code with
lots of other XS examples (many of which I've written myself), I don't
see any other code that has this trouble. F:AA seems to be unique in
arriving at this disconnection.

Is anyone able to shed any light on this puzzle?

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk      |  https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/  |  https://www.tindie.com/stores/leonerd/

leonerd
1/3/2019 3:24:39 PM
perl.perl5.porters

On Thu, 3 Jan 2019 15:24:39 +0000
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> Is anyone able to shed any light on this puzzle?

For posterity: It turns out the missing trick was docatch().

In particular, the PP function for a regular OP_ENTERTRY contains a
rather subtle line of code near its beginning:

      RUN_PP_CATCHABLY(Perl_pp_entertry);

This macro contains a fancy trick:

  #define RUN_PP_CATCHABLY(thispp) \
      STMT_START { if (CATCH_GET) return docatch(thispp); } STMT_END

If CATCH_GET is true, the entire PP function is instead invoked via
docatch(), and the result of that is returned. This has the effect of
running the op function inside its own nested runops loop, under its
own JMPENV, so that an exception raised within it is caught at the
right level of the C stack.

It was this subtle trick that I was missing when, as part of the
suspend/resume logic, I restored a CXt_EVAL frame that I had previously
unwound. With that in place, this bug is now fixed.
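Both the failure and the fix can be sketched as a toy interpreter in C. Everything below is hypothetical and heavily simplified: op_t, cur_op, jmpenv_t, top_env, mustcatch, restartop and the pp_* functions are stand-ins loosely modelled on OP, PL_op, JMPENV, PL_top_env, CATCH_GET and PL_restartop, based on a reading of the perl source rather than on its exact code, and use_catchably exists only so the RUN_PP_CATCHABLY-style fix can be switched on and off. With the fix off, a die() inside a call_sv()-style call longjmps to the toplevel loop, which resumes with the call's NULL-retop frame still on the context stack and then stops before the main program's remaining op runs. With it on, pp_entertry goes through docatch(), so the exception is caught within the C frame of the call and the main program completes.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Toy model, NOT perl's code. Hypothetical stand-ins: op_t ~ OP, cur_op ~ PL_op,
 * jmpenv_t ~ JMPENV, top_env ~ PL_top_env, mustcatch ~ CATCH_GET. */
typedef struct op op_t;
typedef op_t *(*pp_t)(void);
struct op { pp_t op_ppaddr; op_t *op_next; op_t *op_other; };
typedef struct jmpenv { jmp_buf buf; struct jmpenv *prev; int mustcatch; } jmpenv_t;
enum { CX_SUB, CX_EVAL };
typedef struct { int type; op_t *retop; jmpenv_t *env; } cx_t;
static cx_t cx_stack[16];
static int cx_ix = -1, use_catchably, after_ran, stale_frames;
static op_t *cur_op, *sub_target, *restartop;
static jmpenv_t *top_env, *restartenv;
static op_t *pp_entersub(void), *pp_leavesub(void), *pp_entertry(void);
static op_t *pp_die(void), *pp_call(void), *pp_after(void);

/* Called sub: entertry, die, leavesub (leavesub is also the eval's retop). */
static op_t sub_leave = { pp_leavesub, NULL, NULL };
static op_t sub_die   = { pp_die, NULL, NULL };
static op_t sub_try   = { pp_entertry, &sub_die, &sub_leave };
/* Main program: call the sub, then do more work afterwards. */
static op_t m_after   = { pp_after, NULL, NULL };
static op_t m_call    = { pp_call, &m_after, NULL };

static void toy_runops(void) {         /* run ops until one returns NULL */
    while ((cur_op = cur_op->op_ppaddr()) != NULL) {}
}
/* Run firstpp under a fresh jmpenv, inside a nested runloop of its own. */
static op_t *docatch(pp_t firstpp) {
    jmpenv_t env = { .prev = top_env, .mustcatch = 0 };
    op_t *oldop = cur_op;
    top_env = &env;
    switch (setjmp(env.buf)) {
    case 0:
        cur_op = firstpp();            /* mustcatch is 0 here: no recursion */
        break;
    case 3:                            /* die(): was it caught by our eval? */
        if (restartop && restartenv == &env) {
            cur_op = restartop; restartop = NULL;
            break;
        }
        top_env = env.prev;            /* not ours: propagate outwards */
        longjmp(top_env->buf, 3);
    }
    toy_runops();
    top_env = env.prev;
    cur_op = oldop;
    return NULL;                       /* so the enclosing runloop stops too */
}
#define RUN_PP_CATCHABLY(thispp) \
    do { if (use_catchably && top_env->mustcatch) return docatch(thispp); } while (0)

static op_t *pp_entertry(void) {
    RUN_PP_CATCHABLY(pp_entertry);
    cx_stack[++cx_ix] = (cx_t){ CX_EVAL, cur_op->op_other, top_env };
    return cur_op->op_next;
}
static op_t *pp_die(void) {            /* find the eval, unwind, longjmp */
    int i = cx_ix;
    while (i >= 0 && cx_stack[i].type != CX_EVAL) i--;
    assert(i >= 0);                    /* this demo always has an eval */
    restartop = cx_stack[i].retop; restartenv = cx_stack[i].env;
    cx_ix = i - 1;                     /* pop the eval and everything above */
    longjmp(restartenv->buf, 3);       /* like JMPENV_JUMP(3) */
}
static op_t *pp_entersub(void) {       /* retop is NULL under call_sv */
    cx_stack[++cx_ix] = (cx_t){ CX_SUB, cur_op->op_next, NULL };
    return sub_target;
}
static op_t *pp_leavesub(void) { return cx_stack[cx_ix--].retop; }
static op_t *pp_after(void) { after_ran = 1; return cur_op->op_next; }

static void toy_call_sv(op_t *cv_start) {
    op_t myop = { pp_entersub, NULL, NULL }, *oldop = cur_op;
    int oldcatch = top_env->mustcatch;
    cur_op = &myop; sub_target = cv_start;
    top_env->mustcatch = 1;            /* like CATCH_SET(TRUE) */
    if ((cur_op = pp_entersub()) != NULL)
        toy_runops();
    top_env->mustcatch = oldcatch;
    cur_op = oldop;
}
static op_t *pp_call(void) { toy_call_sv(&sub_try); return cur_op->op_next; }
/* Returns 1 if the main program ran to completion, 0 if it was cut short. */
static int toy_run_main(int catchably) {
    static jmpenv_t main_env;
    use_catchably = catchably; after_ran = stale_frames = 0;
    cx_ix = -1; restartop = NULL; main_env.prev = NULL; main_env.mustcatch = 0;
    top_env = &main_env; cur_op = &m_call;
    if (setjmp(main_env.buf) != 0) {   /* die() landed in the toplevel loop */
        stale_frames = cx_ix + 1;      /* main owns no frames: these are stale */
        cur_op = restartop; restartop = NULL;
    }
    toy_runops();
    return after_ran;
}
```

In this model `toy_run_main(0)` returns 0 with one stale frame seen at the restart, while `toy_run_main(1)` returns 1 with none.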

leonerd
1/4/2019 4:58:31 PM