Teaching Rakudo the tricks of Perl 5's regex optimiser

I'm cheating here - I'm using an e-mail message as a way to publish some notes
for Jonathan (or *anyone else interested*):

Jonathan did a talk in Riga, Perl 6 performance update,
https://perlcon.eu/talk/80
The (re-uploaded) live stream is at
https://www.youtube.com/watch?v=S5iVBlk7pdg#t=4h39m

One thing that this talk revealed is that (currently) Perl 5 beats
Rakudo on various regular expressions. What *I* know is that this is because
Perl 5 cheats - it has an "optimiser", which happens to automatically do
what jnthn then showed manually implemented in some of his benchmarks.

You can see what the Perl 5 regex optimiser is doing by using the re pragma.
I'm no expert on this thing, but playing with it, I can see that it can do
various things

0) Know that it can't do anything useful (it is skipped)
1) Know that a fixed string must be present
   Look for it (which is fast), not find it, and immediately return "no match"
2) Know that a fixed string must be present
   Look for it find it, carry on into the real engine
   (which arguably is slower than no optimiser)
3) Know that a fixed string must be present, *and* is the entire match
   Look for it (which is fast), find it, and immediately return "match".

Clearly cases (1) and (3) are the big wins here. Case (3) was the one I had
forgotten about - the cheating is so good that often the engine is never
entered.
   

So, if you run this:

$ perl -wle 'use re "debug"; print "Perl rules" =~ /^Perl/ ? "Y" : "n"'
Compiling REx "^Perl"
Final program:
   1: SBOL /^/ (2)
   2: EXACT <Perl> (4)
   4: END (0)
anchored "Perl" at 0 (checking anchored noscan) anchored(SBOL) minlen 4
Matching REx "^Perl" against "Perl rules"
Intuit: trying to determine minimum start position...
  Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
   0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
   0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
   4 <Perl> < rules>         |   0| 4:END(0)
Match successful!
Y
Freeing REx: "^Perl"


"Final program" describes how the regex ended up being compiled. This isn't
really of interest here.

What is of interest is the parts between the two lines "Intuit". That's the
optimiser. For the case I specifically chose here, what we have is that

1) the optimiser knows that the fixed string "Perl" must be in target string
2) which it searches for and finds
3) at which point it calls the main engine (these lines):

   0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
   0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
   4 <Perl> < rules>         |   0| 4:END(0)

(that part isn't actually of interest here. I'm just noting it as "this is
the main engine, and comparable to what Rakudo does).


The interesting cases are things like:

$ perl -wle 'use re "debug"; print "Perl rules" =~ /er./ ? "Y" : "n"'
Compiling REx "er."
Final program:
   1: EXACT <er> (3)
   3: REG_ANY (4)
   4: END (0)
anchored "er" at 0 (checking anchored) minlen 3
Matching REx "er." against "Perl rules"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..9] gave 1
  Found anchored substr "er" at offset 1 (rx_origin now 1)...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 1
   1 <P> <erl rules>         |   0| 1:EXACT <er>(3)
   3 <Per> <l rules>         |   0| 3:REG_ANY(4)
   4 <Perl> < rules>         |   0| 4:END(0)
Match successful!
Y
Freeing REx: "er."


You can see that
1) the compiler records that the longest fixed string is 'er'
2) the optimiser looks for this before even hitting the real engine
3) the optimiser tells the real engine that it doesn't even need to consider
   trying to match at string offset 0


and this:

$ perl -wle 'use re "debug"; print "Perl rules" =~ /er/ ? "Y" : "n"'
Compiling REx "er"
Final program:
   1: EXACT <er> (3)
   3: END (0)
anchored "er" at 0 (checking anchored isall) minlen 2
Matching REx "er" against "Perl rules"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..10] gave 1
  Found anchored substr "er" at offset 1 (rx_origin now 1)...
  (multiline anchor test skipped)
  try at offset...
Intuit: Successfully guessed: match at offset 1
Y
Freeing REx: "er"


The optimiser matches, and knows that if its crib[1] matches, the entire
regex must match, so it doesn't even call into the engine.


and this:

$ perl -wle 'use re "debug"; print "Perl rules" =~ /Good *, #MoarVM/ ? "Y" : "n"'
Compiling REx "Good *, #MoarVM"
Final program:
   1: EXACT <Good> (3)
   3: STAR (6)
   4:   EXACT < > (0)
   6: EXACT <, #MoarVM> (10)
  10: END (0)
anchored "Good" at 0 floating ", #MoarVM" at 4..9223372036854775807 (checking floating) minlen 13
n
Freeing REx: "Good *, #MoarVM"

The optimiser knows that the pattern can't match less than 13 characters, but
the offered string is too short.

The optimiser also knows at what offsets the within the target string the
fixed string must lie, if there is also some variable length stuff, but this
is getting beyond what I know the plan for:

$ perl -wle 'use re "debug"; print "Perl rules" =~ /Perl.*s/ ? "Y" : "n"'
Compiling REx "Perl.*s"
Final program:
   1: EXACT <Perl> (3)
   3: STAR (5)
   4:   REG_ANY (0)
   5: EXACT <s> (7)
   7: END (0)
anchored "Perl" at 0 floating "s" at 4..9223372036854775807 (checking anchored) minlen 5
Matching REx "Perl.*s" against "Perl rules"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..9] gave 0
  Found anchored substr "Perl" at offset 0 (rx_origin now 0)...
  doing 'other' fbm scan, [4..10] gave 9
  Found floating substr "s" at offset 9 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
   0 <> <Perl rules>         |   0| 1:EXACT <Perl>(3)
   4 <Perl> < rules>         |   0| 3:STAR(5)
                             |   0| REG_ANY can match 6 times out of 2147483647...
   9 <Perl rule> <s>         |   1|  5:EXACT <s>(7)
  10 <Perl rules> <>         |   1|  7:END(0)
Match successful!
Y
Freeing REx: "Perl.*s"


If you're wearing Peril Sensitive Sunglasses, then the code that does this is
the self-recursive Perl_re_intuit_start() in
https://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c

As a guide to reading that, for things like

        DEBUG_EXECUTE_r(Perl_re_printf( aTHX_
			      "  String too short...\n"));


DEBUG_EXECUTE_r() only generates that debugging output if the command line
flag -Dr is set, *and* the `use re "debug";` de facto sets -Dr, even on a
perl binary compiled without all the other -D flags enabled.


I know that PCRE also exposes some information it derives from the pattern
which can be used to pre-match -- see PCRE_INFO_FIRSTCHARACTER and
PCRE_INFO_REQUIREDCHAR in
https://www.pcre.org/original/doc/html/pcre_fullinfo.html

I *thought* that Google's re2 library (which came out of code search) could
return the longest fixed string (if any) as part of its API, but I can't find
that now. In https://swtch.com/~rsc/regexp/regexp4.html it mentions building
trigrams (directly) from parsing the regex (with ANDs and ORs) which is what
some $ork code is doing (but then using PCRE, not re2).

Anyway, I hope that this helps someone, and helps make Rakudo's parser faster

I assume that the useful strategy is roughly

0) Do the most simple thing - no alternations, anchored starts. Reject match
   if not present
1) Then 1+ character fixed substrings - Reject match if not present
2) Then the big cheat of "string is actually fixed" - compile to a sub that
   build a match result immediately, instead of the more general engine
3) Then the even more general "fixed string as part of larger regex", where
   failure is a fast-path, but otherwise the regular regex engine follows
4) Minimum match length
5) The stuff where one starts to tell the main engine how


but of course no plan survives contact with the enemy.

Nicholas Clark

1: https://en.wiktionary.org/wiki/crib -- definitions 14 and 23:
   (usually in the plural) A collection of quotes or references for
   use in speaking, for assembling a written document, or as an aid to
   a project of some sort; a crib sheet.
   [and]
   A cheat sheet or past test used by students; crib sheet.
0
nick
8/13/2019 9:36:51 AM
perl.perl6.compiler 1228 articles. 0 followers. Follow

5 Replies
7 Views

Similar Articles

[PageSpeed] 21

--000000000000dc826305900081cf
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I would like to point out that regexes in Perl6 are treated as code.
(They can even have parameters and lexical variables.)

It uses the same compiler intrinsics as the rest of Perl6.
It uses the same VM opcodes as regular Perl6 code.

Regexes, string literals, and signatures are just domain specific
sub-languages.

Perl6 regexes are also missing features that are common to other regexes,
instead you just embed regular Perl6 code to handle it.

For example instead of `(?=3D=E2=80=A6)` Perl6 uses a method call to `befor=
e` with a
regex argument.

    <.before =E2=80=A6>

    use v5;
    'abcd' =3D~ / . (?=3Dc) . /x # "bc"

    use v6;
    'abcd' ~~ / . <.before( /c/ )> .  / # "bc"
    'abcd' ~~ / . <.before   c   > .  / # "bc" # (exactly identical)

A person could change the code in the `before` method to have it do
something different

This means that the biggest wins in the Perl6 regex system are probably
going to look nothing like the ones in other regex engines.
At least in the short term.
Inlining method calls for instance.
Perhaps the biggest one may be the one about passing around =E2=80=9Cfates=
=E2=80=9D. (I
barely understand the basics of this.)

So while information on how other projects do regex optimizations can help,
they are generally going to be small wins in comparison.
(That doesn't mean they aren't worth doing.)

Doing some of those optimizations may also make other bigger optimizations
harder to do.
(I'm fairly sure the =E2=80=9Cfates=E2=80=9D optimization would be harder, =
but I'm not
sure.)

On Tue, Aug 13, 2019 at 4:53 AM Nicholas Clark <nick@ccl4.org> wrote:

> I'm cheating here - I'm using an e-mail message as a way to publish some
> notes
> for Jonathan (or *anyone else interested*):
>
> Jonathan did a talk in Riga, Perl 6 performance update,
> https://perlcon.eu/talk/80
> The (re-uploaded) live stream is at
> https://www.youtube.com/watch?v=3DS5iVBlk7pdg#t=3D4h39m
>
> One thing that this talk revealed is that (currently) Perl 5 beats
> Rakudo on various regular expressions. What *I* know is that this is
> because
> Perl 5 cheats - it has an "optimiser", which happens to automatically do
> what jnthn then showed manually implemented in some of his benchmarks.
>
> You can see what the Perl 5 regex optimiser is doing by using the re
> pragma.
> I'm no expert on this thing, but playing with it, I can see that it can d=
o
> various things
>
> 0) Know that it can't do anything useful (it is skipped)
> 1) Know that a fixed string must be present
>    Look for it (which is fast), not find it, and immediately return "no
> match"
> 2) Know that a fixed string must be present
>    Look for it find it, carry on into the real engine
>    (which arguably is slower than no optimiser)
> 3) Know that a fixed string must be present, *and* is the entire match
>    Look for it (which is fast), find it, and immediately return "match".
>
> Clearly cases (1) and (3) are the big wins here. Case (3) was the one I h=
ad
> forgotten about - the cheating is so good that often the engine is never
> entered.
>
>
> So, if you run this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /^Perl/ ? "Y" : "n"'
> Compiling REx "^Perl"
> Final program:
>    1: SBOL /^/ (2)
>    2: EXACT <Perl> (4)
>    4: END (0)
> anchored "Perl" at 0 (checking anchored noscan) anchored(SBOL) minlen 4
> Matching REx "^Perl" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   Looking for check substr at fixed offset 0...
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "^Perl"
>
>
> "Final program" describes how the regex ended up being compiled. This isn=
't
> really of interest here.
>
> What is of interest is the parts between the two lines "Intuit". That's t=
he
> optimiser. For the case I specifically chose here, what we have is that
>
> 1) the optimiser knows that the fixed string "Perl" must be in target
> string
> 2) which it searches for and finds
> 3) at which point it calls the main engine (these lines):
>
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
>
> (that part isn't actually of interest here. I'm just noting it as "this i=
s
> the main engine, and comparable to what Rakudo does).
>
>
> The interesting cases are things like:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /er./ ? "Y" : "n"'
> Compiling REx "er."
> Final program:
>    1: EXACT <er> (3)
>    3: REG_ANY (4)
>    4: END (0)
> anchored "er" at 0 (checking anchored) minlen 3
> Matching REx "er." against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
>    1 <P> <erl rules>         |   0| 1:EXACT <er>(3)
>    3 <Per> <l rules>         |   0| 3:REG_ANY(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "er."
>
>
> You can see that
> 1) the compiler records that the longest fixed string is 'er'
> 2) the optimiser looks for this before even hitting the real engine
> 3) the optimiser tells the real engine that it doesn't even need to
> consider
>    trying to match at string offset 0
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /er/ ? "Y" : "n"'
> Compiling REx "er"
> Final program:
>    1: EXACT <er> (3)
>    3: END (0)
> anchored "er" at 0 (checking anchored isall) minlen 2
> Matching REx "er" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..10] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
> Y
> Freeing REx: "er"
>
>
> The optimiser matches, and knows that if its crib[1] matches, the entire
> regex must match, so it doesn't even call into the engine.
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /Good *, #MoarVM/ ? =
"Y"
> : "n"'
> Compiling REx "Good *, #MoarVM"
> Final program:
>    1: EXACT <Good> (3)
>    3: STAR (6)
>    4:   EXACT < > (0)
>    6: EXACT <, #MoarVM> (10)
>   10: END (0)
> anchored "Good" at 0 floating ", #MoarVM" at 4..9223372036854775807
> (checking floating) minlen 13
> n
> Freeing REx: "Good *, #MoarVM"
>
> The optimiser knows that the pattern can't match less than 13 characters,
> but
> the offered string is too short.
>
> The optimiser also knows at what offsets the within the target string the
> fixed string must lie, if there is also some variable length stuff, but
> this
> is getting beyond what I know the plan for:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /Perl.*s/ ? "Y" : "n=
"'
> Compiling REx "Perl.*s"
> Final program:
>    1: EXACT <Perl> (3)
>    3: STAR (5)
>    4:   REG_ANY (0)
>    5: EXACT <s> (7)
>    7: END (0)
> anchored "Perl" at 0 floating "s" at 4..9223372036854775807 (checking
> anchored) minlen 5
> Matching REx "Perl.*s" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 0
>   Found anchored substr "Perl" at offset 0 (rx_origin now 0)...
>   doing 'other' fbm scan, [4..10] gave 9
>   Found floating substr "s" at offset 9 (rx_origin now 0)...
>   (multiline anchor test skipped)
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:EXACT <Perl>(3)
>    4 <Perl> < rules>         |   0| 3:STAR(5)
>                              |   0| REG_ANY can match 6 times out of
> 2147483647...
>    9 <Perl rule> <s>         |   1|  5:EXACT <s>(7)
>   10 <Perl rules> <>         |   1|  7:END(0)
> Match successful!
> Y
> Freeing REx: "Perl.*s"
>
>
> If you're wearing Peril Sensitive Sunglasses, then the code that does thi=
s
> is
> the self-recursive Perl_re_intuit_start() in
> https://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c
>
> As a guide to reading that, for things like
>
>         DEBUG_EXECUTE_r(Perl_re_printf( aTHX_
>                               "  String too short...\n"));
>
>
> DEBUG_EXECUTE_r() only generates that debugging output if the command lin=
e
> flag -Dr is set, *and* the `use re "debug";` de facto sets -Dr, even on a
> perl binary compiled without all the other -D flags enabled.
>
>
> I know that PCRE also exposes some information it derives from the patter=
n
> which can be used to pre-match -- see PCRE_INFO_FIRSTCHARACTER and
> PCRE_INFO_REQUIREDCHAR in
> https://www.pcre.org/original/doc/html/pcre_fullinfo.html
>
> I *thought* that Google's re2 library (which came out of code search) cou=
ld
> return the longest fixed string (if any) as part of its API, but I can't
> find
> that now. In https://swtch.com/~rsc/regexp/regexp4.html it mentions
> building
> trigrams (directly) from parsing the regex (with ANDs and ORs) which is
> what
> some $ork code is doing (but then using PCRE, not re2).
>
> Anyway, I hope that this helps someone, and helps make Rakudo's parser
> faster
>
> I assume that the useful strategy is roughly
>
> 0) Do the most simple thing - no alternations, anchored starts. Reject
> match
>    if not present
> 1) Then 1+ character fixed substrings - Reject match if not present
> 2) Then the big cheat of "string is actually fixed" - compile to a sub th=
at
>    build a match result immediately, instead of the more general engine
> 3) Then the even more general "fixed string as part of larger regex", whe=
re
>    failure is a fast-path, but otherwise the regular regex engine follows
> 4) Minimum match length
> 5) The stuff where one starts to tell the main engine how
>
>
> but of course no plan survives contact with the enemy.
>
> Nicholas Clark
>
> 1: https://en.wiktionary.org/wiki/crib -- definitions 14 and 23:
>    (usually in the plural) A collection of quotes or references for
>    use in speaking, for assembling a written document, or as an aid to
>    a project of some sort; a crib sheet.
>    [and]
>    A cheat sheet or past test used by students; crib sheet.
>

--000000000000dc826305900081cf
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">I would like to point out that regexes in Perl6 are treate=
d as code.<div>(They can even have parameters and lexical variables.)</div>=
<div><div><br></div><div>It uses the same compiler intrinsics as the rest o=
f Perl6.</div><div>It uses the same VM opcodes as regular Perl6 code.</div>=
</div><div><br></div><div>Regexes, string literals, and signatures are just=
 domain specific sub-languages.</div><div><br></div><div>Perl6 regexes are =
also missing features that are common to other regexes, instead you just em=
bed regular Perl6 code to handle it.</div><div><br></div><div>For example i=
nstead of `(?=3D=E2=80=A6)` Perl6 uses a method call to `before` with a reg=
ex argument.</div><div><br></div><div>=C2=A0 =C2=A0 &lt;.before =E2=80=A6&g=
t;</div><div><br></div><div>=C2=A0 =C2=A0 use v5;</div><div>=C2=A0 =C2=A0 &=
#39;abcd&#39; =3D~ / . (?=3Dc) . /x # &quot;bc&quot;</div><div><br></div><d=
iv>=C2=A0 =C2=A0 use v6;</div><div>=C2=A0 =C2=A0 &#39;abcd&#39; ~~ / . &lt;=
..before( /c/ )&gt; .=C2=A0 / # &quot;bc&quot;</div>=C2=A0 =C2=A0 &#39;abcd&=
#39; ~~ / . &lt;.before=C2=A0 =C2=A0c=C2=A0 =C2=A0&gt; .=C2=A0 / # &quot;bc=
&quot; # (exactly identical)<div><br></div><div>A person could change the c=
ode in the `before` method to have it do something different</div><div></di=
v><div><br></div><div>This means that the biggest wins in the Perl6 regex s=
ystem are probably going to look nothing like the ones in other regex engin=
es.</div><div>At least in the short term.</div><div>Inlining method calls f=
or instance.</div><div>Perhaps the biggest one may be the one about passing=
 around =E2=80=9Cfates=E2=80=9D. (I barely understand the basics of this.)<=
/div><div><br></div><div>So while information on how other projects do rege=
x optimizations can help, they are generally going to be small wins in comp=
arison.<br></div><div>(That doesn&#39;t mean they aren&#39;t worth doing.)<=
/div><div><br></div><div>Doing some of those optimizations may also make ot=
her bigger optimizations harder to do.</div><div>(I&#39;m fairly sure the =
=E2=80=9Cfates=E2=80=9D optimization would be harder, but I&#39;m not sure.=
)</div></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail=
_attr">On Tue, Aug 13, 2019 at 4:53 AM Nicholas Clark &lt;<a href=3D"mailto=
:nick@ccl4.org">nick@ccl4.org</a>&gt; wrote:<br></div><blockquote class=3D"=
gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(20=
4,204,204);padding-left:1ex">I&#39;m cheating here - I&#39;m using an e-mai=
l message as a way to publish some notes<br>
for Jonathan (or *anyone else interested*):<br>
<br>
Jonathan did a talk in Riga, Perl 6 performance update,<br>
<a href=3D"https://perlcon.eu/talk/80" rel=3D"noreferrer" target=3D"_blank"=
>https://perlcon.eu/talk/80</a><br>
The (re-uploaded) live stream is at<br>
<a href=3D"https://www.youtube.com/watch?v=3DS5iVBlk7pdg#t=3D4h39m" rel=3D"=
noreferrer" target=3D"_blank">https://www.youtube.com/watch?v=3DS5iVBlk7pdg=
#t=3D4h39m</a><br>
<br>
One thing that this talk revealed is that (currently) Perl 5 beats<br>
Rakudo on various regular expressions. What *I* know is that this is becaus=
e<br>
Perl 5 cheats - it has an &quot;optimiser&quot;, which happens to automatic=
ally do<br>
what jnthn then showed manually implemented in some of his benchmarks.<br>
<br>
You can see what the Perl 5 regex optimiser is doing by using the re pragma=
..<br>
I&#39;m no expert on this thing, but playing with it, I can see that it can=
 do<br>
various things<br>
<br>
0) Know that it can&#39;t do anything useful (it is skipped)<br>
1) Know that a fixed string must be present<br>
=C2=A0 =C2=A0Look for it (which is fast), not find it, and immediately retu=
rn &quot;no match&quot;<br>
2) Know that a fixed string must be present<br>
=C2=A0 =C2=A0Look for it find it, carry on into the real engine<br>
=C2=A0 =C2=A0(which arguably is slower than no optimiser)<br>
3) Know that a fixed string must be present, *and* is the entire match<br>
=C2=A0 =C2=A0Look for it (which is fast), find it, and immediately return &=
quot;match&quot;.<br>
<br>
Clearly cases (1) and (3) are the big wins here. Case (3) was the one I had=
<br>
forgotten about - the cheating is so good that often the engine is never<br=
>
entered.<br>
<br>
<br>
So, if you run this:<br>
<br>
$ perl -wle &#39;use re &quot;debug&quot;; print &quot;Perl rules&quot; =3D=
~ /^Perl/ ? &quot;Y&quot; : &quot;n&quot;&#39;<br>
Compiling REx &quot;^Perl&quot;<br>
Final program:<br>
=C2=A0 =C2=A01: SBOL /^/ (2)<br>
=C2=A0 =C2=A02: EXACT &lt;Perl&gt; (4)<br>
=C2=A0 =C2=A04: END (0)<br>
anchored &quot;Perl&quot; at 0 (checking anchored noscan) anchored(SBOL) mi=
nlen 4<br>
Matching REx &quot;^Perl&quot; against &quot;Perl rules&quot;<br>
Intuit: trying to determine minimum start position...<br>
=C2=A0 Looking for check substr at fixed offset 0...<br>
Intuit: Successfully guessed: match at offset 0<br>
=C2=A0 =C2=A00 &lt;&gt; &lt;Perl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 1:SBOL /^/(2)<br>
=C2=A0 =C2=A00 &lt;&gt; &lt;Perl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 2:EXACT &lt;Perl&gt;(4)<br>
=C2=A0 =C2=A04 &lt;Perl&gt; &lt; rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 4:END(0)<br>
Match successful!<br>
Y<br>
Freeing REx: &quot;^Perl&quot;<br>
<br>
<br>
&quot;Final program&quot; describes how the regex ended up being compiled. =
This isn&#39;t<br>
really of interest here.<br>
<br>
What is of interest is the parts between the two lines &quot;Intuit&quot;. =
That&#39;s the<br>
optimiser. For the case I specifically chose here, what we have is that<br>
<br>
1) the optimiser knows that the fixed string &quot;Perl&quot; must be in ta=
rget string<br>
2) which it searches for and finds<br>
3) at which point it calls the main engine (these lines):<br>
<br>
=C2=A0 =C2=A00 &lt;&gt; &lt;Perl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 1:SBOL /^/(2)<br>
=C2=A0 =C2=A00 &lt;&gt; &lt;Perl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 2:EXACT &lt;Perl&gt;(4)<br>
=C2=A0 =C2=A04 &lt;Perl&gt; &lt; rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 4:END(0)<br>
<br>
(that part isn&#39;t actually of interest here. I&#39;m just noting it as &=
quot;this is<br>
the main engine, and comparable to what Rakudo does).<br>
<br>
<br>
The interesting cases are things like:<br>
<br>
$ perl -wle &#39;use re &quot;debug&quot;; print &quot;Perl rules&quot; =3D=
~ /er./ ? &quot;Y&quot; : &quot;n&quot;&#39;<br>
Compiling REx &quot;er.&quot;<br>
Final program:<br>
=C2=A0 =C2=A01: EXACT &lt;er&gt; (3)<br>
=C2=A0 =C2=A03: REG_ANY (4)<br>
=C2=A0 =C2=A04: END (0)<br>
anchored &quot;er&quot; at 0 (checking anchored) minlen 3<br>
Matching REx &quot;er.&quot; against &quot;Perl rules&quot;<br>
Intuit: trying to determine minimum start position...<br>
=C2=A0 doing &#39;check&#39; fbm scan, [0..9] gave 1<br>
=C2=A0 Found anchored substr &quot;er&quot; at offset 1 (rx_origin now 1)..=
..<br>
=C2=A0 (multiline anchor test skipped)<br>
=C2=A0 try at offset...<br>
Intuit: Successfully guessed: match at offset 1<br>
=C2=A0 =C2=A01 &lt;P&gt; &lt;erl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 1:EXACT &lt;er&gt;(3)<br>
=C2=A0 =C2=A03 &lt;Per&gt; &lt;l rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 3:REG_ANY(4)<br>
=C2=A0 =C2=A04 &lt;Perl&gt; &lt; rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 4:END(0)<br>
Match successful!<br>
Y<br>
Freeing REx: &quot;er.&quot;<br>
<br>
<br>
You can see that<br>
1) the compiler records that the longest fixed string is &#39;er&#39;<br>
2) the optimiser looks for this before even hitting the real engine<br>
3) the optimiser tells the real engine that it doesn&#39;t even need to con=
sider<br>
=C2=A0 =C2=A0trying to match at string offset 0<br>
<br>
<br>
and this:<br>
<br>
$ perl -wle &#39;use re &quot;debug&quot;; print &quot;Perl rules&quot; =3D=
~ /er/ ? &quot;Y&quot; : &quot;n&quot;&#39;<br>
Compiling REx &quot;er&quot;<br>
Final program:<br>
=C2=A0 =C2=A01: EXACT &lt;er&gt; (3)<br>
=C2=A0 =C2=A03: END (0)<br>
anchored &quot;er&quot; at 0 (checking anchored isall) minlen 2<br>
Matching REx &quot;er&quot; against &quot;Perl rules&quot;<br>
Intuit: trying to determine minimum start position...<br>
=C2=A0 doing &#39;check&#39; fbm scan, [0..10] gave 1<br>
=C2=A0 Found anchored substr &quot;er&quot; at offset 1 (rx_origin now 1)..=
..<br>
=C2=A0 (multiline anchor test skipped)<br>
=C2=A0 try at offset...<br>
Intuit: Successfully guessed: match at offset 1<br>
Y<br>
Freeing REx: &quot;er&quot;<br>
<br>
<br>
The optimiser matches, and knows that if its crib[1] matches, the entire<br=
>
regex must match, so it doesn&#39;t even call into the engine.<br>
<br>
<br>
and this:<br>
<br>
$ perl -wle &#39;use re &quot;debug&quot;; print &quot;Perl rules&quot; =3D=
~ /Good *, #MoarVM/ ? &quot;Y&quot; : &quot;n&quot;&#39;<br>
Compiling REx &quot;Good *, #MoarVM&quot;<br>
Final program:<br>
=C2=A0 =C2=A01: EXACT &lt;Good&gt; (3)<br>
=C2=A0 =C2=A03: STAR (6)<br>
=C2=A0 =C2=A04:=C2=A0 =C2=A0EXACT &lt; &gt; (0)<br>
=C2=A0 =C2=A06: EXACT &lt;, #MoarVM&gt; (10)<br>
=C2=A0 10: END (0)<br>
anchored &quot;Good&quot; at 0 floating &quot;, #MoarVM&quot; at 4..9223372=
036854775807 (checking floating) minlen 13<br>
n<br>
Freeing REx: &quot;Good *, #MoarVM&quot;<br>
<br>
The optimiser knows that the pattern can&#39;t match less than 13 character=
s, but<br>
the offered string is too short.<br>
<br>
The optimiser also knows at what offsets the within the target string the<b=
r>
fixed string must lie, if there is also some variable length stuff, but thi=
s<br>
is getting beyond what I know the plan for:<br>
<br>
$ perl -wle &#39;use re &quot;debug&quot;; print &quot;Perl rules&quot; =3D=
~ /Perl.*s/ ? &quot;Y&quot; : &quot;n&quot;&#39;<br>
Compiling REx &quot;Perl.*s&quot;<br>
Final program:<br>
=C2=A0 =C2=A01: EXACT &lt;Perl&gt; (3)<br>
=C2=A0 =C2=A03: STAR (5)<br>
=C2=A0 =C2=A04:=C2=A0 =C2=A0REG_ANY (0)<br>
=C2=A0 =C2=A05: EXACT &lt;s&gt; (7)<br>
=C2=A0 =C2=A07: END (0)<br>
anchored &quot;Perl&quot; at 0 floating &quot;s&quot; at 4..922337203685477=
5807 (checking anchored) minlen 5<br>
Matching REx &quot;Perl.*s&quot; against &quot;Perl rules&quot;<br>
Intuit: trying to determine minimum start position...<br>
=C2=A0 doing &#39;check&#39; fbm scan, [0..9] gave 0<br>
=C2=A0 Found anchored substr &quot;Perl&quot; at offset 0 (rx_origin now 0)=
....<br>
=C2=A0 doing &#39;other&#39; fbm scan, [4..10] gave 9<br>
=C2=A0 Found floating substr &quot;s&quot; at offset 9 (rx_origin now 0)...=
<br>
=C2=A0 (multiline anchor test skipped)<br>
Intuit: Successfully guessed: match at offset 0<br>
=C2=A0 =C2=A00 &lt;&gt; &lt;Perl rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 1:EXACT &lt;Perl&gt;(3)<br>
=C2=A0 =C2=A04 &lt;Perl&gt; &lt; rules&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A00| 3:STAR(5)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A00| REG_ANY can match 6 times o=
ut of 2147483647...<br>
=C2=A0 =C2=A09 &lt;Perl rule&gt; &lt;s&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0|=C2=A0 =C2=A01|=C2=A0 5:EXACT &lt;s&gt;(7)<br>
=C2=A0 10 &lt;Perl rules&gt; &lt;&gt;=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=
=A0 =C2=A01|=C2=A0 7:END(0)<br>
Match successful!<br>
Y<br>
Freeing REx: &quot;Perl.*s&quot;<br>
<br>
<br>
If you&#39;re wearing Peril Sensitive Sunglasses, then the code that does t=
his is<br>
the self-recursive Perl_re_intuit_start() in<br>
<a href=3D"https://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c" rel=3D=
"noreferrer" target=3D"_blank">https://perl5.git.perl.org/perl.git/blob/HEA=
D:/regexec.c</a><br>
<br>
As a guide to reading that, for things like<br>
<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 DEBUG_EXECUTE_r(Perl_re_printf( aTHX_<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 &quot;=C2=A0 String too short...\n&quot;));=
<br>
<br>
<br>
DEBUG_EXECUTE_r() only generates that debugging output if the command line<=
br>
flag -Dr is set, *and* the `use re &quot;debug&quot;;` de facto sets -Dr, e=
ven on a<br>
perl binary compiled without all the other -D flags enabled.<br>
<br>
<br>
I know that PCRE also exposes some information it derives from the pattern<=
br>
which can be used to pre-match -- see PCRE_INFO_FIRSTCHARACTER and<br>
PCRE_INFO_REQUIREDCHAR in<br>
<a href=3D"https://www.pcre.org/original/doc/html/pcre_fullinfo.html" rel=
=3D"noreferrer" target=3D"_blank">https://www.pcre.org/original/doc/html/pc=
re_fullinfo.html</a><br>
<br>
I *thought* that Google&#39;s re2 library (which came out of code search) c=
ould<br>
return the longest fixed string (if any) as part of its API, but I can&#39;=
t find<br>
that now. In <a href=3D"https://swtch.com/~rsc/regexp/regexp4.html" rel=3D"=
noreferrer" target=3D"_blank">https://swtch.com/~rsc/regexp/regexp4.html</a=
> it mentions building<br>
trigrams (directly) from parsing the regex (with ANDs and ORs) which is wha=
t<br>
some $ork code is doing (but then using PCRE, not re2).<br>
<br>
Anyway, I hope that this helps someone, and helps make Rakudo&#39;s parser =
faster<br>
<br>
I assume that the useful strategy is roughly<br>
<br>
0) Do the most simple thing - no alternations, anchored starts. Reject matc=
h<br>
=C2=A0 =C2=A0if not present<br>
1) Then 1+ character fixed substrings - Reject match if not present<br>
2) Then the big cheat of &quot;string is actually fixed&quot; - compile to =
a sub that<br>
=C2=A0 =C2=A0build a match result immediately, instead of the more general =
engine<br>
3) Then the even more general &quot;fixed string as part of larger regex&qu=
ot;, where<br>
=C2=A0 =C2=A0failure is a fast-path, but otherwise the regular regex engine=
 follows<br>
4) Minimum match length<br>
5) The stuff where one starts to tell the main engine how<br>
<br>
<br>
but of course no plan survives contact with the enemy.<br>
<br>
Nicholas Clark<br>
<br>
1: <a href=3D"https://en.wiktionary.org/wiki/crib" rel=3D"noreferrer" targe=
t=3D"_blank">https://en.wiktionary.org/wiki/crib</a> -- definitions 14 and =
23:<br>
=C2=A0 =C2=A0(usually in the plural) A collection of quotes or references f=
or<br>
=C2=A0 =C2=A0use in speaking, for assembling a written document, or as an a=
id to<br>
=C2=A0 =C2=A0a project of some sort; a crib sheet.<br>
=C2=A0 =C2=A0[and]<br>
=C2=A0 =C2=A0A cheat sheet or past test used by students; crib sheet.<br>
</blockquote></div>

--000000000000dc826305900081cf--
0
b2gills
8/13/2019 2:32:05 PM
> =C2=A0 =C2=A0 use v6;
> =C2=A0 =C2=A0 'abcd' ~~ / . <.before( /c/ )> .=C2=A0 / # "bc"
> =C2=A0 =C2=A0 'abcd' ~~ / . <.before=C2=A0 =C2=A0c=C2=A0 =C2=A0> .=C2=A0=
 / # "bc" # (exactly identical)
>
> A person could change the code in the `before` method to have it do
> something different


At least at the moment, that's not 100% accurate (only by virtue of
leaving out a piece):

timo@schmand ~> perl6 -e 'grammar test { regex before($re) { { say "yo"
} }; regex TOP { <.before "hi"> } }; test.parse("hi")'
yo
Too few positionals passed; expected 2 arguments but got 1
=C2=A0 in regex before at -e line 1
=C2=A0 in regex TOP at -e line 1
=C2=A0 in block <unit> at -e line 1

timo@schmand ~ [1]> perl6 -e 'grammar test { regex before($re) { { say
"yo" } }; regex TOP { <before "hi"> } }; test.parse("hi")'
yo
Too few positionals passed; expected 2 arguments but got 1
=C2=A0 in regex before at -e line 1
=C2=A0 in regex TOP at -e line 1
=C2=A0 in block <unit> at -e line 1

timo@schmand ~ [1]> perl6 -e 'grammar test { regex before($re) { { say
"yo" } }; regex TOP { <?before "hi"> } }; test.parse("hi")'

As you can see, using <?before> gives you the "real" lookahead assertion.

I'm not sure if that's part of the language yet, but should probably
made to be.


This will allow more optimization efforts.
0
timo
8/13/2019 3:01:06 PM
FWIW, at one time there was discussion that "<before>" and "<after>"=20
are actually keywords and not typical method calls that can be overridden=
,
precisely so optimizations can be made.  They're that important to
efficient running of the regexes.


I'm not sure that a formal decision was ever made on this, however.
(I'd be okay with declaring them as keywords that cannot be overridden.)

Pm

On Tue, Aug 13, 2019 at 05:01:06PM +0200, Timo Paulssen wrote:
>=20
> > =A0 =A0 use v6;
> > =A0 =A0 'abcd' ~~ / . <.before( /c/ )> .=A0 / # "bc"
> > =A0 =A0 'abcd' ~~ / . <.before=A0 =A0c=A0 =A0> .=A0 / # "bc" # (exact=
ly identical)
> >
> > A person could change the code in the `before` method to have it do
> > something different
>=20
>=20
> At least at the moment, that's not 100% accurate (only by virtue of
> leaving out a piece):
>=20
> timo@schmand ~> perl6 -e 'grammar test { regex before($re) { { say "yo"
> } }; regex TOP { <.before "hi"> } }; test.parse("hi")'
> yo
> Too few positionals passed; expected 2 arguments but got 1
> =A0 in regex before at -e line 1
> =A0 in regex TOP at -e line 1
> =A0 in block <unit> at -e line 1
>=20
> timo@schmand ~ [1]> perl6 -e 'grammar test { regex before($re) { { say
> "yo" } }; regex TOP { <before "hi"> } }; test.parse("hi")'
> yo
> Too few positionals passed; expected 2 arguments but got 1
> =A0 in regex before at -e line 1
> =A0 in regex TOP at -e line 1
> =A0 in block <unit> at -e line 1
>=20
> timo@schmand ~ [1]> perl6 -e 'grammar test { regex before($re) { { say
> "yo" } }; regex TOP { <?before "hi"> } }; test.parse("hi")'
>=20
> As you can see, using <?before> gives you the "real" lookahead assertio=
n.
>=20
> I'm not sure if that's part of the language yet, but should probably
> made to be.
>=20
>=20
> This will allow more optimization efforts.
0
pmichaud
8/13/2019 3:15:10 PM
Here's some stuff that anybody who wants to work on regex optimization
in perl6 will want to know:

You can get a print-out of rakudo's internal AST that is generated for
the regex by passing --target=3Dast or --target=3Doptimize to the perl6
commandline. I recommend grep -C5 Regex to skip some of the boilerplate
that's always there, but not important for regex stuff.

Here's an example of the output for the optimized p6 regex /foo<?before
bar>/:

> - QAST::Regex(:rxtype(concat) :subtype())
> =C2=A0 - QAST::Regex(:rxtype(scan) :subtype())
> =C2=A0=C2=A0=C2=A0 - foo
> =C2=A0 - QAST::Regex(:rxtype(concat) :subtype())=C2=A0 foo<?before bar>
> =C2=A0=C2=A0=C2=A0 - QAST::Regex(:rxtype(literal) :subtype())=C2=A0 f
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - foo
> =C2=A0=C2=A0=C2=A0 - QAST::Regex(:rxtype(literal) :subtype(zerowidth))=C2=
=A0 b
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - bar
> =C2=A0 - QAST::Regex(:rxtype(pass) :subtype())

And here it is before optimization:

> - QAST::Regex(:rxtype(concat) :subtype())
> =C2=A0 - QAST::Regex(:rxtype(scan) :subtype())
> =C2=A0=C2=A0=C2=A0 - foo
> =C2=A0 - QAST::Regex(:rxtype(concat) :subtype())=C2=A0 foo<?before bar>
> =C2=A0=C2=A0=C2=A0 - QAST::Regex(:rxtype(literal) :subtype())=C2=A0 f
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - foo
> =C2=A0=C2=A0=C2=A0 - QAST::Regex(:rxtype(subrule) :subtype(zerowidth) :=
name(before))=C2=A0
> <?before bar>
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::NodeList
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::SVal(before)
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::Block(:cuid(1))=C2=A0=
 :orig_qast<?> :code_object<?>
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::Var(loca=
l self :decl(param))
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::Var(lexi=
cal $=C2=A2 :decl(var))
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QAST::Regex(:r=
xtype(concat) :subtype())
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QA=
ST::Regex(:rxtype(scan) :subtype())
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0 - bar
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QA=
ST::Regex(:rxtype(concat) :subtype())=C2=A0 bar
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0 - QAST::Regex(:rxtype(literal) :subtype())=C2=A0 b
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0 - bar
> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 - QA=
ST::Regex(:rxtype(pass) :subtype())
> =C2=A0 - QAST::Regex(:rxtype(pass) :subtype())

As you can see, it shrunk quite a bit; Before optimization, the "bar"
inside of <?before bar> was put into the regex as a subrule with its own
complete block with lexical variables and everything. On top of that,
there was a scan operation there, which is normal for regexes in
general, but not useful for a lookahead or lookbehind (since they are
anchored implicitly). The optimizer simply turns all of that into a
"literal" "zerowidth" for "bar".

If you want to see how all that is currently implemented, you can look
at nqp's QRegex::Optimizer. The method "simplify_assertion" is probably
what does the optimization i outlined in the previous paragraph:
https://github.com/perl6/nqp/blob/master/src/QRegex/P6Regex/Optimizer.nqp=
#L88

Here's a perl6 one-liner that lets you eval a piece of perl6 code and
get the optimized QAST result as a tree of QAST nodes:

> perl6 -e 'use Perl6::Compiler:from<NQP>; use nqp; say
> nqp::getcomp("perl6").eval(Q[/foo/], :target<optimize>).dump'
Perhaps prototyping optimizations would be easiest in perl6 and later
translated into nqp.

If you want to hack on the "actual" regex optimizer, I recommend writing
your test code as nqp scripts; nqp is quickly recompiled to have the new
optimizer in it, and it applies the same way to perl6 regexes as it does
to nqp regexes. Running a rakudo with the changed optimizer requires a
full rebuild of rakudo, which of course includes the core setting.

The main optimization opportunities I can see for perl6 regexes are
exactly what nick points out in his mail:

If you can find out that a regex can only match if the target string
contains a certain substring (or even a short list of substrings that
can be or'd together would be interesting), you can generate code that
does a quick first check.

If you can prove that a specific part of the regex has a certain minimum
length, any scan through the target string can be aborted further from
the end. If the regex starts with a lookbehind that has a known minimum
length, the regex match attempt doesn't have to start at the very
beginning of the string.

I'm not entirely sure how we should go about skipping the regex engine
when a "simple match" can quickly be created, though.

On 13/08/2019 11:36, Nicholas Clark wrote:
> I'm cheating here - I'm using an e-mail message as a way to publish som=
e notes
> for Jonathan (or *anyone else interested*):
>
> Jonathan did a talk in Riga, Perl 6 performance update,
> https://perlcon.eu/talk/80
> The (re-uploaded) live stream is at
> https://www.youtube.com/watch?v=3DS5iVBlk7pdg#t=3D4h39m
>
> One thing that this talk revealed is that (currently) Perl 5 beats
> Rakudo on various regular expressions. What *I* know is that this is be=
cause
> Perl 5 cheats - it has an "optimiser", which happens to automatically d=
o
> what jnthn then showed manually implemented in some of his benchmarks.
>
> You can see what the Perl 5 regex optimiser is doing by using the re pr=
agma.
> I'm no expert on this thing, but playing with it, I can see that it can=
 do
> various things
>
> 0) Know that it can't do anything useful (it is skipped)
> 1) Know that a fixed string must be present
>    Look for it (which is fast), not find it, and immediately return "no=
 match"
> 2) Know that a fixed string must be present
>    Look for it find it, carry on into the real engine
>    (which arguably is slower than no optimiser)
> 3) Know that a fixed string must be present, *and* is the entire match
>    Look for it (which is fast), find it, and immediately return "match"=
..
>
> Clearly cases (1) and (3) are the big wins here. Case (3) was the one I=
 had
> forgotten about - the cheating is so good that often the engine is neve=
r
> entered.
>   =20
>
> So, if you run this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /^Perl/ ? "Y" : "n=
"'
> Compiling REx "^Perl"
> Final program:
>    1: SBOL /^/ (2)
>    2: EXACT <Perl> (4)
>    4: END (0)
> anchored "Perl" at 0 (checking anchored noscan) anchored(SBOL) minlen 4
> Matching REx "^Perl" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   Looking for check substr at fixed offset 0...
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "^Perl"
>
>
> "Final program" describes how the regex ended up being compiled. This i=
sn't
> really of interest here.
>
> What is of interest is the parts between the two lines "Intuit". That's=
 the
> optimiser. For the case I specifically chose here, what we have is that
>
> 1) the optimiser knows that the fixed string "Perl" must be in target s=
tring
> 2) which it searches for and finds
> 3) at which point it calls the main engine (these lines):
>
>    0 <> <Perl rules>         |   0| 1:SBOL /^/(2)
>    0 <> <Perl rules>         |   0| 2:EXACT <Perl>(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
>
> (that part isn't actually of interest here. I'm just noting it as "this=
 is
> the main engine, and comparable to what Rakudo does).
>
>
> The interesting cases are things like:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /er./ ? "Y" : "n"'
> Compiling REx "er."
> Final program:
>    1: EXACT <er> (3)
>    3: REG_ANY (4)
>    4: END (0)
> anchored "er" at 0 (checking anchored) minlen 3
> Matching REx "er." against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
>    1 <P> <erl rules>         |   0| 1:EXACT <er>(3)
>    3 <Per> <l rules>         |   0| 3:REG_ANY(4)
>    4 <Perl> < rules>         |   0| 4:END(0)
> Match successful!
> Y
> Freeing REx: "er."
>
>
> You can see that
> 1) the compiler records that the longest fixed string is 'er'
> 2) the optimiser looks for this before even hitting the real engine
> 3) the optimiser tells the real engine that it doesn't even need to con=
sider
>    trying to match at string offset 0
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /er/ ? "Y" : "n"'
> Compiling REx "er"
> Final program:
>    1: EXACT <er> (3)
>    3: END (0)
> anchored "er" at 0 (checking anchored isall) minlen 2
> Matching REx "er" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..10] gave 1
>   Found anchored substr "er" at offset 1 (rx_origin now 1)...
>   (multiline anchor test skipped)
>   try at offset...
> Intuit: Successfully guessed: match at offset 1
> Y
> Freeing REx: "er"
>
>
> The optimiser matches, and knows that if its crib[1] matches, the entir=
e
> regex must match, so it doesn't even call into the engine.
>
>
> and this:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /Good *, #MoarVM/ =
? "Y" : "n"'
> Compiling REx "Good *, #MoarVM"
> Final program:
>    1: EXACT <Good> (3)
>    3: STAR (6)
>    4:   EXACT < > (0)
>    6: EXACT <, #MoarVM> (10)
>   10: END (0)
> anchored "Good" at 0 floating ", #MoarVM" at 4..9223372036854775807 (ch=
ecking floating) minlen 13
> n
> Freeing REx: "Good *, #MoarVM"
>
> The optimiser knows that the pattern can't match less than 13 character=
s, but
> the offered string is too short.
>
> The optimiser also knows at what offsets the within the target string t=
he
> fixed string must lie, if there is also some variable length stuff, but=
 this
> is getting beyond what I know the plan for:
>
> $ perl -wle 'use re "debug"; print "Perl rules" =3D~ /Perl.*s/ ? "Y" : =
"n"'
> Compiling REx "Perl.*s"
> Final program:
>    1: EXACT <Perl> (3)
>    3: STAR (5)
>    4:   REG_ANY (0)
>    5: EXACT <s> (7)
>    7: END (0)
> anchored "Perl" at 0 floating "s" at 4..9223372036854775807 (checking a=
nchored) minlen 5
> Matching REx "Perl.*s" against "Perl rules"
> Intuit: trying to determine minimum start position...
>   doing 'check' fbm scan, [0..9] gave 0
>   Found anchored substr "Perl" at offset 0 (rx_origin now 0)...
>   doing 'other' fbm scan, [4..10] gave 9
>   Found floating substr "s" at offset 9 (rx_origin now 0)...
>   (multiline anchor test skipped)
> Intuit: Successfully guessed: match at offset 0
>    0 <> <Perl rules>         |   0| 1:EXACT <Perl>(3)
>    4 <Perl> < rules>         |   0| 3:STAR(5)
>                              |   0| REG_ANY can match 6 times out of 21=
47483647...
>    9 <Perl rule> <s>         |   1|  5:EXACT <s>(7)
>   10 <Perl rules> <>         |   1|  7:END(0)
> Match successful!
> Y
> Freeing REx: "Perl.*s"
>
>
> If you're wearing Peril Sensitive Sunglasses, then the code that does t=
his is
> the self-recursive Perl_re_intuit_start() in
> https://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c
>
> As a guide to reading that, for things like
>
>         DEBUG_EXECUTE_r(Perl_re_printf( aTHX_
> 			      "  String too short...\n"));
>
>
> DEBUG_EXECUTE_r() only generates that debugging output if the command l=
ine
> flag -Dr is set, *and* the `use re "debug";` de facto sets -Dr, even on=
 a
> perl binary compiled without all the other -D flags enabled.
>
>
> I know that PCRE also exposes some information it derives from the patt=
ern
> which can be used to pre-match -- see PCRE_INFO_FIRSTCHARACTER and
> PCRE_INFO_REQUIREDCHAR in
> https://www.pcre.org/original/doc/html/pcre_fullinfo.html
>
> I *thought* that Google's re2 library (which came out of code search) c=
ould
> return the longest fixed string (if any) as part of its API, but I can'=
t find
> that now. In https://swtch.com/~rsc/regexp/regexp4.html it mentions bui=
lding
> trigrams (directly) from parsing the regex (with ANDs and ORs) which is=
 what
> some $ork code is doing (but then using PCRE, not re2).
>
> Anyway, I hope that this helps someone, and helps make Rakudo's parser =
faster
>
> I assume that the useful strategy is roughly
>
> 0) Do the most simple thing - no alternations, anchored starts. Reject =
match
>    if not present
> 1) Then 1+ character fixed substrings - Reject match if not present
> 2) Then the big cheat of "string is actually fixed" - compile to a sub =
that
>    build a match result immediately, instead of the more general engine
> 3) Then the even more general "fixed string as part of larger regex", w=
here
>    failure is a fast-path, but otherwise the regular regex engine follo=
ws
> 4) Minimum match length
> 5) The stuff where one starts to tell the main engine how
>
>
> but of course no plan survives contact with the enemy.
>
> Nicholas Clark
>
> 1: https://en.wiktionary.org/wiki/crib -- definitions 14 and 23:
>    (usually in the plural) A collection of quotes or references for
>    use in speaking, for assembling a written document, or as an aid to
>    a project of some sort; a crib sheet.
>    [and]
>    A cheat sheet or past test used by students; crib sheet.
0
timo
8/13/2019 3:18:11 PM
--------------4B9573B1184738635415B44C
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 13/08/2019 16:32, Brad Gilbert wrote:

> Perhaps the biggest one may be the one about passing around =E2=80=9Cfa=
tes=E2=80=9D.
> (I barely understand the basics of this.)


The optimization opportunity Brad is refering to here is relevant mostly
to grammars with deeply nested multi-tokens:

Longest-Token-Matching is implemented by creating an NFA
(nondeterministic finite automaton) out of all the longest declarative
prefixes that go into a given multi, be it with explicit | operators or
with "proto token"/"multi token" etc.

The NFA runs and collects a list of potential decision outcomes that are
valid. Think of them simply as "which method to call" and ignore
explicit | operators for the moment.

An important thing to note about the NFA that get created is that in
order to have the actual full "longest declarative prefix", they need to
incorporate every subrule that can be reached.

Take this simplified example:

A <term> can be a <literal>, a <variable>, a <subroutinecall>, etc.
a <literal> can be an <integer>, a <floatingpoint>, a <complex>, etc.
an <integer> can be <simpleInteger> or <basedInteger>
a <simpleinteger> is basically <[-+]><[1..9]><[0..9]>*
a <basedInteger> is basically <[-+]>0<[obdx]><[1..9 a..f]><[0..9 a..f]>*

Now what does the regex engine currently do when it sees "0xaffe"?

 1. It runs the NFA for <term>, which is able to successfully match the
    entirety of "0xaffe" and points towards "literal" as the result.
 2. Then it calls the <literal> subrule.
 3. It runs the NFA for <literal>, which is able to successfully match
    the entirety of "0xaffe" and points towards "integer" as the result.
 4. Then it calls the <integer> subrule.
 5. It runs the NFA for <integer>, which is able to successfully match
    the entirety of "0xaffe" and points towards "basedInteger" as the
    result.
 6. Then it calls the <basedInteger> subrule.
 7. Then it parses the whole thing.

As you can see, it runs three separate NFAs against the same text,
starting before 0 and ending after e.

The idea of passing on fates is that the first NFA already knows that
after <literal> is called, it will directly call into <integer>, and
then directly into <basedInteger>. However, it does not currently have a
way to communicate deeper results to the regex engine.

I hope that explanation makes sense even if you're not deep in the regex
engine with your head already.

Kind Regards
=C2=A0 - Timo


--------------4B9573B1184738635415B44C
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<html>
  <head>
    <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF=
-8">
  </head>
  <body text=3D"#000000" bgcolor=3D"#FFFFFF">
    <p>On 13/08/2019 16:32, Brad Gilbert wrote:<br>
    </p>
    <blockquote type=3D"cite"
cite=3D"mid:CAD2L-T0YqabkpW-4dnAzT07vGnA+-K5E7h7PUcy_6fUpvSJKOw@mail.gmai=
l.com">
      <div>Perhaps the biggest one may be the one about passing around
        =E2=80=9Cfates=E2=80=9D. (I barely understand the basics of this.=
)</div>
    </blockquote>
    <p><br>
    </p>
    <p>The optimization opportunity Brad is refering to here is relevant
      mostly to grammars with deeply nested multi-tokens:</p>
    <p>Longest-Token-Matching is implemented by creating an NFA
      (nondeterministic finite automaton) out of all the longest
      declarative prefixes that go into a given multi, be it with
      explicit | operators or with "proto token"/"multi token" etc.</p>
    <p>The NFA runs and collects a list of potential decision outcomes
      that are valid. Think of them simply as "which method to call" and
      ignore explicit | operators for the moment.</p>
    <p>An important thing to note about the NFA that get created is that
      in order to have the actual full "longest declarative prefix",
      they need to incorporate every subrule that can be reached.</p>
    <p>Take this simplified example:</p>
    <p>A &lt;term&gt; can be a &lt;literal&gt;, a &lt;variable&gt;, a
      &lt;subroutinecall&gt;, etc.<br>
      a &lt;literal&gt; can be an &lt;integer&gt;, a
      &lt;floatingpoint&gt;, a &lt;complex&gt;, etc.<br>
      an &lt;integer&gt; can be &lt;simpleInteger&gt; or
      &lt;basedInteger&gt;<br>
      a &lt;simpleinteger&gt; is basically
      &lt;[-+]&gt;&lt;[1..9]&gt;&lt;[0..9]&gt;*<br>
      a &lt;basedInteger&gt; is basically
      &lt;[-+]&gt;0&lt;[obdx]&gt;&lt;[1..9 a..f]&gt;&lt;[0..9 a..f]&gt;*<=
/p>
    <p>Now what does the regex engine currently do when it sees
      "0xaffe"?</p>
    <ol>
      <li>It runs the NFA for &lt;term&gt;, which is able to
        successfully match the entirety of "0xaffe" and points towards
        "literal" as the result.</li>
      <li>Then it calls the &lt;literal&gt; subrule.</li>
      <li>It runs the NFA for &lt;literal&gt;, which is able to
        successfully match the entirety of "0xaffe" and points towards
        "integer" as the result.</li>
      <li>Then it calls the &lt;integer&gt; subrule.</li>
      <li>It runs the NFA for &lt;integer&gt;, which is able to
        successfully match the entirety of "0xaffe" and points towards
        "basedInteger" as the result.</li>
      <li>Then it calls the &lt;basedInteger&gt; subrule.</li>
      <li>Then it parses the whole thing.</li>
    </ol>
    <p>As you can see, it runs three separate NFAs against the same
      text, starting before 0 and ending after e.</p>
    <p>The idea of passing on fates is that the first NFA already knows
      that after &lt;literal&gt; is called, it will directly call into
      &lt;integer&gt;, and then directly into &lt;basedInteger&gt;.
      However, it does not currently have a way to communicate deeper
      results to the regex engine.</p>
    <p>I hope that explanation makes sense even if you're not deep in
      the regex engine with your head already.</p>
    <p>Kind Regards<br>
      =C2=A0 - Timo<br>
    </p>
  </body>
</html>

--------------4B9573B1184738635415B44C--
0
timo
8/13/2019 3:38:36 PM
Reply: