RFC: Adding \p{foo=/re/}

The Unicode Technical Standard #18 on regular expressions suggests that
Unicode properties have what I'm calling a subpattern and they call
"wildcard properties":

http://www.unicode.org/reports/tr18/#Wildcard_Properties

I am proposing to implement this in 5.30.  I already have a working 
prototype, which you can find in

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core

and play with.  Attached is a script that exercises it to create a
pattern that matches IPv4 addresses written in the digits of any script,
and rejects invalid ones.  Thus the script would work for Bengali or Thai
numbers.  The motivation for this came from Abigail.

Certain things aren't clear to me about how it should behave.  Should 
the default be anchored (as currently) so that you have to begin and/or 
end with '.*' to unanchor it?  I think most uses will want it anchored 
as implied by the equals sign, but that's not how other patterns behave, 
and that inconsistency probably would be too confusing.  One thing that 
might emphasize that it isn't anchored is to make them write

\p{foo=~/bar/}

(requiring a tilde)

Comments?

Attachment: abigail.pl

no warnings 'experimental::script_run';
no warnings 'experimental::regex_sets';

# Can match a substring, so this intermediate regex needs to have context or
# anchoring in its final use.  Using nt=de yields decimal digits.  When
# specifying a subset of these, we must include nt=de to prevent things like
# U+00B2 SUPERSCRIPT TWO from matching
my $zero_through_255 =
 qr/ (*sr:                                                  # All from same script
           (?[ \p{nv=0} & \p{nt=de} ])*                     # Optional leading
                                                            #   zeros
       (                                                    # Then one of:
                            \p{nt=de}{1,2}                  #   0 - 99
           | (?[ \p{nv=1} & \p{nt=de} ])  \p{nt=de}{2}      #   100 - 199
           | (?[ \p{nv=2} & \p{nt=de} ]) 
              (  (?[ \p{nv=[0-4]} & \p{nt=de} ]) \p{nt=de}  #   200 - 249
               | (?[ \p{nv=5}     & \p{nt=de} ])  
                 (?[ \p{nv=[0-5]} & \p{nt=de} ])            #   250 - 255
              )
       )
     )
 /x;


my $ipv4 = qr/ \A (*sr:         $zero_through_255
                        (?: [.] $zero_through_255 ) {3}
                  )
               \z
           /x;


#use re qw(Debug ALL);
print "255.255.255.255" =~ /$ipv4/, "\n";
print "\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}" =~ /$ipv4/, "\n";
print "\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}\x{666}" =~ /$ipv4/, "\n";
print "\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}\x{665}.\x{662}\x{665}5" =~ /$ipv4/, "\n";
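
As an aside on the nt=de comment in the script: the sketch below uses only
long-standing \p{} properties (no subpatterns, no regex sets), so it should
run on a stock perl.  The helper names has_nv2 and is_nt_de are made up for
illustration.  It checks three code points that all have numeric value 2.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the attached script.
# has_nv2:  does the code point have Numeric_Value 2?
# is_nt_de: does the code point have Numeric_Type Decimal?
sub has_nv2  { return chr($_[0]) =~ /\p{nv=2}/  ? 1 : 0 }
sub is_nt_de { return chr($_[0]) =~ /\p{nt=de}/ ? 1 : 0 }

# U+0032 DIGIT TWO, U+00B2 SUPERSCRIPT TWO, U+0662 ARABIC-INDIC DIGIT TWO
for my $cp (0x32, 0xB2, 0x662) {
    printf "U+%04X  nv=2: %s  nt=de: %s\n", $cp,
        (has_nv2($cp)  ? "yes" : "no"),
        (is_nt_de($cp) ? "yes" : "no");
}
```

U+00B2 should come out as nv=2 but not nt=de (its numeric type is Digit, not
Decimal), which is why the script intersects every nv= test with \p{nt=de}.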

public
2/5/2019 10:47:18 PM
perl.perl5.porters

On Tue, Feb 5, 2019 at 4:47 PM Karl Williamson <public@khwilliamson.com>
wrote:

> The Unicode Technical Standard #18 on regular expressions
>
> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>
>
hmm. That fascinating document contains this pearl:

the syntax here is similar to that of Perl Regular Expressions [Perl].) In
some cases, this gives multiple syntactic constructs that provide for the
same functionality.

I guess there's more than one way to say there's more than one way to do
it.


-- 
"I don't know about that, as it is outside of my area of expertise." --
competent specialized practitioners, all the time

davidnicol
2/5/2019 11:20:15 PM
On Tue, Feb 5, 2019 at 5:47 PM Karl Williamson <public@khwilliamson.com>
wrote:

> The Unicode Technical Standard #18 on regular expressions suggests that
> Unicode properties have what I'm calling a subpattern and they call
> wildcard properties
>
> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>
> I am proposing to implement this in 5.30.  I already have a working
> prototype, which you can find in
>
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>
> and play with.  Attached is a script that exercises it to create a
> pattern that matches IPV4 addresses in any language, and fails illegal
> ones.  Thus the script would work for Bengali or Thai  numbers.  The
> motivation for this came from Abigail.
>

Implementing this feature is a great idea.


> Certain things aren't clear to me about how it should behave.  Should
> the default be anchored (as currently) so that you have to begin and/or
> end with '.*' to unanchor it?  I think most uses will want it anchored
> as implied by the equals sign, but that's not how other patterns behave,
> and that inconsistency probably would be too confusing.  One thing that
> might emphasize that it isn't anchored is to make them write
>
> \p{foo=~/bar/}
>
> (requiring a tilde)
>
> Comments?
>

I think it would be best to use the exact syntax as shown in that Unicode
Technical Standard (and document the feature using that syntax), to be as
standards-compliant as possible.  That being said, I see nothing wrong with
allowing an _optional_ tilde as in "\p{foo=~/bar/}" for anyone who finds
that syntax more intuitive.

I'm curious why you say that the equals sign implies an anchored match?
I'm not seeing the connection there.  If anything, an equals sign alone
might be thought to signify assignment, but that's obviously inapplicable
here.  In my mind, "\p{foo=/bar/}" doesn't suggest an anchored pattern
because of the equals sign -- if anything, my inclination would be to
assume the pattern is _not_ anchored, because of the slashes around it.
But that's just my personal opinion/intuition about the semantics implied
by that syntax.

At any rate, the Unicode Technical Report appears to have pretty clear
intentions on this question...

Consider the first example in the table.  The expression "\p{toNfd=/b/}" is
described as:

Characters whose NFD form contains a "b" (U+0062) in the value


In my opinion, the word "contains" in the description above implies an
unanchored search.

More significantly, the second example is "\p{name=/^LATIN LETTER.*P$/}",
which is described as:

Characters with names starting with "LATIN LETTER" and ending with "P"


Notice that this example shows explicit "^" and "$" anchors in the regular
expression, and uses "starting with" and "ending with" in the description
instead of "contains".

This seems unambiguous to me -- if the regular expression were anchored
by default, there would be no purpose in including explicit "^" and "$"
anchors in this example, since they would be redundant.  Also, they would
have needed "\p{toNfd=/.*b.*/}" for the first example to do a "contains"
match, if anchored by default.
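
To make the difference concrete, here is a rough sketch of the two possible
defaults, emulated with ordinary Perl regexes rather than the prototype's
\p{...=/.../} syntax.  The helper value_matches is hypothetical; it only
illustrates what "anchored by default" would mean for the toNfd example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: test a property value against a subpattern,
# either unanchored or implicitly wrapped in \A ... \z.
sub value_matches {
    my ($value, $subpattern, $anchored) = @_;
    my $re = $anchored ? qr/\A(?:$subpattern)\z/ : qr/$subpattern/;
    return $value =~ $re ? 1 : 0;
}

# U+1E03 LATIN SMALL LETTER B WITH DOT ABOVE decomposes to "b" + U+0307
my $nfd = NFD("\x{1E03}");

printf "unanchored /b/: %s\n", value_matches($nfd, 'b', 0) ? "match" : "no match";
printf "anchored   /b/: %s\n", value_matches($nfd, 'b', 1) ? "match" : "no match";
```

Under the unanchored reading the character is in the set; under an implicit
anchor it would only be found by writing something like /.*b.*/.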

I believe that the implied semantics in that Unicode Technical Report are
clear -- the regular expressions should NOT be anchored unless explicit
anchors are used.

Also, as you said, it would be confusing and inconsistent to anchor them
anyhow.

Deven

deven
2/5/2019 11:48:06 PM
On Tue, Feb 5, 2019 at 6:48 PM Deven T. Corzine <deven@ties.org> wrote:

> I believe that the implied semantics in that Unicode Technical Report are
> clear -- the regular expressions should NOT be anchored unless explicit
> anchors are used.
>

Actually, I hadn't looked closely at the third example in the table, but
it's completely unambiguous.  The expression "\p{name=/VARIA(TION|NT)/}" is
described as:

Characters with names containing "VARIATION" or "VARIANT"


This is obviously showing an unanchored search, given that the matched set
includes matches such as these:

MONGOLIAN FREE VARIATION SELECTOR THREE
RIGHT ANGLE VARIANT WITH SQUARE
CUNEIFORM SIGN KU4 VARIANT FORM
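
A rough way to reproduce that kind of matched set on a stock perl, without
the proposed syntax, is to walk a range of code points and test each name
with an ordinary unanchored regex.  names_matching is a hypothetical helper,
and the range U+180B..U+180D is just a small sample chosen for illustration.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: return the names of all code points in
# [$first, $last] whose character name matches $re.
sub names_matching {
    my ($re, $first, $last) = @_;
    my @hits;
    for my $cp ($first .. $last) {
        my $name = charnames::viacode($cp);
        next unless defined $name;          # skip unassigned code points
        push @hits, $name if $name =~ $re;
    }
    return @hits;
}

# Unanchored, as in the TR18 example
print "$_\n" for names_matching(qr/VARIA(TION|NT)/, 0x180B, 0x180D);
```

With an implicit \A ... \z around the same pattern, none of these names
would match, since each has text on both sides of "VARIATION".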


I rest my case. :-)

Deven

deven
2/5/2019 11:56:27 PM
On Tue, Feb 05, 2019 at 03:47:18PM -0700, Karl Williamson wrote:
> The Unicode Technical Standard #18 on regular expressions suggests that
> Unicode properties have what I'm calling a subpattern and they call wildcard
> properties
> 
> http://www.unicode.org/reports/tr18/#Wildcard_Properties
> 
> I am proposing to implement this in 5.30.  I already have a working
> prototype, which you can find in
> 
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
> 
> and play with.  Attached is a script that exercises it to create a pattern
> that matches IPV4 addresses in any language, and fails illegal ones.  Thus
> the script would work for Bengali or Thai  numbers.  The motivation for this
> came from Abigail.
> 
> Certain things aren't clear to me about how it should behave.  Should the
> default be anchored (as currently) so that you have to begin and/or end with
> '.*' to unanchor it?  I think most uses will want it anchored as implied by
> the equals sign, but that's not how other patterns behave, and that
> inconsistency probably would be too confusing.  One thing that might
> emphasize that it isn't anchored is to make them write
> 
> \p{foo=~/bar/}
> 
> (requiring a tilde)
> 
> Comments?

Some of the examples in TR18 would fail if the regexp was anchored by
default.

The cases that do need anchoring in the examples use anchoring syntax:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{name=/^LATIN%20LETTER.*P$/}

Tony
tony
2/5/2019 11:59:23 PM
On 2/5/19 4:59 PM, Tony Cook wrote:
> On Tue, Feb 05, 2019 at 03:47:18PM -0700, Karl Williamson wrote:
>> The Unicode Technical Standard #18 on regular expressions suggests that
>> Unicode properties have what I'm calling a subpattern and they call wildcard
>> properties
>>
>> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>>
>> I am proposing to implement this in 5.30.  I already have a working
>> prototype, which you can find in
>>
>> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>>
>> and play with.  Attached is a script that exercises it to create a pattern
>> that matches IPV4 addresses in any language, and fails illegal ones.  Thus
>> the script would work for Bengali or Thai  numbers.  The motivation for this
>> came from Abigail.
>>
>> Certain things aren't clear to me about how it should behave.  Should the
>> default be anchored (as currently) so that you have to begin and/or end with
>> '.*' to unanchor it?  I think most uses will want it anchored as implied by
>> the equals sign, but that's not how other patterns behave, and that
>> inconsistency probably would be too confusing.  One thing that might
>> emphasize that it isn't anchored is to make them write
>>
>> \p{foo=~/bar/}
>>
>> (requiring a tilde)
>>
>> Comments?
> 
> Some of the examples in TR18 would fail if the regexp was anchored by
> default.
> 
> The cases that do need anchoring in the examples use anchoring syntax:
> 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{name=/^LATIN%20LETTER.*P$/}
> 
> Tony
> 

Although it's called a technical standard, it's not actually a part of 
the Unicode Standard, and even though those clauses are written as if 
they are requirements, they're not.

This was made clear to me when we followed this document closely, and 
then Unicode made a contradictory rule in the actual Standard.  When I 
pointed this out, they (did seem to be embarrassed, and) said UTS 18 
isn't a standard, and they removed the language from it, leaving us in 
the lurch.  There was a deprecation period for people who were using 
what we had furnished, before we fully supported the Standard again.

The lesson here is that Unicode doesn't always know best, and we need to
exercise judgment in following them.  Various things from this document
have been withdrawn as a result of my and others' questioning them.  One
I noticed again today is 2.1, where there used to be an apparent RL2.1
requirement.  This document appears to have been written by a
bunch of people sitting around and brainstorming what would be nice, but
without an implementation to test things out on.

We already differ significantly from their syntaxes.  Our set notation 
is different; we don't have a \p{name=...} syntax, etc.

I knew that they thought the patterns weren't anchored, but my
experience indicates we should do what we think is best in this regard,
which may be the unanchored approach.  But I want to hear what people
think from a Perl-based view.
public
2/6/2019 12:33:23 AM
On Tue, Feb 05, 2019 at 05:33:23PM -0700, Karl Williamson wrote:
> Although it's called a technical standard, it's not actually a part of the
> Unicode Standard, and even though those clauses are written as if they are
> requirements, they're not.
> 
> This was made clear to me when we followed this document closely, and then
> Unicode made a contradictory rule in the actual Standard.  When I pointed
> this out, they (did seem to be embarrassed, and) said UTS 18 isn't a
> standard, and they removed the language from it, leaving us in the lurch.
> There was a deprecation period for people who were using what we had
> furnished, before we fully supported the Standard again.
> 
> The lesson here is that Unicode doesn't always know best, and we need to
> exercise judgment in following them.  Various things from this document have
> been withdrawn as a result of my and others questioning them.  One I noticed
> again today is 2.1, where there there used to be an RL2.1 apparent
> requirement.  This document appears to have been written by a bunch of
> people sitting around and brainstorming what would be nice, but without an
> implementation to test things out on.
> 
> We already differ significantly from their syntaxes.  Our set notation is
> different; we don't have a \p{name=...} syntax, etc.
> 
> I knew that they thought the patterns weren't anchored, but my experience
> indicates we should do what we think is best in this regards, which may be
> the unanchored approach.  But I want to hear what people think from a
> perl-based view.

From a perl POV, I think it should still be unanchored, since the
syntax is that of a regexp.

I expect anchored and unanchored to be about equally useful,
but I suspect you have a better understanding of the possible
range of values for Unicode properties.

If we're not that worried about following TR18 we could change the
syntax some more to \p{foo=~/.../}

Tony
tony
2/6/2019 12:50:16 AM
On Tue, Feb 05, 2019 at 03:47:18PM -0700, Karl Williamson wrote:
>
> Certain things aren't clear to me about how it should behave.  Should  
> the default be anchored (as currently) so that you have to begin and/or  
> end with '.*' to unanchor it?  I think most uses will want it anchored  
> as implied by the equals sign, but that's not how other patterns behave,  
> and that inconsistency probably would be too confusing.  One thing that  
> might emphasize that it isn't anchored is to make them write
>


My vote goes to it being unanchored. Although I'd expect most cases
will prefer an anchored match, it's only one character to make it
anchored. So, defaulting to anchoring is a small gain, and, IMO,
not worth the inconsistency compared to regular regular expressions.


Abigail
booking
2/6/2019 1:01:01 AM
On Tue, Feb 5, 2019 at 7:33 PM Karl Williamson <public@khwilliamson.com>
wrote:

> Although it's called a technical standard, it's not actually a part of
> the Unicode Standard, and even though those clauses are written as if
> they are requirements, they're not.
>

That's a good point to keep in mind.


> This was made clear to me when we followed this document closely, and
> then Unicode made a contradictory rule in the actual Standard.  When I
> pointed this out, they (did seem to be embarrassed, and) said UTS 18
> isn't a standard, and they removed the language from it, leaving us in
> the lurch.  There was a deprecation period for people who were using
> what we had furnished, before we fully supported the Standard again.
>

Sounds like they pulled a technicality to save face...

> The lesson here is that Unicode doesn't always know best, and we need to
> exercise judgment in following them.  Various things from this document
> have been withdrawn as a result of my and others questioning them.  One
> I noticed again today is 2.1, where there there used to be an RL2.1
> apparent requirement.  This document appears to have been written by a
> bunch of people sitting around and brainstorming what would be nice, but
> without an implementation to test things out on.
>

That's probably exactly what happened.

It sucks that they've jerked you around before, but on the bright side,
your questions probably helped make the final standard better, right?

> We already differ significantly from their syntaxes.  Our set notation
> is different; we don't have a \p{name=...} syntax, etc.
>

If there are already syntax differences, I guess that's a decent argument for
requiring "=~/.../" then.


> I knew that they thought the patterns weren't anchored, but my
> experience indicates we should do what we think is best in this regards,
> which may be the unanchored approach.  But I want to hear what people
> think from a perl-based view.
>

Personally, unanchored would be my vote.  It makes much more sense to me
than anchored, which feels awkward and inconsistent.

Deven

deven
2/6/2019 2:08:04 AM
FWIW, I don't like it. What happens if the pattern includes capture
brackets, named recursion, or eval? This seems like a way to squeeze named
recursion concepts into the named property functionality without thinking
through the ramifications.

Yves

On Wed, 6 Feb 2019, 06:47 Karl Williamson <public@khwilliamson.com> wrote:

> The Unicode Technical Standard #18 on regular expressions suggests that
> Unicode properties have what I'm calling a subpattern and they call
> wildcard properties
>
> http://www.unicode.org/reports/tr18/#Wildcard_Properties
>
> I am proposing to implement this in 5.30.  I already have a working
> prototype, which you can find in
>
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>
> and play with.  Attached is a script that exercises it to create a
> pattern that matches IPV4 addresses in any language, and fails illegal
> ones.  Thus the script would work for Bengali or Thai  numbers.  The
> motivation for this came from Abigail.
>
> Certain things aren't clear to me about how it should behave.  Should
> the default be anchored (as currently) so that you have to begin and/or
> end with '.*' to unanchor it?  I think most uses will want it anchored
> as implied by the equals sign, but that's not how other patterns behave,
> and that inconsistency probably would be too confusing.  One thing that
> might emphasize that it isn't anchored is to make them write
>
> \p{foo=~/bar/}
>
> (requiring a tilde)
>
> Comments?
>


--00000000000026f26a058133d4bb--
0
demerphq
2/6/2019 6:27:55 AM
On 2/5/19 11:27 PM, demerphq wrote:
> Fwiw, I don't like it. What happens if the pattern includes capture
> brackets, named recursion or eval ? This seems like a way to squeeze
> named recursion concepts into the named property functionality without
> thinking through the ramifications.
>
> Yves

The way it's implemented is that a separate regex is compiled and executed
during the compilation of the outer one.  Maybe you know something about
how that could fail, but it works in my limited testing, so I'm not sure
your stated concerns are valid.

It calls subpattern_re = re_compile(pattern, 0);
and then pregexec(subpattern_re, ...)
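
To make the "compile and run the inner pattern while compiling the outer one" idea concrete, here is a minimal sketch. It is written in Python rather than Perl or C and does not mirror the prototype's code in regcomp.c. The helper names `chars_with_name_matching` and `expand_wildcards` are hypothetical, only the `name` property is handled, and the inner pattern is searched unanchored, which is one of the questions still open in this thread.

```python
import re
import sys
import unicodedata

# Recognizes the proposed \p{name=/.../} form; other properties are not handled.
WILDCARD = re.compile(r"\\p\{name=/(.*?)/\}")

def chars_with_name_matching(inner_pattern):
    """Run the inner regex once over every character name and collect the hits."""
    inner = re.compile(inner_pattern)
    return [cp for cp in range(sys.maxunicode + 1)
            if inner.search(unicodedata.name(chr(cp), ""))]

def expand_wildcards(outer_pattern):
    """Replace each wildcard property with a fixed bracketed character class."""
    def to_class(match):
        cps = chars_with_name_matching(match.group(1))
        if not cps:
            return "(?!)"  # nothing matched: a construct that can never match
        return "[" + "".join("\\U%08x" % cp for cp in cps) + "]"
    return WILDCARD.sub(to_class, outer_pattern)

# The outer pattern is compiled only after the inner one has been fully resolved.
outer = re.compile(expand_wildcards(r"x\p{name=/^LATIN LETTER.*P$/}y"))
print(bool(outer.fullmatch("x\u0294y")))  # U+0294 LATIN LETTER GLOTTAL STOP
print(bool(outer.fullmatch("xPy")))       # LATIN CAPITAL LETTER P
```

In this sketch the inner regex never sees the subject string; it only sees property values, and by the time the outer pattern runs it has been reduced to an ordinary character class.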

>
> On Wed, 6 Feb 2019, 06:47 Karl Williamson <public@khwilliamson.com
> <mailto:public@khwilliamson.com> wrote:
>
>     The Unicode Technical Standard #18 on regular expressions suggests that
>     Unicode properties have what I'm calling a subpattern and they call
>     wildcard properties
>
>     http://www.unicode.org/reports/tr18/#Wildcard_Properties
>
>     I am proposing to implement this in 5.30.  I already have a working
>     prototype, which you can find in
>
>     https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>
>     and play with.  Attached is a script that exercises it to create a
>     pattern that matches IPV4 addresses in any language, and fails illegal
>     ones.  Thus the script would work for Bengali or Thai  numbers.  The
>     motivation for this came from Abigail.
>
>     Certain things aren't clear to me about how it should behave.  Should
>     the default be anchored (as currently) so that you have to begin and/or
>     end with '.*' to unanchor it?  I think most uses will want it anchored
>     as implied by the equals sign, but that's not how other patterns
>     behave,
>     and that inconsistency probably would be too confusing.  One thing that
>     might emphasize that it isn't anchored is to make them write
>
>     \p{foo=~/bar/}
>
>     (requiring a tilde)
>
>     Comments?
>
0
public
2/6/2019 7:46:45 PM
--0000000000008d3c6b05813f403e
Content-Type: text/plain; charset="UTF-8"

Which context is the "should it be anchored or not" question concerning?

The examples in the "The following table shows examples of the use of
wildcards" table clearly show that the regexes against the character code
names are not anchored: there are two "contains" examples without anchors,
and one "starting with" example with an explicit hat anchor.
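
The difference between the two readings can be sketched in a few lines. This is Python rather than Perl, used only to illustrate the semantics; the sample pattern strings are made up for the illustration and are not taken from the UTS 18 table.

```python
import re
import unicodedata

name = unicodedata.name("\u1e03")  # LATIN SMALL LETTER B WITH DOT ABOVE

# Unanchored reading: the pattern may match anywhere inside the property value.
contains = re.search(r"WITH DOT", name) is not None

# Anchored reading: the pattern must account for the whole property value.
whole = re.fullmatch(r"WITH DOT", name) is not None

# Under the anchored reading, "contains" needs explicit '.*' padding.
padded = re.fullmatch(r".*WITH DOT.*", name) is not None

# Under the unanchored reading, "starts with" needs an explicit '^'.
starts = re.search(r"^LATIN SMALL", name) is not None

print(contains, whole, padded, starts)
```

Either reading can express the other; the question is only which one costs the extra characters.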

On Tue, Feb 5, 2019 at 6:33 PM Karl Williamson <public@khwilliamson.com>
wrote:

>
> I knew that they thought the patterns weren't anchored, but my
> experience indicates we should do what we think is best in this regards,
> which may be the unanchored approach.  But I want to hear what people
> think from a perl-based view.
>


--0000000000008d3c6b05813f403e--
0
davidnicol
2/6/2019 8:05:13 PM
--000000000000049ee0058146c15a
Content-Type: text/plain; charset="UTF-8"

On Wed, Feb 6, 2019 at 2:47 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/5/19 11:27 PM, demerphq wrote:
> > Fwiw, I don't like it. What happens if the pattern includes capture
> > brackets, named recursion or eval ? This seems like a way to squeeze
> > named recursion concepts into the named property functionality without
> > thinking through the ramifications.
> >
> > Yves
>

Yves, do you still have concerns if the property regular expression is
evaluated independently?


> The way it's implemented is that a separate regex is compiled and executed
> during the compilation of the outer one.  Maybe you know something about
> how that could fail, but it works in my limited testing, so I'm not sure
> your stated concerns are valid.
>
> It calls subpattern_re = re_compile(pattern, 0);
> and then pregexec(subpattern_re, ...)
>

Does this work inside a character class?

Deven


--000000000000049ee0058146c15a--
0
deven
2/7/2019 5:02:38 AM
--000000000000f9e7d4058163ceb7
Content-Type: text/plain; charset="UTF-8"

On Tue, Feb 5, 2019 at 5:47 PM Karl Williamson <public@khwilliamson.com>
wrote:

> I am proposing to implement this in 5.30.  I already have a working
> prototype, which you can find in
>
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
>

It could use a couple of casts to silence compiler warnings...

Deven

diff --git a/regcomp.c b/regcomp.c
index 27efe853c8..b19fab4768 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -21898,7 +21898,7 @@ Perl_handle_user_defined_property(pTHX_
         }

         do {
-            if (min > (MAX_LEGAL_CP >> 4)) {
+            if (min > (IV)(MAX_LEGAL_CP >> 4)) {
                 s = strchr(s, '\n');
                 if (s == NULL) {
                     s = e;
@@ -21933,7 +21933,7 @@ Perl_handle_user_defined_property(pTHX_
             /* Look for the high point of the range */
             max = 0;
             do {
-                if (max > (MAX_LEGAL_CP >> 4)) {
+                if (max > (IV)(MAX_LEGAL_CP >> 4)) {
                     s = strchr(s, '\n');
                     if (s == NULL) {
                         s = e;


--000000000000f9e7d4058163ceb7--
0
deven
2/8/2019 3:42:26 PM
--0000000000005fad1205816ed8ee
Content-Type: text/plain; charset="UTF-8"

On Thu, 7 Feb 2019, 03:46 Karl Williamson, <public@khwilliamson.com> wrote:

> On 2/5/19 11:27 PM, demerphq wrote:
> > Fwiw, I don't like it. What happens if the pattern includes capture
> > brackets, named recursion or eval ? This seems like a way to squeeze
> > named recursion concepts into the named property functionality without
> > thinking through the ramifications.
> >
> > Yves
>
> The way it's implemented is that a separate regex is compiled and executed
> during the compilation of the outer one.  Maybe you know something about
> how that could fail, but it works in my limited testing, so I'm not sure
> your stated concerns are valid.
>
> It calls subpattern_re = re_compile(pattern, 0);
> and then pregexec(subpattern_re, ...)
>

I hate to say it Karl, but this is what worries me.

This behavior seems like a poorly thought through attempt to do the same
thing as named recursion, and unless very carefully implemented it will
result in terrible performance problems that we will have to sort out, and
this type of implementation will not make that easy.

Consider something like this:

/\p{foo=/whatever/}\p{foo}/

given your proposed implementation how will the optimiser know that this
pattern is equivalent to

/whateverwhatever/

This is what I mean by poorly thought out. How does this integrate with
other behavior: quantifiers, atomic patterns, backtracking, etc.?

How about this:

"abbbabababababcab"=~/\p{foo=/[abc]+/}cab/

how will the optimizer backtrack this pattern?

What about

/\p{foo=/(??{rand})/}\p{foo}/

what will that do?

I think this proposal needs a LOT more thought and analysis before it goes
into Perl.

I understand the temptation of "hey, I can trivially bolt this new feature
into Perl", as I have myself been seduced by it in the past, but honestly,
it is a mistake to allow yourself to succumb. It is all too easy to add a
feature using a trivial implementation, but much much more difficult to
address the fallout later when people point out it isn't as efficient as it
should be, or doesn't interact sanely with other features in the regex
engine, or doesn't have a clear definition of the behavior.

I think a lot more thought is needed before this gets added, and it
probably can't be implemented as you said, or patterns using it will easily
become quadratic and then lead to performance complaints.

Some of the questions I have are, how does it interact with capture
buffers, how does it interact with optimizations like the start-class
optimization, mandatory string detection, etc. How does it interact with
(??{...}) and (?{ ... }), how does it interact with the verbs? How does it
interact with $^R and $REGMATCH and $REGERROR? How does it interact with
named recursion? How do we avoid this form of expression becoming quadratic
or disabling optimisations?

For instance what happens here:

/\p{foo=/blah(?<name>...)/}(?&foo)/

Some of my experience is that it was easy to add named recursion to Perl,
but much much harder to optimize the result properly. I had to put
significantly more work into the optimization phase than the actual named
recursion implementation, which was pretty trivially added to the existing
EVAL framework.

I think you need to ask and answer a lot more questions than "is it
anchored" before this goes in.

I am *not* opposed to it going in, but these kinds of questions need to be
answered first. So until a much more detailed summary of behavior is
provided I am against this.

To give you an example of my experience with these "neat features", I
implemented (?|...) and it took a few years before some of the flaws were
identified in it, and some of them are still yet to be resolved. I had a
similar experience with named recursion. Given that experience I am now in
the camp that nothing new like this should be added until all these
questions can be answered *first*, and the implementation needs to be smart
enough to resolve those questions in its first release.

So for instance, I could see /\p{foo=/.../}/ being implemented internally
as something like /(?(DEFINE)(?<foo>...))\p{foo}/ except that named
recursion assumes that a named pattern is also a numbered capture buffer,
so something would have to be done to address that. Maybe a form of a
recursive subpattern that doesn't capture explicitly,  but then I would
expect it to have an equivalent non \p{...} form, and I wonder how that
would look? Maybe (?<<foo>>...)? Then the implementation would share the
optimisation logic used by the named recursion logic, and we wouldn't have
two totally separate implementations to optimize.

But as is, I think this feature exposes a LOT of questions that need to be
answered before you move forward, and I am VERY doubtful that the naive
implementation you suggest is the right way to do things.

Sorry to be the bearer of bad tidings on this, but once stung, twice shy
and all of that.

Yves


--0000000000005fad1205816ed8ee--
0
demerphq
2/9/2019 4:52:22 AM
On Sat, 9 Feb 2019 at 05:52, demerphq <demerphq@gmail.com> wrote:
>
>
>
> On Thu, 7 Feb 2019, 03:46 Karl Williamson, <public@khwilliamson.com> wrote:
>>
>> On 2/5/19 11:27 PM, demerphq wrote:
>> > Fwiw, I don't like it. What happens if the pattern includes capture
>> > brackets, named recursion or eval ? This seems like a way to squeeze
>> > named recursion concepts into the named property functionality without
>> > thinking through the ramifications.
>> >
>> > Yves
>>
>> The way it's implemented is that a separate regex is compiled and executed
>> during the compilation of the outer one.  Maybe you know something about
>> how that could fail, but it works in my limited testing, so I'm not sure
>> your stated concerns are valid.
>>
>> It calls subpattern_re = re_compile(pattern, 0);
>> and then pregexec(subpattern_re, ...)
>
>
> I hate to say it Karl, but this is what worries me.
>
> This behavior seems like a poorly thought through attempt to do the same
> thing as named recursion, and unless very carefully implemented it will
> result in terrible performance problems that we will have to sort out, and
> this type of implementation will not make that easy.
>
> Consider something like this:
>
> /\p{foo=/whatever/}\p{foo}/
>
> given your proposed implementation how will the optimiser know that this
> pattern is equivalent to
>
> /whateverwhatever/
>
> This is what I mean by poorly thought out. How does this integrate with
> other behavior: quantifiers, atomic patterns, backtracking, etc.?
>
> How about this:
>
> "abbbabababababcab"=~/\p{foo=/[abc]+/}cab/
>
> how will the optimizer backtrack this pattern?
>
> What about
>
> /\p{foo=/(??{rand})/}\p{foo}/
>
> what will that do?
>
> I think this proposal needs a LOT more thought and analysis before it goes
> into Perl.
>
> I understand the temptation of "hey, I can trivially bolt this new feature
> into Perl", as I have myself been seduced by it in the past, but honestly,
> it is a mistake to allow yourself to succumb. It is all too easy to add a
> feature using a trivial implementation, but much much more difficult to
> address the fallout later when people point out it isn't as efficient as it
> should be, or doesn't interact sanely with other features in the regex
> engine, or doesn't have a clear definition of the behavior.
>
> I think a lot more thought is needed before this gets added, and it
> probably can't be implemented as you said, or patterns using it will easily
> become quadratic and then lead to performance complaints.
>
> Some of the questions I have are, how does it interact with capture
> buffers, how does it interact with optimizations like the start-class
> optimization, mandatory string detection, etc. How does it interact with
> (??{...}) and (?{ ... }), how does it interact with the verbs? How does it
> interact with $^R and $REGMATCH and $REGERROR? How does it interact with
> named recursion? How do we avoid this form of expression becoming quadratic
> or disabling optimisations?

Another question is : Does PCRE or any other regex engine support this
already? What semantics do they expose?

Yves
0
demerphq
2/9/2019 4:54:12 AM
On Thu, 7 Feb 2019 at 06:02, Deven T. Corzine <deven@ties.org> wrote:
>
>
>
> On Wed, Feb 6, 2019 at 2:47 PM Karl Williamson <public@khwilliamson.com> wrote:
>>
>> On 2/5/19 11:27 PM, demerphq wrote:
>> > Fwiw, I don't like it. What happens if the pattern includes capture
>> > brackets, named recursion or eval ? This seems like a way to squeeze
>> > named recursion concepts into the named property functionality without
>> > thinking through the ramifications.
>> >
>> > Yves
>
>
> Yves, do you still have concerns if the property regular expression is evaluated independently?

Yes I do have concerns. I replied in detail in another email, but to
summarize succinctly, there are many features in the regex engine, how
does this new proposal interact with them? How do we ensure that using
this feature does not result in quadratic performance when an
equivalent pattern using a different feature set would be linear?

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
0
demerphq
2/9/2019 4:56:06 AM
--000000000000ea49e105816f51a3
Content-Type: text/plain; charset="UTF-8"

On Fri, Feb 8, 2019 at 11:56 PM demerphq <demerphq@gmail.com> wrote:

> Yes I do have concerns. I replied in detail in another email, but to
> summarize succinctly, there are many features in the regex engine, how
> does this new proposal interact with them? How do we ensure that using
> this feature does not result in quadratic performance when an
> equivalent pattern using a different feature set would be linear?
>

I saw your other email, but I think this is something different which
shouldn't be like named recursion.

Quote from the UTS 18 link: "this feature allows the use of a regular
expression to pick out a set of characters based on whether the property
values match the regular expression."

If I understand correctly, any regex used in this mechanism would match
against property values of the Unicode character set, NOT against arbitrary
text.  Since the Unicode data is static, I see no reason why the property
regex shouldn't be compiled independently AND executed immediately, while
compiling the containing regex.  The results should then function as a
fixed predefined character class of Unicode characters, much like a POSIX
character class but specified in a more dynamic and flexible manner.  The
containing regex should be able to include this property-based character
class inside a normal character class.  Since the property regex can be
executed at compile time, there is no risk of making regular expressions
turn quadratic, nor should there be interactions from captures or anything
else.

For example, from UTS 18 again, the property value \p{toNfd=/b/} could be
compiled into [\x{0062}\x{1e03}\x{1e05}\x{1e07}], with the same exact
runtime semantics and performance characteristics, and the property
value \p{name=/^LATIN LETTER.*P$/} could be similarly compiled into
[\x{01aa}\x{0294}\x{0296}\x{1d18}], etc.

If these property regular expressions were compiled and executed at compile
time like this, and turned into straightforward Unicode character classes
to use at runtime, wouldn't that avoid the concerns you mentioned in the
other email?
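
A small sketch of that idea, using the two UTS 18 examples quoted above. It is written in Python rather than Perl, purely as an illustration; `wildcard_set`, `to_nfd` and `char_name` are hypothetical helper names, and the inner pattern is applied unanchored here. The sets are computed once, up front, from static Unicode data, and matching afterwards is a plain membership test.

```python
import re
import sys
import unicodedata

def wildcard_set(value_of, pattern):
    """Code points whose property value (as a string) matches the pattern."""
    rx = re.compile(pattern)
    return {cp for cp in range(sys.maxunicode + 1) if rx.search(value_of(chr(cp)))}

def to_nfd(ch):
    return unicodedata.normalize("NFD", ch)

def char_name(ch):
    return unicodedata.name(ch, "")  # unnamed code points yield ""

# Resolved once, before any subject text is examined.
b_set = wildcard_set(to_nfd, r"b")
p_set = wildcard_set(char_name, r"^LATIN LETTER.*P$")

# The code points listed in the message above should all be members.
for cp in (0x0062, 0x1E03, 0x1E05, 0x1E07):
    print("U+%04X" % cp, cp in b_set)
for cp in (0x01AA, 0x0294, 0x0296, 0x1D18):
    print("U+%04X" % cp, cp in p_set)
```

The exact size of each set depends on the Unicode version of the runtime's data files, so the sketch only checks the code points named in the message rather than asserting the full contents.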

Deven

--000000000000ea49e105816f51a3--
0
deven
2/9/2019 5:26:28 AM
--000000000000f71a310581707e1f
Content-Type: text/plain; charset="UTF-8"

On Sat, 9 Feb 2019, 13:26 Deven T. Corzine, <deven@ties.org> wrote:

> On Fri, Feb 8, 2019 at 11:56 PM demerphq <demerphq@gmail.com> wrote:
>
>> Yes I do have concerns. I replied in detail in another email, but to
>> summarize succinctly, there are many features in the regex engine, how
>> does this new proposal interact with them? How do we ensure that using
>> this feature does not result in quadratic performance when an
>> equivalent pattern using a different feature set would be linear?
>>
>
> I saw your other email, but I think this is something different which
> shouldn't be like named recursion.
>
> Quote from the UTS 18 link: "this feature allows the use of a regular
> expression to pick out a set of characters based on whether the property
> values match the regular expression."
>
> If I understand correctly, any regex used in this mechanism would match
> against property values of the Unicode character set, NOT against arbitrary
> text.  Since the Unicode data is static, I see no reason why the property
> regex shouldn't be compiled independently AND executed immediately, while
> compiling the containing regex.  The results should then function as a
> fixed predefined character class of Unicode characters, much like a POSIX
> character class but specified in a more dynamic and flexible manner.  The
> containing regex should be able to include this property-based character
> class inside a normal character class.  Since the property regex can be
> executed at compile time, there is no risk of making regular expressions
> turn quadratic, nor should there be interactions from captures or anything
> else.
>
> For example, from UTS 18 again, the property value \p{toNfd=/b/} could be
> compiled into [\x{0062}\x{1e03}\x{1e05}\x{1e07}], with the same exact
> runtime semantics and performance characteristics, and the property
> value \p{name=/^LATIN LETTER.*P$/} could be similarly compiled into
> [\x{01aa}\x{0294}\x{0296}\x{1d18}], etc.
>
> If these property regular expressions were compiled and executed at
> compile time like this, and turned into straightforward Unicode character
> classes to use at runtime, wouldn't that avoid the concerns you mentioned
> in the other email?
>

Answering very quickly (I am on holiday) I will say that if what you are
saying is correct that this is a way to define a character class and that
it results in a first order compiled character class then I have no
objections other than the syntax being very misleading in form. *But* that
doesn't seem to match what Karl said in terms of implementation which looks
much closer to an eval/recursion group.

Yves

>

--000000000000f71a310581707e1f--
0
demerphq
2/9/2019 6:50:39 AM
--000000000000469d8c05817288a8
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 9, 2019 at 1:50 AM demerphq <demerphq@gmail.com> wrote:

> Answering very quickly (I am on holiday) I will say that if what you are
> saying is correct that this is a way to define a character class and that
> it results in a first order compiled character class then I have no
> objections other than the syntax being very misleading in form. *But* that
> doesn't seem to match what Karl said in terms of implementation which looks
> much closer to an eval/recursion group.
>
> Yves
>

When Karl asked for comments about this feature, I read through the section
of the UTS 18 document that Karl linked to, and I think I understand what
they’re describing.  My description of compiling the property expression
into a character class is hypothetical, based on how I would approach
implementing the feature efficiently.  I don’t know how Karl’s prototype
works, but it does sound like it may be recursive.

Karl, can you enlighten us?  Are you recursing into a subpattern at
runtime? What do you think of the hypothetical approach I described?

Deven

--000000000000469d8c05817288a8--
0
deven
2/9/2019 9:16:25 AM
--000000000000624751058172b57f
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 9, 2019 at 4:16 AM Deven T. Corzine <deven@ties.org> wrote:

> Karl, can you enlighten us?  Are you recursing into a subpattern at
> runtime? What do you think of the hypothetical approach I described?
>

I just read Karl’s description again: “The way it's implemented is a
separate regex is compiled and executed during the compilation of the
outer one.”

I didn’t notice the “and executed” part the first time.  That sounds
exactly like the hypothetical implementation that I described, actually...

Deven

--000000000000624751058172b57f--
0
deven
2/9/2019 9:29:02 AM
On 2/9/19 2:29 AM, Deven T. Corzine wrote:
> On Sat, Feb 9, 2019 at 4:16 AM Deven T. Corzine <deven@ties.org
> <mailto:deven@ties.org>> wrote:
>
>     Karl, can you enlighten us?  Are you recursing into a subpattern at
>     runtime? What do you think of the hypothetical approach I described?
>
>
> I just read Karl’s description again: “The way it's implemented is a
> separate regex is compiled and executed
> during the compilation of the outer one.”
>
> I didn’t notice the “and executed” part the first time.  That sounds
> exactly like the hypothetical implementation that I described, actually...
>
> Deven
>

I'm sorry for not being clear.  Deven is correct that his hypothetical
implementation is what I have done.

This is a bolt-on feature to Perl's regexes.  It implements a
portion of the wildcard feature of what UTS 18 asks for, using their
syntax.  It is an apparent goal, as long listed in perlunicode, to do as
much of UTS 18 as we can.

And the implementation isn't efficient.

It is implemented by, during the compilation of a character class,
interrupting that compilation, assembling an inner pattern, then
compiling that and executing it to find all the code points it matches.
That list is then added to whatever else is in the character class, the
inner pattern's space is freed, and compilation of the outer pattern
resumed.  There is no recursive execution.  But there is recursion in
the sense, as I described, that a second pattern is compiled while in
the middle of compiling an outer pattern.  I don't know if that is an
issue or not.  The patterns do not share anything, no groups, etc.
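
The sequence described above (compile an inner pattern, run it to collect code points, merge the result into the enclosing class) can be sketched roughly as follows. This is Python rather than the C of the regex compiler, the function names are hypothetical, the scan is limited to U+0000..U+2FFF purely to keep the example quick, and representing the class as ranges is an assumption for illustration, not a statement about Perl's internal data structures.

```python
import re
import unicodedata

def matching_code_points(pattern, lo=0, hi=0x2FFF):
    """Compile the inner pattern and run it against each character name."""
    inner = re.compile(pattern)   # separate pattern; shares nothing with the outer one
    return {cp for cp in range(lo, hi + 1)
            if inner.search(unicodedata.name(chr(cp), ""))}

def to_ranges(code_points):
    """Collapse a set of code points into sorted, inclusive (start, end) ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current range
        else:
            ranges.append((cp, cp))            # start a new range
    return ranges

def build_class(ranges):
    """Render the ranges as an ordinary bracketed character class."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"\\U{start:08X}")
        else:
            parts.append(f"\\U{start:08X}-\\U{end:08X}")
    return "[" + "".join(parts) + "]"

# The outer class already contains the ASCII digits; the wildcard's
# matches are simply added to that set before compilation resumes.
existing = set(range(ord("0"), ord("9") + 1))
merged = existing | matching_code_points(r"^LATIN LETTER.*P$")
outer = re.compile(build_class(to_ranges(merged)))
```

After the merge, `outer` is a plain character class; the inner pattern object is no longer needed and can be discarded.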

I've learned that a feature like this should be marked as experimental,
so that it can be refined or even removed, and marking it as such lowers
expectations as to its well-thought-outness and bug-free-ness.  It
allows us to try things out and get feedback without having to say we
think it is fully done.  The prototype is so marked.

I've also learned that inefficiencies in compilation don't really
matter.  I removed an entire pass of the regex compilation process, with
extra mallocs being the price.  There did not seem to be a noticeable
change in the speed of execution of our test suite!  This inefficient
implementation (and I don't know another way to do it) won't be
noticeable in the end, because it's only done at compilation.

I believe PCRE doesn't do this; I don't know about other engines.  But
if no one does, I would think that us having a feature no one else does
is a selling point.  If others do, we could perhaps learn from their
syntax.  A quick google search didn't turn up anything obvious.

If there are issues with various constructs, we can forbid those.  My
implementation, for example, doesn't allow braces in the subpattern, and
hence no construct that requires braces.  I think that's a reasonable
initial restriction to make it easier to implement something, that
otherwise wouldn't get implemented.

If the UTS 18 syntax is misleading, what isn't?
0
public
2/9/2019 5:01:08 PM
On Sat, 9 Feb 2019, Karl Williamson wrote:

> I believe PCRE doesn't do this; 

For the record, you are correct. Also, I think that the chance of PCRE ever
doing it is vanishingly small.

Regards,
Philip

-- 
Philip Hazel
0
ph10
2/9/2019 5:17:23 PM
--00000000000076aabe0581a822ff
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 9, 2019 at 12:01 PM Karl Williamson <public@khwilliamson.com>
wrote:

> I'm sorry for not being clear.  Deven is correct that his hypothetical
> implementation is what I have done.
>

That’s good to hear!  I was hoping it would be implemented in such a
fashion.

> This is a bolt-on feature to Perl's regexes.  It implements a
> portion of the wildcard feature of what UTS 18 asks for, using their
> syntax.  It is an apparent goal, as long listed in perlunicode, to do as
> much of UTS 18 as we can.


Using their syntax seems worthwhile even if we already deviate elsewhere.

However, I must say that their last example in the table doesn't make sense
to me:

> Characters in the Letterlike symbol block with different toLowercase values:
>      [\p{toLowercase≠@cp@} & \p{Block=Letterlike Symbols}]


This seems to imply some sort of boolean logic, which sounds good in
principle, but this syntax seems bizarre to me.  I would expect each
\p{...} expression to be independent, but if they want two different
property matches to apply as a set intersection, I think one of these
would be a more reasonable syntax:

     \p{toLowercase≠@cp@ & Block=Letterlike Symbols}
or
     \p{{toLowercase≠@cp@} & {Block=Letterlike Symbols}}
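
Whatever the surface syntax, the set being described is an ordinary intersection. A small Python sketch, under the assumption that `str.lower()` is an acceptable stand-in for the toLowercase mapping (the variable names are illustrative only):

```python
# Code points whose lowercase mapping differs from the character itself,
# i.e. a rough analogue of \p{toLowercase≠@cp@}.
differs = {cp for cp in range(0x110000) if chr(cp).lower() != chr(cp)}

# The Letterlike Symbols block spans U+2100..U+214F.
letterlike = set(range(0x2100, 0x2150))

# The "&" in the UTS 18 example is set intersection.
result = sorted(differs & letterlike)
```

The result includes characters such as U+2126 OHM SIGN, whose lowercase mapping is U+03C9 GREEK SMALL LETTER OMEGA.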

Also, on the topic of syntax, since these are meant to be used for sets of
characters that can be used in a character class, I would suggest that
these \p{...} expressions should also work *outside* square brackets,
implying [\p{...}] if the square brackets are omitted.  (Perhaps you
already do this too?)

> And the implementation isn't efficient.
>
> It is implemented by, during the compilation of a character class,
> interrupting that compilation, assembling an inner pattern, then
> compiling that and executing it to find all the code points it matches.
> That list is then added to whatever else is in the character class, the
> inner pattern's space is freed, and compilation of the outer pattern
> resumed.  There is no recursive execution.  But there is recursion in
> the sense, as I described, that a second pattern is compiled while in
> the middle of compiling an outer pattern.  I don't know if that is an
> issue or not.  The patterns do not share anything, no groups, etc.


As long as it's all compile-time, it's probably plenty efficient enough
already.  Still, it might be worth keeping a cache of the \p{...}
expressions used and the set of Unicode characters each generated, to avoid
incurring the cost of generating the set if the same expressions are used
over and over again.  The cache could be discarded at the end of the
compilation phase, either for the one containing regex, or (perhaps better)
after compiling the entire program.  Beyond that, I'm not sure what else
could be done to optimize it much more.
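
A minimal sketch of the caching idea, in Python and with invented names (`PROPERTIES`, `wildcard_set`); it is only meant to show the shape of a cache keyed on the property and pattern, not how Perl would store one.

```python
import re
import unicodedata
from functools import lru_cache

# Hypothetical table mapping a property name to a function that returns
# that property's value for a one-character string.
PROPERTIES = {
    "name": lambda ch: unicodedata.name(ch, ""),
    "toNfd": lambda ch: unicodedata.normalize("NFD", ch),
}

@lru_cache(maxsize=None)
def wildcard_set(prop, pattern):
    """Expand (property, pattern) into a frozenset of code points, once."""
    value_of = PROPERTIES[prop]
    rx = re.compile(pattern)
    return frozenset(
        cp for cp in range(0x110000)
        if not 0xD800 <= cp <= 0xDFFF          # skip surrogates
        and rx.search(value_of(chr(cp)))
    )

first = wildcard_set("name", r"^LATIN LETTER.*P$")    # full scan
second = wildcard_set("name", r"^LATIN LETTER.*P$")   # served from the cache
```

Calling `wildcard_set.cache_clear()` would discard the cache, for instance once compilation of the whole program has finished.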


> I've learned that a feature like this should be marked as experimental,
> so that it can be refined or even removed, and marking it as such lowers
> expectations as to its well-thought-outness and bug-free-ness.  It
> allows us to try things out and get feedback without having to say we
> think it is fully done.  The prototype is so marked.
>

Good idea, especially since a later revision of the official Unicode
standard could change this.


> I've also learned that inefficiencies in compilation don't really
> matter.  I removed an entire pass of the regex compilation process, with
> extra mallocs being the price.  There did not seem to be a noticeable
> change in the speed of execution of our test suite!  This inefficient
> implementation (and I don't know another way to do it) won't be
> noticeable in the end, because it's only done at compilation.
>

I would agree with this.  You're calling this implementation inefficient,
but I'm not sure that word applies if there isn't a substantially better
way to do it.  Creating a fixed character set at compile time is the thing
that will make this efficient at runtime, and as long as the cost at
compile time is small, it's not likely to even be noticed.


> I believe PCRE doesn't do this; I don't know about other engines.  But
> if no one does, I would think that us having a feature no one else does
> is a selling point.  If others do, we could perhaps learn from their
> syntax.  A quick google search didn't turn up anything obvious.
>

I doubt anyone else does it yet.  If Perl has it, perhaps PCRE would
consider copying it later to try to maintain better compatibility with
Perl, but they might not even bother.


> If there are issues with various constructs, we can forbid those.  My
> implementation, for example, doesn't allow braces in the subpattern, and
> hence no construct that requires braces.  I think that's a reasonable
> initial restriction to make it easier to implement something, that
> otherwise wouldn't get implemented.
>

It would be good to support balanced/escaped braces, but that can certainly
be a second pass...


> If the UTS 18 syntax is misleading, what isn't?
>

I'm not even sure what you mean by this!

Deven

--00000000000076aabe0581a822ff--
0
deven
2/12/2019 1:13:24 AM
--000000000000f3b83e0581ac2fbd
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, 10 Feb 2019, 01:01 Karl Williamson, <public@khwilliamson.com> wrote:

> On 2/9/19 2:29 AM, Deven T. Corzine wrote:
> > On Sat, Feb 9, 2019 at 4:16 AM Deven T. Corzine <deven@ties.org
> > <mailto:deven@ties.org>> wrote:
> >
> >     Karl, can you enlighten us?  Are you recursing into a subpattern at
> >     runtime? What do you think of the hypothetical approach I described?
> >
> >
> > I just read Karl’s description again: “The way it's implemented is a
> > separate regex is compiled and executed
> > during the compilation of the outer one.”
> >
> > I didn’t notice the “and executed” part the first time.  That sounds
> > exactly like the hypothetical implementation that I described,
> > actually...
> >
> > Deven
> >
>
> I'm sorry for not being clear.  Deven is correct that his hypothetical
> implementation is what I have done.
>
> This is a bolt-on feature to Perl's regexes.  It implements a
> portion of the wildcard feature of what UTS 18 asks for, using their
> syntax.  It is an apparent goal, as long listed in perlunicode, to do as
> much of UTS 18 as we can.
>
> And the implementation isn't efficient.
>
> It is implemented by, during the compilation of a character class,
> interrupting that compilation, assembling an inner pattern, then
> compiling that and executing it to find all the code points it matches.
> That list is then added to whatever else is in the character class, the
> inner pattern's space is freed, and compilation of the outer pattern
> resumed.  There is no recursive execution.  But there is recursion in
> the sense, as I described, that a second pattern is compiled while in
> the middle of compiling an outer pattern.  I don't know if that is an
> issue or not.  The patterns do not share anything, no groups, etc.
>
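
The compile-time expansion described in the quoted paragraph above can be sketched in plain Perl. This is an illustration only, not the prototype's code: expand_wildcard is a hypothetical helper, it consults a single Unicode::UCD charinfo() field ("numeric") instead of real property-value names, and it scans only U+0000..U+00FF to stay quick.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper (not part of perl): run an inner pattern against one
# charinfo() field for every code point given, and return an explicit
# bracketed character class listing the code points whose value matched.
sub expand_wildcard {
    my ($field, $subpattern, @code_points) = @_;
    my $inner = qr/$subpattern/;            # inner pattern, compiled on its own
    my @matched;
    for my $cp (@code_points) {
        my $info = charinfo($cp) or next;   # skip unassigned code points
        my $value = $info->{$field};
        next unless defined $value && length $value;
        push @matched, $cp if $value =~ $inner;
    }
    # Only the resulting list survives; the inner pattern is discarded.
    return '[' . join('', map { sprintf '\x{%X}', $_ } @matched) . ']';
}

# Numeric value 0 through 5, scanning only U+0000..U+00FF (an assumption
# made for speed; a real implementation would cover all of Unicode).
my $class = expand_wildcard('numeric', '\A[0-5]\z', 0x00 .. 0xFF);

# The outer pattern now contains a fixed class and nothing dynamic.
my $outer = qr/\A$class+\z/;
print "0125 matches\n"      if "0125" =~ $outer;
print "0126 does not\n" unless "0126" =~ $outer;
```

The point of the sketch is that the inner pattern runs only while the class is being built; matching against $outer afterwards costs no more than any other fixed character class.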

Please consider my objections withdrawn.  Sorry for the misunderstanding,
and thank you for explaining.

Yves



> I've learned that a feature like this should be marked as experimental,
> so that it can be refined or even removed, and marking it as such lowers
> expectations as to its well-thought-outness and bug-free-ness.  It
> allows us to try things out and get feedback without having to say we
> think it is fully done.  The prototype is so marked.
>
> I've also learned that inefficiencies in compilation don't really
> matter.  I removed an entire pass of the regex compilation process, with
> extra mallocs being the price.  There did not seem to be a noticeable
> change in the speed of execution of our test suite!  This inefficient
> implementation (and I don't know another way to do it) won't be
> noticeable in the end, because it's only done at compilation.
>
> I believe PCRE doesn't do this; I don't know about other engines.  But
> if no one does, I would think that us having a feature no one else does
> is a selling point.  If others do, we could perhaps learn from their
> syntax.  A quick google search didn't turn up anything obvious.
>
> If there are issues with various constructs, we can forbid those.  My
> implementation, for example, doesn't allow braces in the subpattern, and
> hence no construct that requires braces.  I think that's a reasonable
> initial restriction to make it easier to implement something, that
> otherwise wouldn't get implemented.
>
> If the UTS 18 syntax is misleading, what isn't?
>
>
>

--000000000000f3b83e0581ac2fbd--
0
demerphq
2/12/2019 6:03:29 AM
On 2/11/19 6:13 PM, Deven T. Corzine wrote:
> On Sat, Feb 9, 2019 at 12:01 PM Karl Williamson <public@khwilliamson.com
> <mailto:public@khwilliamson.com>> wrote:
>
>     I'm sorry for not being clear.  Deven is correct that his hypothetical
>     implementation is what I have done.
>
>
> That’s good to hear!  I was hoping it would be implemented in such a
> fashion.
>
>     This is a bolt-on feature to Perl's regexes.  It implements a
>     portion of the wildcard feature of what UTS 18 asks for, using their
>     syntax.  It is an apparent goal, as long listed in perlunicode, to do as
>     much of UTS 18 as we can.
>
>
> Using their syntax seems worthwhile even if we already deviate elsewhere.
>
> However, I must say that their last example in the table doesn't make
> sense to me:
>
>     Characters in the Letterlike symbol block with different toLowercase
>     values:
>           [\p{toLowercase≠@cp@} & \p{Block=Letterlike Symbols}]
>
>
> This seems to imply some sort of boolean logic, which sounds good in
> principle, but this syntax seems bizarre to me.  I would expect each
> \p{...} expression to be independent, but if they want two different
> property matches to apply as a set intersection, I think one of these
> examples would be a more reasonable syntax:
>
>       \p{toLowercase≠@cp@ & Block=Letterlike Symbols}
> or
>       \p{{toLowercase≠@cp@} & {Block=Letterlike Symbols}}

Well, this implementation is just a start and doesn't include this
fancier stuff, so we can defer deciding this until later.
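
For reference, Perl already has a way to intersect property sets: the extended bracketed character class (?[ ... ]), which the script attached to the original post uses. The sketch below is not the UTS 18 syntax quoted above, and the toLowercase≠@cp@ comparison has no counterpart here; it substitutes the nv and nt properties from the attached script to show an intersection at work.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

# Intersection of two properties: numeric value 2 AND numeric type decimal.
my $decimal_two = qr/\A(?[ \p{nv=2} & \p{nt=de} ])\z/;

# DIGIT TWO, SUPERSCRIPT TWO, ARABIC-INDIC DIGIT TWO
for my $cp (0x32, 0xB2, 0x0662) {
    my $char = chr $cp;
    printf "U+%04X  nv=2: %-3s  nv=2 & nt=de: %s\n", $cp,
        ($char =~ /\A\p{nv=2}\z/ ? "yes" : "no"),
        ($char =~ $decimal_two   ? "yes" : "no");
}
```

SUPERSCRIPT TWO has numeric value 2 but is not a decimal digit, so only the intersection excludes it.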
>
> Also, on the topic of syntax, since these are meant to be used for sets
> of characters that can be used in a character class, I would suggest
> that these \p{...} expressions should also work *outside* square
> brackets as well, and imply [\p{...}] if the square brackets are
> omitted.  (Perhaps you already do this too?)

Yes, already.
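
A small illustration of that equivalence, using ordinary properties so that it runs on any recent Perl; per the reply above, the wildcard form in the prototype is expected to behave the same way. The sample words are arbitrary.

```perl
use strict;
use warnings;

my $bare      = qr/\A\p{Lowercase}+\z/;           # property used on its own
my $bracketed = qr/\A[\p{Lowercase}]+\z/;         # same set inside brackets
my $combined  = qr/\A[\p{Lowercase}\p{Nd}_]+\z/;  # mixed with other members

for my $word ("perl", "Perl", "perl_5") {
    printf "%-7s bare: %-3s  bracketed: %-3s  combined: %s\n", $word,
        ($word =~ $bare      ? "yes" : "no"),
        ($word =~ $bracketed ? "yes" : "no"),
        ($word =~ $combined  ? "yes" : "no");
}
```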
>
>     And the implementation isn't efficient.
>
>     It is implemented by, during the compilation of a character class,
>     interrupting that compilation, assembling an inner pattern, then
>     compiling that and executing it to find all the code points it matches.
>     That list is then added to whatever else is in the character class, the
>     inner pattern's space is freed, and compilation of the outer pattern
>     resumed.  There is no recursive execution.  But there is recursion in
>     the sense, as I described, that a second pattern is compiled while in
>     the middle of compiling an outer pattern.  I don't know if that is an
>     issue or not.  The patterns do not share anything, no groups, etc.
>
>
> As long as it's all compile-time, it's probably plenty efficient enough
> already.  Still, it might be worth keeping a cache of the \p{...}
> expressions used and the set of Unicode characters each generated, to
> avoid incurring the cost of generating the set if the same expressions
> are used over and over again.  The cache could be discarded at the end
> of the compilation phase, either for the one containing regex, or
> (perhaps better) after compiling the entire program.  Beyond that, I'm
> not sure what else could be done to optimize it much more.

I don't think the added complexity is worth it at this stage of
development without real numbers to indicate that it is.  And since
eliminating a full pass of the compilation had no discernible effect, I
doubt that a cache would either.
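
For anyone who wants to measure it, the cache being discussed amounts to memoizing the expansion on the property and subpattern text. A minimal sketch, assuming a hypothetical matching_code_points helper that scans Unicode::UCD charinfo() fields over U+0000..U+00FF only; none of these names come from the prototype.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my %expansion_cache;      # "field\0subpattern" => array ref of code points
my $expansions_run = 0;   # counts how often the expensive scan really runs

# Hypothetical helper: expand a subpattern into a list of code points,
# reusing the earlier result when the same expression is seen again.
sub matching_code_points {
    my ($field, $subpattern) = @_;
    my $key = join "\0", $field, $subpattern;
    return $expansion_cache{$key} //= do {
        $expansions_run++;
        my $inner = qr/$subpattern/;
        [ grep {
              my $info = charinfo($_);
              $info && length($info->{$field}) && $info->{$field} =~ $inner
          } 0x00 .. 0xFF ];
    };
}

my $first  = matching_code_points('numeric', '\A[0-5]\z');
my $second = matching_code_points('numeric', '\A[0-5]\z');  # served from cache
printf "%d code points found, %d scan(s) run\n", scalar @$first, $expansions_run;

%expansion_cache = ();    # discard, e.g. once compilation has finished
```

Whether this pays for itself is exactly the open question above; the counter makes it easy to see how often a real workload would hit the cache.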
>
>     I've learned that a feature like this should be marked as experimental,
>     so that it can be refined or even removed, and marking it as such lowers
>     expectations as to its well-thought-outness and bug-free-ness.  It
>     allows us to try things out and get feedback without having to say we
>     think it is fully done.  The prototype is so marked.
>
>
> Good idea, especially since a later official Unicode standard could change.
>
>     I've also learned that inefficiencies in compilation don't really
>     matter.  I removed an entire pass of the regex compilation process, with
>     extra mallocs being the price.  There did not seem to be a noticeable
>     change in the speed of execution of our test suite!  This inefficient
>     implementation (and I don't know another way to do it) won't be
>     noticeable in the end, because it's only done at compilation.
>
>
> I would agree with this.  You're calling this implementation
> inefficient, but I'm not sure that word applies if there isn't a
> substantially better way to do it.  Creating a fixed character set at
> compile time is the thing that will make this efficient at runtime, and
> as long as the cost at compile time is small, it's not likely to even be
> noticed.
>
>     I believe PCRE doesn't do this; I don't know about other engines.  But
>     if no one does, I would think that us having a feature no one else does
>     is a selling point.  If others do, we could perhaps learn from their
>     syntax.  A quick google search didn't turn up anything obvious.
>
>
> I doubt anyone else does it yet.  If Perl has it, perhaps PCRE would
> consider copying it later to try to maintain better compatibility with
> Perl, but they might not even bother.
>
>     If there are issues with various constructs, we can forbid those.  My
>     implementation, for example, doesn't allow braces in the subpattern, and
>     hence no construct that requires braces.  I think that's a reasonable
>     initial restriction to make it easier to implement something, that
>     otherwise wouldn't get implemented.
>
>
> It would be good to support balanced/escaped braces, but that can
> certainly be a second pass...

What I'm trying to do is give people the ability to do something, while
punting niceties that aren't essential in favor of easier development.

>
>     If the UTS 18 syntax is misleading, what isn't?
>
>
> I'm not even sure what you mean by this!

I meant, if the reader doesn't like the syntax, make a different proposal.


In any event, I've pushed a new branch for people to play around with
that eliminates the anchoring, and allows for more delimiter characters
than the initial branch did.
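
To show what dropping the implicit anchoring means in practice, here is a sketch that assumes the syntax as it was later released in Perl 5.30, where the feature sits behind the experimental::uniprop_wildcards warning category. The branch discussed here may differ in details, so treat the exact spelling as an assumption. With no implicit anchors, the subpattern has to carry its own \A and \z when an exact value is wanted.

```perl
use 5.030;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Subpattern anchored explicitly: numeric value is exactly 0 through 5.
my $anchored   = qr!\A\p{nv=/\A[0-5]\z/}\z!;

# Unanchored: any numeric value whose text contains one of those digits,
# so a fraction such as 1/2 also qualifies.
my $unanchored = qr!\A\p{nv=/[0-5]/}\z!;

# DIGIT THREE, DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF
for my $cp (0x33, 0x37, 0x0663, 0xBD) {
    printf "U+%04X  anchored: %-3s  unanchored: %s\n", $cp,
        (chr($cp) =~ $anchored   ? "yes" : "no"),
        (chr($cp) =~ $unanchored ? "yes" : "no");
}
```

The pattern uses ! as its delimiter so that the slashes around the subpattern need no escaping.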
>
> Deven
0
public
2/15/2019 5:25:40 PM
Top posted, the link to the branch you can try out is

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-core
0
public
2/15/2019 5:52:20 PM
Reply: