Why is REG_INFTY I16_MAX

It is possible to increase this number with a Configure option, but I 
was surprised to learn that the default is INT16_MAX.  This seems low to 
me.  What are the reasons for it being this value?
0
public
7/6/2018 4:32:03 AM
perl.perl5.porters 47444 articles. 0 followers. Follow

9 Replies
56 Views

Similar Articles

[PageSpeed] 55

On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
> It is possible to increase this number with a Configure option, but I was
> surprised to learn that the default is INT16_MAX.  This seems low to me.
> What are the reasons for it being this value?

Historically I think it used to be like that because when the regex engine
used to be recursive, /Something_Complex{1,large_number}/ would blow the
stack.

The thing that put me off changing it myself was that it would presumably
increase the size of certain reg node types, and I'm not clear how reg
node sizes are handled (all that NODE_STEP_REGNODE stuff).

But it definitely needs increasing.


-- 
"Procrastination grows to fill the available time"
    -- Mitchell's corollary to Parkinson's Law
0
davem
7/6/2018 8:05:16 AM
On 07/06/2018 02:05 AM, Dave Mitchell wrote:
> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>> It is possible to increase this number with a Configure option, but I was
>> surprised to learn that the default is INT16_MAX.  This seems low to me.
>> What are the reasons for it being this value?
> 
> Historically I think it used to be like that because when the regex engine
> used to be recursive, /Something_Complex{1,large_number}/ would blow the
> stack.
> 
> The thing that put me off changing it myself was that it would presumably
> increase the size of certain reg node types, and I'm not clear how reg
> node sizes are handled (all that NODE_STEP_REGNODE stuff).
> 
> But it definitely needs increasing.
> 
> 

We could change it to U16, doubling the size, without having to change 
any regnodes.
0
public
7/6/2018 5:11:28 PM
On 07/06/2018 11:11 AM, Karl Williamson wrote:
> On 07/06/2018 02:05 AM, Dave Mitchell wrote:
>> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>>> It is possible to increase this number with a Configure option, but I=
=20
>>> was
>>> surprised to learn that the default is INT16_MAX.=C2=A0 This seems lo=
w to me.
>>> What are the reasons for it being this value?
>>
>> Historically I think it used to be like that because when the regex=20
>> engine
>> used to be recursive, /Something_Complex{1,large_number}/ would blow t=
he
>> stack.
>>
>> The thing that put me off changing it myself was that it would presuma=
bly
>> increase the size of certain reg node types, and I'm not clear how reg
>> node sizes are handled (all that NODE_STEP_REGNODE stuff).
>>
>> But it definitely needs increasing.
>>
>>
>=20
> We could change it to U16, doubling the size, without having to change=20
> any regnodes.

I tried this, and there were a few test failures that assumed the=20
existing size, but no real problems.

Currently, the regnodes that contain this use the regnode_2 structure,=20
which has room for two U16 args (min and max).  But there is a=20
regnode_2L that has a U32 arg and an I32 arg, which we could use=20
instead.  Or we could trivially create a regnode_2UL with two U32s, to=20
avoid casting

If there is agreement to do so, I can go ahead and do it.
0
public
7/6/2018 6:46:38 PM
On 07/06/2018 12:46 PM, Karl Williamson wrote:
> On 07/06/2018 11:11 AM, Karl Williamson wrote:
>> On 07/06/2018 02:05 AM, Dave Mitchell wrote:
>>> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>>>> It is possible to increase this number with a Configure option, but=20
>>>> I was
>>>> surprised to learn that the default is INT16_MAX.=C2=A0 This seems l=
ow to=20
>>>> me.
>>>> What are the reasons for it being this value?
>>>
>>> Historically I think it used to be like that because when the regex=20
>>> engine
>>> used to be recursive, /Something_Complex{1,large_number}/ would blow =
the
>>> stack.
>>>
>>> The thing that put me off changing it myself was that it would=20
>>> presumably
>>> increase the size of certain reg node types, and I'm not clear how re=
g
>>> node sizes are handled (all that NODE_STEP_REGNODE stuff).
>>>
>>> But it definitely needs increasing.
>>>
>>>
>>
>> We could change it to U16, doubling the size, without having to change=
=20
>> any regnodes.
>=20
> I tried this, and there were a few test failures that assumed the=20
> existing size, but no real problems.
>=20
> Currently, the regnodes that contain this use the regnode_2 structure,=20
> which has room for two U16 args (min and max).=C2=A0 But there is a=20
> regnode_2L that has a U32 arg and an I32 arg, which we could use=20
> instead.=C2=A0 Or we could trivially create a regnode_2UL with two U32s=
, to=20
> avoid casting
>=20
> If there is agreement to do so, I can go ahead and do it.
>=20

I doubled the size, leaving for later any further enlargement.  We can=20
see what sorts of problems this small step causes.
0
public
10/2/2018 4:38:28 AM
On 10/1/18 10:38 PM, Karl Williamson wrote:
> On 07/06/2018 12:46 PM, Karl Williamson wrote:
>> On 07/06/2018 11:11 AM, Karl Williamson wrote:
>>> On 07/06/2018 02:05 AM, Dave Mitchell wrote:
>>>> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>>>>> It is possible to increase this number with a Configure option, but=
=20
>>>>> I was
>>>>> surprised to learn that the default is INT16_MAX.=C2=A0 This seems =
low=20
>>>>> to me.
>>>>> What are the reasons for it being this value?
>>>>
>>>> Historically I think it used to be like that because when the regex=20
>>>> engine
>>>> used to be recursive, /Something_Complex{1,large_number}/ would blow=
=20
>>>> the
>>>> stack.
>>>>
>>>> The thing that put me off changing it myself was that it would=20
>>>> presumably
>>>> increase the size of certain reg node types, and I'm not clear how r=
eg
>>>> node sizes are handled (all that NODE_STEP_REGNODE stuff).
>>>>
>>>> But it definitely needs increasing.
>>>>
>>>>
>>>
>>> We could change it to U16, doubling the size, without having to=20
>>> change any regnodes.
>>
>> I tried this, and there were a few test failures that assumed the=20
>> existing size, but no real problems.
>>
>> Currently, the regnodes that contain this use the regnode_2 structure,=
=20
>> which has room for two U16 args (min and max).=C2=A0 But there is a=20
>> regnode_2L that has a U32 arg and an I32 arg, which we could use=20
>> instead.=C2=A0 Or we could trivially create a regnode_2UL with two U32=
s, to=20
>> avoid casting
>>
>> If there is agreement to do so, I can go ahead and do it.
>>
>=20
> I doubled the size, leaving for later any further enlargement.=C2=A0 We=
 can=20
> see what sorts of problems this small step causes.
>=20

As a cross reference,  XML::Easy fails as a result of this.
0
public
10/25/2018 3:02:36 AM
On 10/24/18 9:02 PM, Karl Williamson wrote:
> On 10/1/18 10:38 PM, Karl Williamson wrote:
>> On 07/06/2018 12:46 PM, Karl Williamson wrote:
>>> On 07/06/2018 11:11 AM, Karl Williamson wrote:
>>>> On 07/06/2018 02:05 AM, Dave Mitchell wrote:
>>>>> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>>>>>> It is possible to increase this number with a Configure option,=20
>>>>>> but I was
>>>>>> surprised to learn that the default is INT16_MAX.=C2=A0 This seems=
 low=20
>>>>>> to me.
>>>>>> What are the reasons for it being this value?
>>>>>
>>>>> Historically I think it used to be like that because when the regex=
=20
>>>>> engine
>>>>> used to be recursive, /Something_Complex{1,large_number}/ would=20
>>>>> blow the
>>>>> stack.
>>>>>
>>>>> The thing that put me off changing it myself was that it would=20
>>>>> presumably
>>>>> increase the size of certain reg node types, and I'm not clear how =
reg
>>>>> node sizes are handled (all that NODE_STEP_REGNODE stuff).
>>>>>
>>>>> But it definitely needs increasing.
>>>>>
>>>>>
>>>>
>>>> We could change it to U16, doubling the size, without having to=20
>>>> change any regnodes.
>>>
>>> I tried this, and there were a few test failures that assumed the=20
>>> existing size, but no real problems.
>>>
>>> Currently, the regnodes that contain this use the regnode_2=20
>>> structure, which has room for two U16 args (min and max).=C2=A0 But t=
here=20
>>> is a regnode_2L that has a U32 arg and an I32 arg, which we could use=
=20
>>> instead.=C2=A0 Or we could trivially create a regnode_2UL with two U3=
2s,=20
>>> to avoid casting
>>>
>>> If there is agreement to do so, I can go ahead and do it.
>>>
>>
>> I doubled the size, leaving for later any further enlargement.=C2=A0 W=
e can=20
>> see what sorts of problems this small step causes.
>>
>=20
> As a cross reference,=C2=A0 XML::Easy fails as a result of this.
>=20

This topic came up again in b. d. foy's post

https://www.effectiveperlprogramming.com/2018/12/perl-v5-30-lets-you-matc=
h-more-with-the-general-quantifier/

In it, he talks about his experiments that showed that saying /a{1,}/=20
had a limit of U16_MAX, but /a+/ has a limit of I32_MAX.

I investigated, and the behavior stems from this commit (and possibly=20
other later commits changing things around)

commit faf11cac614129491d0258772ee4e6f8a3fb39e8
  Author: Hugo van der Sanden <hv@crypt.org>
  Date:   Mon Jul 1 14:28:04 2002 +0100

      Re: [ID 20020630.002] utf8 regex only matches 32k
      Message-Id: <200207011228.g61CS4T06766@crypt.compulink.co.uk>
      Date: Mon, 01 Jul 2002 13:28:04 +0100

      p4raw-id: //depot/perl@17390

The commit causes something that is REG_INFTY to be instead translated=20
into I32_MAX in one place in the code.

I believe the numbers should be consistent, and that likely we should=20
use the larger number in current use in the field as the minimum value=20
for infinity.

What I propose is to make 'n' in /a{1,n}/ be a maximum of I32_MAX - 1,=20
else it's an error.  If instead the pattern is written /a{1,}/ we would=20
use plain I32_MAX as the upper limit.  This would make REG_INFTY out of=20
bounds, and so we would know which the pattern was written as.  This=20
frees us from breaking if we later upped the REG_INFTY value to, say,=20
IV_MAX.

The reason not to go to U32_MAX and UV_MAX is that it looks to me like=20
we would have problems with various expressions potentially overflowing.

Making this change would require changing the regnodes that represent=20
the {m,n} constructs to use 32 bit arguments, instead of 16.
0
public
12/11/2018 3:16:52 AM
On 12/10/18 8:16 PM, Karl Williamson wrote:
> On 10/24/18 9:02 PM, Karl Williamson wrote:
>> On 10/1/18 10:38 PM, Karl Williamson wrote:
>>> On 07/06/2018 12:46 PM, Karl Williamson wrote:
>>>> On 07/06/2018 11:11 AM, Karl Williamson wrote:
>>>>> On 07/06/2018 02:05 AM, Dave Mitchell wrote:
>>>>>> On Thu, Jul 05, 2018 at 10:32:03PM -0600, Karl Williamson wrote:
>>>>>>> It is possible to increase this number with a Configure option,=20
>>>>>>> but I was
>>>>>>> surprised to learn that the default is INT16_MAX.=C2=A0 This seem=
s low=20
>>>>>>> to me.
>>>>>>> What are the reasons for it being this value?
>>>>>>
>>>>>> Historically I think it used to be like that because when the=20
>>>>>> regex engine
>>>>>> used to be recursive, /Something_Complex{1,large_number}/ would=20
>>>>>> blow the
>>>>>> stack.
>>>>>>
>>>>>> The thing that put me off changing it myself was that it would=20
>>>>>> presumably
>>>>>> increase the size of certain reg node types, and I'm not clear how=
=20
>>>>>> reg
>>>>>> node sizes are handled (all that NODE_STEP_REGNODE stuff).
>>>>>>
>>>>>> But it definitely needs increasing.
>>>>>>
>>>>>>
>>>>>
>>>>> We could change it to U16, doubling the size, without having to=20
>>>>> change any regnodes.
>>>>
>>>> I tried this, and there were a few test failures that assumed the=20
>>>> existing size, but no real problems.
>>>>
>>>> Currently, the regnodes that contain this use the regnode_2=20
>>>> structure, which has room for two U16 args (min and max).=C2=A0 But =
there=20
>>>> is a regnode_2L that has a U32 arg and an I32 arg, which we could=20
>>>> use instead.=C2=A0 Or we could trivially create a regnode_2UL with t=
wo=20
>>>> U32s, to avoid casting
>>>>
>>>> If there is agreement to do so, I can go ahead and do it.
>>>>
>>>
>>> I doubled the size, leaving for later any further enlargement.=C2=A0 =
We=20
>>> can see what sorts of problems this small step causes.
>>>
>>
>> As a cross reference,=C2=A0 XML::Easy fails as a result of this.
>>
>=20
> This topic came up again in b. d. foy's post
>=20
> https://www.effectiveperlprogramming.com/2018/12/perl-v5-30-lets-you-ma=
tch-more-with-the-general-quantifier/=20
>=20
>=20
> In it, he talks about his experiments that showed that saying /a{1,}/=20
> had a limit of U16_MAX, but /a+/ has a limit of I32_MAX.
>=20
> I investigated, and the behavior stems from this commit (and possibly=20
> other later commits changing things around)
>=20
> commit faf11cac614129491d0258772ee4e6f8a3fb39e8
>  =C2=A0Author: Hugo van der Sanden <hv@crypt.org>
>  =C2=A0Date:=C2=A0=C2=A0 Mon Jul 1 14:28:04 2002 +0100
>=20
>  =C2=A0=C2=A0=C2=A0=C2=A0 Re: [ID 20020630.002] utf8 regex only matches=
 32k
>  =C2=A0=C2=A0=C2=A0=C2=A0 Message-Id: <200207011228.g61CS4T06766@crypt.=
compulink.co.uk>
>  =C2=A0=C2=A0=C2=A0=C2=A0 Date: Mon, 01 Jul 2002 13:28:04 +0100
>=20
>  =C2=A0=C2=A0=C2=A0=C2=A0 p4raw-id: //depot/perl@17390
>=20
> The commit causes something that is REG_INFTY to be instead translated=20
> into I32_MAX in one place in the code.
>=20
> I believe the numbers should be consistent, and that likely we should=20
> use the larger number in current use in the field as the minimum value=20
> for infinity.
>=20
> What I propose is to make 'n' in /a{1,n}/ be a maximum of I32_MAX - 1,=20
> else it's an error.=C2=A0 If instead the pattern is written /a{1,}/ we =
would=20
> use plain I32_MAX as the upper limit.=C2=A0 This would make REG_INFTY o=
ut of=20
> bounds, and so we would know which the pattern was written as.=C2=A0 Th=
is=20
> frees us from breaking if we later upped the REG_INFTY value to, say,=20
> IV_MAX.
>=20
> The reason not to go to U32_MAX and UV_MAX is that it looks to me like=20
> we would have problems with various expressions potentially overflowing=
..
>=20
> Making this change would require changing the regnodes that represent=20
> the {m,n} constructs to use 32 bit arguments, instead of 16.
>=20

And the reason for Hugo's commit was this:
https://www.nntp.perl.org/group/perl.perl5.porters/2018/10/msg252614.html
0
public
12/11/2018 3:33:33 AM
Karl Williamson <public@khwilliamson.com> wrote:
:> I investigated, and the behavior stems from this commit (and possibly 
:> other later commits changing things around)
:> 
:> commit faf11cac614129491d0258772ee4e6f8a3fb39e8
:>   Author: Hugo van der Sanden <hv@crypt.org>
:>   Date:   Mon Jul 1 14:28:04 2002 +0100
:>   Re: [ID 20020630.002] utf8 regex only matches 32k
:>   Message-Id: <200207011228.g61CS4T06766@crypt.compulink.co.uk>
:>   Date: Mon, 01 Jul 2002 13:28:04 +0100
[...]
:And the reason for Hugo's commit was this:
:https://www.nntp.perl.org/group/perl.perl5.porters/2018/10/msg252614.html

That'd be this one, I think:
  https://www.nntp.perl.org/group/perl.perl5.porters/2002/07/msg62591.html

Hugo
0
hv
12/11/2018 10:45:08 AM
On 12/11/18 3:45 AM, hv@crypt.org wrote:
> Karl Williamson <public@khwilliamson.com> wrote:
> :> I investigated, and the behavior stems from this commit (and possibl=
y
> :> other later commits changing things around)
> :>
> :> commit faf11cac614129491d0258772ee4e6f8a3fb39e8
> :>   Author: Hugo van der Sanden <hv@crypt.org>
> :>   Date:   Mon Jul 1 14:28:04 2002 +0100
> :>   Re: [ID 20020630.002] utf8 regex only matches 32k
> :>   Message-Id: <200207011228.g61CS4T06766@crypt.compulink.co.uk>
> :>   Date: Mon, 01 Jul 2002 13:28:04 +0100
> [...]
> :And the reason for Hugo's commit was this:
> :https://www.nntp.perl.org/group/perl.perl5.porters/2018/10/msg252614.h=
tml
>=20
> That'd be this one, I think:
>    https://www.nntp.perl.org/group/perl.perl5.porters/2002/07/msg62591.=
html
>=20
> Hugo
>=20

Further investigation by me leads me to think that the implementation we=20
have currently is fine.  The perldelta for my change needs to be=20
corrected, and I'll check if there are adequate tests.  It turns out=20
that the implementation is already what I had proposed, except that the=20
upper limit specifiable is 16 bits wide.  I no longer see any point to=20
enlarging that at this time, as I say in the following comment on=20
brian's page:

I was the one who made the change and wrote the perldelta. It turns out=20
that the change is valid, I believe, but the perldelta is wrong due to=20
my incomplete understanding of how things work. It should have read=20
something like

=E2=80=9CThe upper limit =E2=80=98n=E2=80=99 specifiable in a regular exp=
ression quantifier of=20
the form =E2=80=98{m,n}=E2=80=99 has been doubled to 65534.

=E2=80=9CThe meaning of an unbounded upper quantifier =E2=80=98{m,}=E2=80=
=99 remains unchanged.=20
It matches at least 2**31 =E2=80=93 1 times.=E2=80=9D

Some platforms (maybe not current ones; I don=E2=80=99t know, I don=E2=80=
=99t keep=20
track) make every number 64 bits wide. It=E2=80=99s on those that the unb=
ounded=20
limit could be larger than 2**31. But Perl takes special care to limit=20
=E2=80=98n=E2=80=99 to 65534 max, even on such platforms.

The bottom line is that the upper limit you can specify is much lower=20
than the infinity that perl uses internally. That could be changed=20
fairly easily, but no one has, to my knowledge, ever complained that=20
it=E2=80=99s too low, and we just doubled it anyway.

You can say =E2=80=98use re qw(Debug ALL)=E2=80=99 before a regular expre=
ssion you are=20
curious about to see in more detail what is happening. If you do so=20
around a quantifier =E2=80=9C{1,}=E2=80=9D, you=E2=80=99ll see that what =
gets generated is=20
identical to what gets generated if you had instead said =E2=80=9C+=E2=80=
=9D.

And, BTW, the reason

$ perl -c -e =E2=80=98/.{0,-1}/=E2=80=99

passes a syntax check is that it is legal; it just doesn=E2=80=99t match =
what=20
probably was intended. Again if you say =E2=80=98use re qw(Debug ALL)=E2=80=
=99, you can=20
see what=E2=80=99s going on.

Final program:
1: REG_ANY (2)
2: EXACT <{0,-1}> (5)
5: END (0)

REG_ANY matches any single character except newline. That=E2=80=99s the d=
ot in=20
the input. Then the exact string =E2=80=98{0-1}=E2=80=99 must be matched.=
 The warning=20
message no longer says (in 5.29) that this will be illegal in 5.30. It=20
will remain legal, and the warning message will remain. Pay attention to=20
it. It=E2=80=99s telling you that the left brace is to be matched literal=
ly,=20
which also implies that what follows isn=E2=80=99t going to be a quantifi=
er.=20
What will be illegal in 5.30 are just the constructs that we intend to=20
change the meaning of. This is to limit the potential breakage of=20
existing code, while still warning that what they might have thought=20
they were getting is wrong. It allows _autoconf_ to not change, for=20
example. (As a head=E2=80=99s up, after 5.30, we can relax the syntax to =
allow=20
some spaces and to be able to say =E2=80=9C{,n}=E2=80=9D. Currently the l=
ower limit must=20
be specified. It will also allow us to extend various escape sequences,=20
so that \w{foo} could mean the =E2=80=9Cfoo=E2=80=9D specialization of \w=
, for whatever=20
specializations we come up with in the future.)
0
public
12/11/2018 3:55:12 PM
Reply: