RFC: what about long regex EXACT nodes

Currently, if there is a long string of text that is to be matched
exactly (or under /i), that data is chunked into pieces of at most 256
bytes.  The reason for this limit is that the node's length field
happens to have only 8 bits available.
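
For reference, the string nodes look roughly like this (a simplified
sketch after regcomp.h; not the exact definition in any particular
release):

    typedef unsigned char  U8;
    typedef unsigned short U16;

    struct regnode_string {
        U8   str_len;    /* 8-bit length field, hence the 256-byte cap */
        U8   type;       /* node type, e.g. EXACT or EXACTFU */
        U16  next_off;   /* offset to the next node in the program */
        char string[1];  /* the literal bytes follow the header */
    };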

But why not have a new node type for longer strings, one that isn't
limited to 256 bytes?  Is there a reason we haven't done this other
than lack of tuits?
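
A minimal sketch of what such a node might look like, reusing the
typedefs above; the name, the width of the length field, and the field
layout are all just assumptions for illustration:

    typedef unsigned int U32;

    /* Hypothetical wider string node; not from the perl source. */
    struct regnode_lstring {
        U8   flags;
        U8   type;       /* a new node type, say LEXACT */
        U16  next_off;
        U32  str_len;    /* up to ~4 billion bytes instead of 256 */
        char string[1];
    };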

The advantages of such a node are less overhead when matching: you can
keep going longer in the matching loop, and the memcmp for exact
matches becomes a single call rather than several.
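
That is, the inner loop collapses from one dispatch-plus-memcmp per
chunk to a single call.  A toy illustration (not the actual regexec.c
code; "chunk" stands in for the real regnode):

    #include <string.h>

    struct chunk { unsigned char len; const char *str; };

    /* Today: a long literal is a chain of <=256-byte nodes, so the
     * engine pays one dispatch and one memcmp per chunk. */
    static int match_chunked(const char *in, const struct chunk *c, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (memcmp(in, c[i].str, c[i].len) != 0)
                return 0;
            in += c[i].len;
        }
        return 1;
    }

    /* With a long node: the whole literal is one memcmp. */
    static int match_long(const char *in, const char *lit, size_t len)
    {
        return memcmp(in, lit, len) == 0;
    }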

I don't know if the optimizer currently chains such nodes together
when computing the minimum and maximum lengths a string must have to
be able to match.  It may be that it stops at 256.  If so, this change
would improve the ability to trivially reject a match when the length
criteria aren't met.
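
The payoff would look something like this (names and types are mine,
not the optimizer's):

    #include <stddef.h>

    struct snode { size_t str_len; struct snode *next; };

    /* Sum the literal lengths of a chain of string nodes to get a
     * minimum match length; a shorter target can then be rejected
     * without running the engine at all. */
    static int can_possibly_match(size_t target_len, const struct snode *n)
    {
        size_t min_len = 0;
        for (; n != NULL; n = n->next)
            min_len += n->str_len;
        return target_len >= min_len;
    }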

So is there a reason not to have this?

Karl Williamson, 9/11/2019 5:20:01 AM

On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com> wrote:

> Currently, if there is a long string of text that is to be matched
> exactly (or under /i), that data is chunked into pieces of at most
> 256 bytes.  The reason for this limit is that the node's length field
> happens to have only 8 bits available.
>
> But why not have a new node type for longer strings, one that isn't
> limited to 256 bytes?  Is there a reason we haven't done this other
> than lack of tuits?
>
> The advantages of such a node are less overhead when matching: you
> can keep going longer in the matching loop, and the memcmp for exact
> matches becomes a single call rather than several.
>
> I don't know if the optimizer currently chains such nodes together
> when computing the minimum and maximum lengths a string must have to
> be able to match.  It may be that it stops at 256.  If so, this
> change would improve the ability to trivially reject a match when the
> length criteria aren't met.
>
> So is there a reason not to have this?
>

I think this sounds reasonable.

Yves


demerphq, 9/11/2019 9:25:11 AM

On 9/11/19 3:25 AM, demerphq wrote:
>
> On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com>
> wrote:
>
>     [...]
>
> I think this sounds reasonable.
>
> Yves
>

As an example of a use for this, a biologist friend gave me this info:

"For me, the longest string I would ever use would be a whole=20
chromosome. In humans that ranges from 50 million -300 million letters=20
long (DNA bases). In some other organisms chromosomes can be billions of=20
bases long, I think the longest maybe 5 billion, estimated, but so far=20
the technology isn't there to accurately sequence and assemble those=20
super well....  If you want to put a whole genome into a single string,=20
that pushes the max to 130 billion bases, for Paris japonica, if you=20
combine all 40 chromosomes into one string...
Anyway, for me, I would say that 1 billion bases would be a good max=20
size, for most organisms, though a few go to 5-6, that I know of."

So we have a target string of up to 130 billion bytes (way above the
U32 capacity of about 4.3 billion), but more likely the maximum is
less than a billion.
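
A quick sanity check on those numbers (plain C, just arithmetic):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t u32_max = UINT32_MAX;       /* 4,294,967,295 ~ 4.3e9 */
        uint64_t genome  = 130000000000ULL;  /* ~130 billion bases */
        /* output: "fits in U32? no" -- a 64-bit length is needed */
        printf("fits in U32? %s\n", genome <= u32_max ? "yes" : "no");
        return 0;
    }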

And they chunk the matching due to memory and other limitations, but
they could have a pattern of length 100K, though usually it is just a
few thousand.
Karl Williamson, 9/13/2019 12:08:45 AM