RFC: what about long regex EXACT nodes

Currently, if there is a long string of text that is to be matched 
exactly (or under /i), that data is chunked into pieces of at most 256 
bytes.  The reason for this limit is that only 8 bits happen to be 
available for the length field.

But why not have a new node type for longer strings, one that isn't 
limited to 256 bytes?  Is there a reason we haven't done this other than 
lack of tuits?

The advantages of such a node are less overhead when matching, since you 
can keep going longer in the matching loop, and the memcmp for exact 
matches will be a single call rather than several.

I don't know if the optimizer currently strings such nodes together when 
computing the minimum and maximum lengths a string must have to be able 
to match.  It may be that it stops at 256.  If so, this would improve the 
ability to reject a match trivially when those criteria aren't met.

So is there a reason not to have this?
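
The chunking described above can be seen with perl's own regex debugging 
output.  The sketch below assumes a stock perl; the exact dump format and 
node names vary by version, and the 300-byte literal is just an arbitrary 
example.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Dump the compiled regex program to STDERR.  On perls where EXACT
    # nodes are capped near 256 bytes, a 300-byte literal shows up as
    # more than one EXACT-style node chained together.
    use re 'debug';

    my $literal = 'ACGT' x 75;      # 300 bytes, longer than one node can hold
    my $re = qr/\Q$literal\E/;      # compiling under "use re 'debug'" prints the program
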
Karl Williamson, 9/11/2019 5:20:01 AM

On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com> wrote:

> [...]

I think this sounds reasonable.

Yves

demerphq, 9/11/2019 9:25:11 AM
On 9/11/19 3:25 AM, demerphq wrote:
> [...]

As an example of uses for this, a biologist friend gave me this info:

"For me, the longest string I would ever use would be a whole=20
chromosome. In humans that ranges from 50 million -300 million letters=20
long (DNA bases). In some other organisms chromosomes can be billions of=20
bases long, I think the longest maybe 5 billion, estimated, but so far=20
the technology isn't there to accurately sequence and assemble those=20
super well....  If you want to put a whole genome into a single string,=20
that pushes the max to 130 billion bases, for Paris japonica, if you=20
combine all 40 chromosomes into one string...
Anyway, for me, I would say that 1 billion bases would be a good max=20
size, for most organisms, though a few go to 5-6, that I know of."

So we have a target string of up to 130 billion bytes (way above the U32 
capacity of 4.3 billion), but more likely the maximum is less than a 
billion.

And they chunk the matching due to memory and other limitations, but they 
could have a pattern of length 100K, though usually just a few thousand.
Karl Williamson, 9/13/2019 12:08:45 AM
On 9/12/19 6:08 PM, Karl Williamson wrote:
> [...]

Now done by these two commits

  commit 3363c7035ff1df0c3ffeae0cd18bb86cc39d62e4
  Author: Karl Williamson <khw@cpan.org>
  Date:   Thu Sep 26 21:38:46 2019 -0600

      regex: Create and handle LEXACT nodes

      See the previous commit for info on these.

      I am not changing trie code to recognize these at this time.

  commit ae06e581c6e9944620eed4980fe89a3749886ed0
  Author: Karl Williamson <khw@cpan.org>
  Date:   Wed Sep 25 10:12:32 2019 -0600

      Add regnode LEXACT, for long strings

      This commit adds a new regnode for strings that don't fit in a regular
      one, and adds a structure for that regnode to use.  Actually using them
      is deferred to the next commit.

      This new regnode structure is needed because the previous structure only
      allows for an 8 bit length field, 255 max bytes.  This commit puts the
      length instead in a new field, the same place single-argument regnodes
      put their argument.  Hence this long string is an extra 32 bits of
      overhead, but at no string length is this node ever bigger than the
      combination of the smaller nodes it replaces.

      I also considered simply combining the original 8 bit length field
      (which is now unused) with the first byte of the string field to get a
      16 bit length, and have the actual string be offset by 1.  But I
      rejected that because it would mean the string would usually not be
      aligned, slowing down memory accesses.

      This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold,
      using 4K fewer overhead bytes to do so.  That means it can handle
      strings containing 262000 bytes.  The comments give ideas for expanding
      that should it become necessary or desirable.

      Besides the space advantage, any hardware acceleration in memcmp
      can be done in much bigger chunks, and otherwise the memcmp inner loop
      (often written in assembly) will run many more times in a row, and our
      outer loop that calls it, correspondingly fewer.
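
A rough sense of the node-count and overhead savings can be had from the 
figures quoted in that second commit message (4 header bytes per EXACT 
node, at most 255 string bytes each, one extra 32-bit length word for 
LEXACT).  The sketch below only does that arithmetic; the constants are 
taken from the commit text above, not read out of regcomp.h.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Figures from the commit message, not from the perl source: a regular
    # EXACT node has 4 bytes of header and holds at most 255 string bytes;
    # an LEXACT node pays one extra 32-bit length word of header.
    my $EXACT_HEADER  = 4;
    my $EXACT_MAX     = 255;
    my $LEXACT_HEADER = $EXACT_HEADER + 4;

    for my $len (300, 10_000, 262_000) {
        my $nodes    = ceil($len / $EXACT_MAX);
        my $overhead = $nodes * $EXACT_HEADER;
        printf "%7d-byte literal: %4d EXACT nodes (%5d header bytes)"
             . " vs 1 LEXACT node (%d header bytes)\n",
            $len, $nodes, $overhead, $LEXACT_HEADER;
    }
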
Karl Williamson, 9/29/2019 5:48:20 PM
This is impressive, Karl!


Do you think there's a benchmark we can run to see the benefits this
could yield?
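
One crude benchmark sketch, assuming the interesting case is a long fixed 
literal inside a large target: run the same script under a perl built 
before and after these commits and compare the reported rates.  The 10 KB 
needle and roughly 10 MB target are arbitrary sizes, and the results will 
also reflect whatever other optimizations the engine applies.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Repeatedly match a multi-kilobyte fixed literal sitting at the end
    # of a large target string; run under both perls and compare rates.
    my $needle = 'ACGT' x 2_500;                   # 10 KB literal
    my $target = ('N' x 10_000_000) . $needle;     # ~10 MB target
    my $re     = qr/\Q$needle\E/;

    cmpthese(-3, {
        long_literal => sub { $target =~ $re or die "no match" },
    });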


xsawyerx, 9/29/2019 8:22:42 PM