Matching subpatterns in any order, conjunctions, negated matches

Regex engines by their nature care a lot about order, but I
occasionally want to relax that to match for multiple
multicharacter subpatterns where the order of them doesn't
matter.

Frequently the simplest thing to do is just to do multiple
matches.   Let's say you're looking for words that have a "qu", a
"th", and also, say, an "ea".  This works:

  my $DICT  = "/usr/share/dict/american-english";
  my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
  say @hits;
  # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's
  #  earthquakes]


It could be useful to be able to do it as one match, though: for
example, you might be using someone else's routine which takes a
single regex as argument.  I've been known to write things like
this:

  my regex qu_th_ea   {  [ qu .*? th .*? ea ] |
                         [ qu .*? ea .*? th ] |
                         [ th .*? qu .*? ea ] |
                         [ th .*? ea .*? qu ] |
                         [ ea .*? th .*? qu ] |
                         [ ea .*? qu .*? th ]  };
  my @hits = $DICT.IO.open( :r ).lines.grep({/<qu_th_ea>/});

That works, but it gets unwieldy quickly if you need to scale up
the number of subpatterns.

Recently though, I noticed the "conjunctions" feature, and it
occurred to me that this could be a very neat way of handling
these things:

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* & .* ea .* ] $ };

That's certainly much better, though unfortunately each element
of the conjunction needs to match a substring of the same length,
so pretty frequently you're stuck with the visual noise of
bracketing subpatterns with pairs of .*

Where things get interesting is when you want a negated match of
one of the subpatterns.  One of the things I like about the first
approach using multiple chained greps is that it's easy to do a
reverse match.  What if you want words with "qu" and "th" but
want to *skip* ones with an "ea"?

  my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
  # [Asquith discotheque discotheque's discotheques quoth]

To do that in one regex, it would be nice if there were some sort
of adverb to do a reverse match, like say :not; then it
would be straightforward (NOTE: NON-WORKING CODE):

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ :not .* ea .* ] ] $ };

But since there isn't an adverb like this, what else might we do?
The best idea I can come up with is this:

  my regex qu_th_ea { ^ [ .* qu .* & .* th .* &  [ <!after ea> . ]*  ] $ };

Where the third element of the conjunction should match only if
none of the characters follow "ea".  There's an oddity here
though in that I think this can get confused by things like an
"ea" that *precedes* the conjunction.

So, the question then is: is there a neater way to embed a
subpattern in a regex that does a negated match?
doomvox
5/16/2020 2:32:50 AM

On Fri, May 15, 2020 at 07:32:50PM -0700, Joseph Brenner wrote:
> Regex engines by their nature care a lot about order, but I
> occasionally want to relax that to match for multiple
> multicharacter subpatterns where the order of them doesn't
> matter.
>
> Frequently the simplest thing to do is just to just do multiple
> matches.   Let's say you're looking for words that have a "qu" a
> "th" and also, say an "ea".  This works:
>
>   my $DICT  = "/usr/share/dict/american-english";
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({/ea/});
>   say @hits;
>   # [bequeath bequeathed bequeathing bequeaths earthquake earthquake's
> earthquakes]

Would something like this work for you?

  /^ <?before .* "qu" > <?before .* "th" > <?before .* "ea" > /

> Where things get interesting is when you want a negated match of
> one of the subpatterns.  One of the things I like about the first
> approach using multiple chained greps is that it's easy to do a
> reverse match.  What if you want words with "qu" and "th" but
> want to *skip* ones with an "ea"?
>
>   my @hits = $DICT.IO.open( :r ).lines.grep({/qu/}).grep({/th/}).grep({!/ea/});
>   # [Asquith discotheque discotheque's discotheques quoth]

Maybe something like this? (note the "!" instead of "?")

  /^ <?before .* "qu" > <?before .* "th" > <!before .* "ea" > /
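
As a quick check of the two patterns above, here is a minimal sketch that runs them against a small hand-picked word list instead of the dictionary file. The word list and the regex names (all-three, not-ea) are made up for illustration, and the quotes around the literals are dropped, which should not change what is matched:

```raku
# Hypothetical sample words standing in for the dictionary file.
my @words = <bequeath earthquake Asquith discotheque quoth queen theater>;

# Each <?before ...> is zero-width, so all three lookaheads start at ^.
my regex all-three { ^ <?before .* qu> <?before .* th> <?before .* ea> }

# Same idea, but the third lookahead is negated: no "ea" anywhere.
my regex not-ea    { ^ <?before .* qu> <?before .* th> <!before .* ea> }

say @words.grep(/<all-three>/);   # (bequeath earthquake)
say @words.grep(/<not-ea>/);      # (Asquith discotheque quoth)
```

Because every assertion is anchored at the start of the string, the order in which the subpatterns appear in the word does not matter.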

G'luck,
Peter

--
Peter Pentchev  roam@ringlet.net roam@debian.org pp@storpool.com
PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13

roam
5/16/2020 1:36:53 PM
On Fri, May 15, 2020 at 7:33 PM Joseph Brenner <doomvox@gmail.com> wrote:
> So, the question then is: is there a neater way to embed a
> subpattern in a regex that does a negated match?

My two cents: Here's that GitHub issue Yary opened on creating custom
character classes. Some ideas expressed may be useful for figuring out
how to do a "negative regex":

https://github.com/Raku/problem-solving/issues/97

One take-home message that I've ascertained from the GitHub discussion
above is that Raku/Perl6 regexes are run "wide-open" across all
umpteen Unicode hyperplanes (blocks/scripts), i.e. for a 'positive'
regex, matches will only be limited by the source document you're
scanning against. So the match below against Bengali digits works
right out of the box (REPL below):

> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / \d+ /;
｢০১২৩৪৫৬৭৮৯｣
>

Negative regexes are more problematic: with the example above, do you
really want to return all "non-digit" characters? Or only alphanumeric
characters without the numeric component? Or only English ('Latin')
alphanumeric characters without the numeric component?

> say $/ if '০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
()
> say $/ if 'অআই ০১২৩৪৫৬৭৮৯' ~~ / <-[\d]>+ /;
｢অআই ｣
>

Above the Bengali letters "a", "aa", and "i" are (correctly) returned
by the second regex test. However in the contrived example below,
where I intersperse Latin letters with Bengali numbers ('a০b১c২') and
wrap the construct in angle brackets, I see a match of
"Latin-non-digits" adjacent to "non-Latin-digits". This says to me
that the call to (filter of) "<:Script<Latin>" does NOT distribute
over the two different digit requirements, "-[\d]" and "+[\d]".
Moreover, I don't know HOW to get it to distribute over the two
different digit requirements, even if I "expand" the regex:

> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]+[\d]>+ /;
｢a০b১c২｣
> say $/ if 'a০b১c২' ~~ / <:Script<Latin>-[\d]> <:Script<Latin>+[\d]> /;
｢a০｣
>

Anyway Joseph, I like your point about using a conjunction and/or
adverb to do a reverse match. I just: 1) wanted to expand the
conversation to Unicode scripts/blocks, and 2) I posted the contrived
("interspersed") examples above in the hope that someone will reply on
the list to explain what I'm doing wrong (because I don't know how to
subset the second half of the regex to only return **Latin digits**).

Best Regards, Bill.
perl6
5/16/2020 8:42:45 PM
This is pretty interesting, though I think you're talking about a
different subject... I was talking about cases where the
sub-patterns are more than one character long.  It's true that if
you were interested in single-character sub-patterns, then you
could get close with character classes and negated character
classes.  For example, suppose you were not only interested in
matching the word "the" but you were also interested in finding
it even if it had typos like "teh" or "hte" or something.  Then a
pattern like /<[the]>**3/ would definitely find every permutation
of "t", "h", and "e", and it could be it's good enough for you,
but it would also match cases with duplications like "ttt", "hhh",
"thh" and so on, so it's a broader match than the techniques I
was talking about (it gets the combinations, not just the
permutations).


On 5/16/20, William Michels <wjm1@caa.columbia.edu> wrote:
> Anyway Joseph, I like your point about using a conjunction and/or
> adverb to do a reverse match. I just: 1) wanted to expand the
> conversation to Unicode scripts/blocks, and 2) I posted the contrived
> ("interspersed") examples above in the hope that someone will reply on
> the list to explain what I'm doing wrong (because I don't know how to
> subset the second half of the regex to only return **Latin digits**).
>
> Best Regards, Bill.
>
doomvox
5/17/2020 12:49:08 AM
Yes, both of those work, and arguably they're a little cleaner
looking than my conjunction approach-- though it's not necessarily any
easier to think about.  It looks like a pattern that's matching
for three things in order, but the zero-widthness of the "before"
lets them all work on top of each other.

I keep thinking there's an edge case in these before/after tricks that
might matter if we weren't matching the one-word-per-line format of
the unix dictionaries, but I need to think about that a little more...



 Peter Pentchev <roam@ringlet.net> wrote:
> Would something like this work for you?
>
>   /^ <?before .* "qu" > <?before .* "th" > <?before .* "ea" > /
>
> Maybe something like this? (note the "!" instead of "?")
>
>   /^ <?before .* "qu" > <?before .* "th" > <!before .* "ea" > /
doomvox
5/17/2020 12:53:04 AM

On Sat, May 16, 2020 at 05:53:04PM -0700, Joseph Brenner wrote:
> I keep thinking there's an edge case in these before/after tricks that
> might matter if we weren't matching the one-word-per-line format of
> the unix dictionaries, but I need to think about that a little more...

Actually, there is, and I conveniently did not mention it :) It's the
case when the patterns may overlap: if you do the '<?before' thing with
'the' and 'entrance', you might match 'thentrance', which, depending on
your use case, might not be ideal.

I've thought a little about another method: splitting the string using
one of the patterns as a separator, then splitting each of the resulting
substrings using the next one, and so on until you get to the last one,
where you check whether any of the ministrings contains it. It would
have to be done carefully, though, with a special split-like function
that finds each occurrence of the pattern and returns a ("before",
"after") tuple for it, to avoid another kind of problem with overlaps:
if you split "the father" on all of the occurrences of "the" at the
same time, you *will* miss "father" :)
So you need a special sort of split function that will split
"the father" first as ("", " father"), then as ("the fa", "r"), and return
all of the non-empty results (" father", "the fa", "r")... I'm not sure
this will be very efficient. OK, so as a microoptimization it may return
only the results that are at least as long as the shortest pattern
remaining, but it still sounds weird.
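
The split-like helper described above could be sketched roughly as follows. This is only an illustration of the splitting step, not the whole multi-pattern search: the name split-around is made up here, it handles literal strings rather than regexes, and it relies on Str.indices with the :overlap adverb to find every occurrence:

```raku
# Hypothetical helper: for each occurrence of $needle in $s, return
# the text before it and the text after it as a two-element list.
sub split-around(Str $s, Str $needle) {
    $s.indices($needle, :overlap).map(-> $i {
        ( $s.substr(0, $i), $s.substr($i + $needle.chars) )
    }).List
}

for split-around("the father", "the") -> ($before, $after) {
    say "before={$before.raku} after={$after.raku}";
}
# before="" after=" father"
# before="the fa" after="r"
```

Keeping each (before, after) pair together, rather than flattening everything into one list, makes it possible to tell which pieces came from the same occurrence when the next pattern is searched for.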

G'luck,
Peter

--
Peter Pentchev  roam@ringlet.net roam@debian.org pp@storpool.com
PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13

roam
5/17/2020 1:33:13 AM
Peter Pentchev wrote:

> Actually, there is, and I conveniently did not mention it :) It's the
> case when the patterns may overlap: if you do the '<?before' thing with
> 'the' and 'entrance', you might match 'thentrance', which, depending on
> your use case, might not be ideal.

That's a good point, but it's true that it's a matter of
your actual use case.   Using your technique to look for words
with a "qu", a "ue" and an "en":

  my regex qu_th_not_ea_3b_pos
     { ^ <?before .* "qu" > <?before .* "ue" > <?before .* "en" > };

That matches an overlapping case like "queen", but that strikes me as
okay-- it might surprise someone, but it'd be a pretty minor surprise
(unless maybe if you were playing scrabble and trying to get rid
of all your letters...).

Also, things like my multipattern triple-grep approach would show
the same behavior.

The kind of edge case I was talking about was with things like my
conjunction approach, using a negated after to get a negative match:

  my regex qu_th_not_ea
    { ^ [ .* qu .* & .* th .* &  [ <!after ea>. ]* ] $ };

That works pretty well, but it would pass a string like "quothea"
in error, because that after is making sure that none of the
characters in the string follow a "ea", and when the "ea" is at
the end then there is nothing that follows.   Your idiom working
off of ^ is better, because every string has a beginning...
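
To make the edge case concrete, here is a small sketch comparing the two approaches on "quoth" and on the made-up string "quothea". The regex names are invented for this example; the first regex is the conjunction from above and the second is Peter's lookahead idiom:

```raku
# Conjunction with a negated lookbehind, as in the regex above.
my regex conj-not-ea { ^ [ .* qu .* & .* th .* & [ <!after ea> . ]* ] $ }

# Lookahead idiom anchored at ^, as Peter suggested.
my regex look-not-ea { ^ <?before .* qu> <?before .* th> <!before .* ea> }

for <quoth quothea> -> $word {
    my $conj = so $word ~~ /<conj-not-ea>/;
    my $look = so $word ~~ /<look-not-ea>/;
    say "$word  conjunction: $conj  lookahead: $look";
}
```

Both should accept "quoth", but only the lookahead version is expected to reject "quothea", since no character in that string comes after the trailing "ea".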

And of course, if you were matching for words in multiword text
then you'd just use that word boundary, so that's no problem.

(I was also worrying vaguely about some other things that don't
pan out, like what if there was an additional "qu" in front of
what you were trying to match and there was no way to subdivide
by lines or words or something-- but what could that possibly
mean?  In that case the "qu" is just part of the string and fair
game to match against, so...)



doomvox
5/17/2020 6:40:30 PM