Regular expression substitution matching all possible matches

This is a question about perl regexp usage.  I hope it's okay to post it here;
it seemed too advanced for the beginners list.  It may lead on to a possible
feature request for the perl core.

I would like to perform match-and-substitute finding all possible matches of
a regular expression within a string.  By 'all possible matches' I mean all
productions from that regular expression, not just the production that perl's
regexp engine happened to find first.  So given the regexp /\Ax*/ and the
string 'xxx', I would expect to see matches '', 'x', 'xx', and 'xxx'.

At first sight it appears that the /g flag will provide this, but in fact it
just tells perl to keep matching once it has consumed some of the input string,
while I want it to keep backtracking and finding more possible matches even
after a success.

The perlre manual page suggests that using (*FAIL) together with ?{} can do it:

    'xxx' =~ /\Ax*(?{say $&})(*FAIL)/;

This gives the expected result.  Great!  But now how to do a s/// substitution?
I would like to find all possible matches and get all resulting substitutions.
So far I have hand-rolled code for that in perl:

    my $string = 'xxx';
    my @m;
    $string =~
/\A(x*)(?{push @m, [ ${^PREMATCH}, ${^POSTMATCH}, [ @- ], [ @+ ] ]})(*FAIL)/p;

    my $replacement = '.$1.';
    foreach (@m) {
        my ($pre, $post, $starts, $ends) = @$_;

        my @captured;
        foreach my $n (1 .. $#$starts) {
            my $start = $starts->[$n] // die;
            my $end = $ends->[$n] // die;
           push @captured, substr($string, $start, $end - $start);
        }

        (my $r = $replacement)
            =~ s{\$(\d+)}{$captured[$1 - 1] // die}ge;
        say $pre . $r . $post;
    }

This appears to work but it is unsatisfactory.  Can I coax perl's regexp
engine into finding all possible matches and gathering them into an array?

Would there be an appetite for adding a new regexp flag to find all possible
matches?  It would be a useful teaching tool, and would also make it possible
to find the longest or shortest match.

-- 
Ed Avis <eda@waniasset.com>



0
eda
11/3/2010 1:37:45 PM
perl.perl5.porters 48287 articles. 1 followers. Follow

20 Replies
896 Views

Similar Articles

[PageSpeed] 6
Get it on Google Play
Get it on Apple App Store

Ed Avis wrote:
>This gives the expected result.  Great!  But now how to do a s/// substitution?
>I would like to find all possible matches and get all resulting substitutions.

You can do this a bit more neatly by generating the substituted strings
in the (?{}) block, where $1 et al are still live.

>Would there be an appetite for adding a new regexp flag to find all possible
>matches?

Not from here.  It's a strange requirement, especially the one where
you're gathering all resulting substituted strings, and the mechanism
currently available seems an appropriate way to make the hard thing
possible.

-zefram
0
zefram
11/3/2010 1:48:51 PM
Zefram <zefram <at> fysh.org> writes:

[generating all possible regexp matches, and resulting substitutions]

>You can do this a bit more neatly by generating the substituted strings
>in the (?{}) block, where $1 et al are still live.

You're right.  I didn't use these numbered variables because I couldn't work
out how to get all of them in general (is there an array which contains $1, $2
etc?).  So I used @- and @+ instead.

However, if we allow the use of eval() then the code can be simplified:

    my $string = 'xxx';
    my @r;
    my $replacement = '.$1.';
$string =~
/\A(x*)(?{push @r, ${^PREMATCH} . eval(qq{"$replacement"}) .
${^POSTMATCH}})(*FAIL)/p;

This is unsafe (what if $replacement contains a " character?) but works.

>>Would there be an appetite for adding a new regexp flag to find all possible
>>matches?
>
>Not from here.  It's a strange requirement, especially the one where
>you're gathering all resulting substituted strings,

To give a little context, the code I'm writing is a series of rewrite rules to
translate an input string written in a simple language into a different
language.  The LHS of each rule is a regexp and the RHS may then recursively
translate a captured substring.  Each rule is matched against the string in
turn to see which ones fire.

Now I needed to translate strings looking like 'Foo_Lim' in the input language
to "Limit(P('Foo'))" in the output language, where P('Foo') is a recursive call
to translate the string 'Foo'.  The rules are thus

    qr/\A(.+)_Lim/

    qr/\AFoo\z/    => { 42 }

Applying both of these to the input string, the first rule matches and
in the recursive call the second one matches, so the result is 'Limit(42)'.

But now suppose I want to also handle 'FooLim' without the underscore.  No
problem, simply make the regular expression in the first rule more permissive:
change from /\A(.+)_Lim/ to /\A(.+)_?Lim/.  Clearly this new regexp matches
(or generates) all strings that the old one did, so this should be a safe
change, it seems at first.  However, it unexpectedly stops the original
'Foo_Lim' input string from working, because the (.+) consumes the _ character
too.

In this particular case using a nongreedy match will work: /\A(.+?)_Lim/.
In general, working with a set of all matches will avoid the need to care about
marking greedy or nongreedy.  It would also let you know about ambiguous parses.

As you say, gathering all the substituted strings is a little weird.  If you
were doing some sophisticated parsing or rule-matching using regexps, you would
not normally try to shoehorn it into the right-hand-side of an s///.  So
perl's core doesn't need a builtin for 'gather all possible substitutions'.

However, it would be useful to expose the $1,$2 substitution to the user, as a
safe and perhaps faster alternative to the eval() above.  In other words to
open up the s/// mechanism and let people call it by hand:

    if (/(\w+),(\w+)/) {
        # In here $1 and $2 are set.
        $result = perl_builtin_substitution 'replacement_pattern_$1_$2';
    }

This opens the possibility for dynamically generated substitution patterns,
in a more hygienic way than using eval() or the /e flag.

-- 
Ed Avis <eda@waniasset.com>

0
eda
11/3/2010 4:28:56 PM
Somehow I messed up the pseudocode in my earlier message.
I meant to say

 Now I needed to translate strings looking like 'Foo_Lim' in the input language
 to "Limit(P('Foo'))" in the output language, where P('Foo') is a recursive call
 to translate the string 'Foo'.  The rules are thus
  
     qr/\A(.+)_Lim/ => { 'Limit(' . P($1) . ')' }
 
     qr/\AFoo\z/    => { 42 }


0
eda
11/3/2010 4:36:14 PM
Ed Avis <eda@waniasset.com> wrote:
:This is a question about perl regexp usage.  I hope it's okay to post it here;
:it seemed too advanced for the beginners list.  It may lead on to a possible
:feature request for the perl core.
:
:I would like to perform match-and-substitute finding all possible matches of
:a regular expression within a string.  By 'all possible matches' I mean all
:productions from that regular expression, not just the production that perl's
:regexp engine happened to find first.  So given the regexp /\Ax*/ and the
:string 'xxx', I would expect to see matches '', 'x', 'xx', and 'xxx'.

This was specced (at least for the 'match' part) to go into perl6, and it
looks like Perl6::Rules supports it now with the ":e" or ":exhaustive"
modifier. It doesn't mention whether it supports substitution in this case -
and S05 doesn't mention this case either that I can see (I've asked for
clarification).

My gut feel is that even the matching part is unlikely to get implemented
within the perl5 regexp engine: currently, at the point a match is found,
the information required to allow further matching can be encapsulated
in a number plus a flag (pos(), and "zero-length match at this pos");
for :exhaustive support you need the full backtracking state, which
(as far as I can see) implies reengineering on a scale hard to justify
for the population of interested parties (which right now seems to be
mostly you and me).

Hugo
0
hv
11/4/2010 8:24:29 AM
On Thu, Nov 04, 2010 at 08:24:29AM +0000, hv@crypt.org wrote:
> Ed Avis <eda@waniasset.com> wrote:
> :This is a question about perl regexp usage.  I hope it's okay to post it here;
> :it seemed too advanced for the beginners list.  It may lead on to a possible
> :feature request for the perl core.
> :
> :I would like to perform match-and-substitute finding all possible matches of
> :a regular expression within a string.  By 'all possible matches' I mean all
> :productions from that regular expression, not just the production that perl's
> :regexp engine happened to find first.  So given the regexp /\Ax*/ and the
> :string 'xxx', I would expect to see matches '', 'x', 'xx', and 'xxx'.

Curiously enough, I expressed a same wish to Yves earlier this week
in private communication. But then I started wondering. Suppose there
would be such a hypothetical flag (say /E), what should:

    "12345" =~ /\d+\d*/E;

return? Should it exhaustively try all possible submatches for \d+ and \d*?

    "12345" =~ /\d+\d*(?{say $&})(*FAIL)/

will print out "12345" five times. Or should it do the equivalent of

    my %seen;
    "12345" =~ /\d+\d*(?{say $& unless $seen {$- [0] => $&} ++})(*FAIL)/;

which prints "12345" only once. But one might be interested in all the
possible $1 and $2, so perhaps you want the equivalent of:

    my %seen;
    "12345" =~ /(\d+)\d*\d*(\d+)
                (?{say "$1 $2" unless $seen {@- => @+} ++})(*FAIL)/x;

Other questions: what's pos() going to be afterwards? How will it interact
with /gc? and \G? 

> 
> This was specced (at least for the 'match' part) to go into perl6, and it
> looks like Perl6::Rules supports it now with the ":e" or ":exhaustive"
> modifier. It doesn't mention whether it supports substitution in this case -
> and S05 doesn't mention this case either that I can see (I've asked for
> clarification).
> 
> My gut feel is that even the matching part is unlikely to get implemented
> within the perl5 regexp engine: currently, at the point a match is found,
> the information required to allow further matching can be encapsulated
> in a number plus a flag (pos(), and "zero-length match at this pos");
> for :exhaustive support you need the full backtracking state, which
> (as far as I can see) implies reengineering on a scale hard to justify
> for the population of interested parties (which right now seems to be
> mostly you and me).
> 

I don't know much about the internals of the regexp engine, so you may very
well be right, but considering that /(?{say $&})(*FAIL)/ does do the trick
of an exhaustive search, it seems to me much of the needed support has to be
already present.



Abigail
0
abigail
11/4/2010 9:23:23 AM
Abigail <abigail <at> abigail.be> writes:

>>:I would like to perform match-and-substitute finding all possible matches of
>>:a regular expression within a string.

>Curiously enough, I expressed a same wish to Yves earlier this week
>in private communication. But then I started wondering. Suppose there
>would be such a hypothetical flag (say /E), what should:
>
>    "12345" =~ /\d+\d*/E;
>
>return? Should it exhaustively try all possible submatches for \d+ and \d*?
>
>    "12345" =~ /\d+\d*(?{say $&})(*FAIL)/
>
>will print out "12345" five times.

Yes, that's the right result IMHO.  As you say, if the pattern includes
capturing parens, then you must get all possible values of $1, $2 etc.
And clearly, turning a non-capturing (?:...) into a capturing (...) should
not affect the semantics of the match itself.  So yes, return '12345' five
times, which by the way indicates that the regexp /\d+\d*/ is ambiguous.

>Other questions: what's pos() going to be afterwards? How will it interact
>with /gc? and \G? 

Perhaps forbid /E together with these flags until a useful semantics for this
case can be agreed.

>considering that /(?{say $&})(*FAIL)/ does do the trick
>of an exhaustive search, it seems to me much of the needed support has to be
>already present.

Yes, that seems to be the case.  But it might be necessary to gather all the
matches first, then return them all at once.  As I understand it, if you do

    while ('xxx' =~ /\G(x)/gc) {
        say $1;
    }

then the regexp engine may helpfully return results one at a time as it finds
them, keeping state between each match.  That might not be possible for the
'all possible matches' mode with backtracking.  However, since perl doesn't
really support lazy strings anyway, the question of 'one at a time' versus 'all
at once' is an implementation detail not really apparent to the user.

-- 
Ed Avis <eda@waniasset.com>

0
eda
11/4/2010 10:10:38 AM
On 2010-11-04 10:23, Abigail wrote:

> Suppose there
> would be such a hypothetical flag (say /E), what should:
>
>      "12345" =~ /\d+\d*/E;
>
> return?

I would read \d+\d* as \d+, then try for each position because there is 
no anchor, so the return is qw/ 12345 2345 345 45 5 /.

The return of

        "12345" =~ /\d+?/E;

would have more elements:

     qw/ 1 12 123 1234 12345
            2  23  234  2345
                3   34   345
                     4    45
                           5
       /

-- 
Ruud
0
rvtol
11/4/2010 10:39:45 AM
On Thu, Nov 04, 2010 at 11:39:45AM +0100, Dr.Ruud wrote:
> 
> I would read \d+\d* as \d+, then try for each position because there is 
> no anchor, so the return is qw/ 12345 2345 345 45 5 /.
> 
> The return of
> 
>        "12345" =~ /\d+?/E;
> 
> would have more elements:
> 
>     qw/ 1 12 123 1234 12345
>            2  23  234  2345
>                3   34   345
>                     4    45
>                           5
>       /

Why would you expect /\d+/E and /\d+?/E to return a different set of
results?  If you're doing an exhaustive search, \d+ and \d+? match the same
substrings, just in a different order.

Ronald
0
rjk
11/4/2010 7:10:41 PM
On 2010-11-04 20:10, Ronald J Kimball wrote:
> On Thu, Nov 04, 2010 at 11:39:45AM +0100, Dr.Ruud wrote:

>> I would read \d+\d* as \d+, then try for each position because there is
>> no anchor, so the return is qw/ 12345 2345 345 45 5 /.
>>
>> The return of
>>
>>         "12345" =~ /\d+?/E;
>>
>> would have more elements:
>>
>>      qw/ 1 12 123 1234 12345
>>             2  23  234  2345
>>                 3   34   345
>>                      4    45
>>                            5
>>        /
>
> Why would you expect /\d+/E and /\d+?/E to return a different set of
> results?  If you're doing an exhaustive search, \d+ and \d+? match the same
> substrings, just in a different order.

That just depends on your implementation of greediness.

As I wrote, I would implement exhaustive to mean "try for each 
position", basically making the string shorter at each step.
 From there, apply greediness as is.

-- 
Ruud
0
rvtol
11/5/2010 12:06:23 AM
On Fri, Nov 05, 2010 at 01:06:23AM +0100, Dr.Ruud wrote:
> On 2010-11-04 20:10, Ronald J Kimball wrote:
>> On Thu, Nov 04, 2010 at 11:39:45AM +0100, Dr.Ruud wrote:
>
>>> I would read \d+\d* as \d+, then try for each position because there is
>>> no anchor, so the return is qw/ 12345 2345 345 45 5 /.
>>>
>>> The return of
>>>
>>>         "12345" =~ /\d+?/E;
>>>
>>> would have more elements:
>>>
>>>      qw/ 1 12 123 1234 12345
>>>             2  23  234  2345
>>>                 3   34   345
>>>                      4    45
>>>                            5
>>>        /
>>
>> Why would you expect /\d+/E and /\d+?/E to return a different set of
>> results?  If you're doing an exhaustive search, \d+ and \d+? match the same
>> substrings, just in a different order.
>
> That just depends on your implementation of greediness.
>
> As I wrote, I would implement exhaustive to mean "try for each  
> position", basically making the string shorter at each step.
> From there, apply greediness as is.


So, what would be the return value of

    "123" =~ /\d+\w*/E

or

    "123" =~ /\d*\d*?\d*/E

?


Abigail

0
abigail
11/6/2010 3:02:00 PM
--001636c5995a0b61ae0494644b69
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Nov 4, 2010 at 8:06 PM, Dr.Ruud
<rvtol+usenet@isolution.nl<rvtol%2Busenet@isolution.nl>
> wrote:

> As I wrote, I would implement exhaustive to mean "try for each position",
> basically making the string shorter at each step.
> From there, apply greediness as is.
>

That differs from the "usual" existing method of generating all matches.

>perl -E"'12345' =~ /\d+(?{ say $& })(?!)/;
12345
1234
123
12
1
2345
234
23
2
345
34
3
45
4
5

>perl -E"'12345' =~ /\d+?(?{ say $& })(?!)/;
1
12
123
1234
12345
2
23
234
2345
3
34
345
4
45
5

--001636c5995a0b61ae0494644b69--
0
ikegami
11/6/2010 3:46:23 PM
On Thu, Nov 04, 2010 at 11:39:45AM +0100, Dr.Ruud wrote:
> On 2010-11-04 10:23, Abigail wrote:
>
>> Suppose there
>> would be such a hypothetical flag (say /E), what should:
>>
>>      "12345" =~ /\d+\d*/E;
>>
>> return?
>
> I would read \d+\d* as \d+, then try for each position because there is  
> no anchor, so the return is qw/ 12345 2345 345 45 5 /.

So, you think that

    "123123123" =~ /(\d+\d*)\1\1/

matching is a bug?


Note that

    "12345" =~ /(?>\d+\d*)/E 

probably should return qw/ 12345 2345 345 45 5 /.


Abigail
0
abigail
11/6/2010 4:05:51 PM
On 2010-11-06 17:05, Abigail wrote:
 > On Thu, Nov 04, 2010 at 11:39:45AM +0100, Dr.Ruud wrote:
 >> On 2010-11-04 10:23, Abigail wrote:

 >>> Suppose there
 >>> would be such a hypothetical flag (say /E), what should:
 >>>
 >>>       "12345" =~ /\d+\d*/E;
 >>>
 >>> return?
 >>
 >> I would read \d+\d* as \d+, then try for each position because there is
 >> no anchor, so the return is qw/ 12345 2345 345 45 5 /.
 >
 > So, you think that
 >
 >      "123123123" =~ /(\d+\d*)\1\1/
 >
 > matching is a bug?

No, to me those \1's are "anchoring".

        "111111111" =~ /(\d+)\1\1/     ->  111
        "111111111" =~ /(\d+)\1\1/E    ->  111
        "111111111" =~ /(\d+)\1\1/Eg   ->  111

        "111111111" =~ /(\d+?)\1\1/    ->  1
        "111111111" =~ /(\d+?)\1\1/g   ->  1,1,1
        "111111111" =~ /(\d+?)\1\1/Eg  ->
            1,11,111, 1,11, 1,11, 1,11, 1,11, 1

Or would /E be uniqifying?


 > Note that
 >
 >      "12345" =~ /(?>\d+\d*)/E
 >
 > probably should return qw/ 12345 2345 345 45 5 /.

OK

-- 
Ruud
0
rvtol
11/6/2010 9:07:25 PM
On Fri, Nov 05, 2010 at 01:06:23AM +0100, Dr.Ruud wrote:
> On 2010-11-04 20:10, Ronald J Kimball wrote:
> >Why would you expect /\d+/E and /\d+?/E to return a different set of
> >results?  If you're doing an exhaustive search, \d+ and \d+? match the same
> >substrings, just in a different order.
> 
> That just depends on your implementation of greediness.
> 
> As I wrote, I would implement exhaustive to mean "try for each 
> position", basically making the string shorter at each step.
> From there, apply greediness as is.

I guess I don't understand your implementation of greediness.  \d+ matches
1 or more digits, as many as possible; \d+? matches 1 or more digits, as
few as possible.  You're applying backtracking for the non-greedy match,
but not the greedy match.  There's no reason for that inconsistency.

Ronald

0
rjk
11/7/2010 12:03:28 AM
>So, what would be the return value of

>    "123" =~ /\d+\w*/E

>or

>    "123" =~ /\d*\d*?\d*/E

>?

Is there some reason that isn't just:

    "123" =~ m{
	(\d+)(\w*)
	(?{ print "<$1> <$2>\n" })
	(*FAIL)
    }m

    <123> <>
    <12> <3>
    <12> <>
    <1> <23>
    <1> <2>
    <1> <>
    <23> <>
    <2> <3>
    <2> <>
    <3> <>

and

    "123" =~ m{
	(\d*)(\d*?)(\d*)
	(?{ print "<$1> <$2> <$3>\n" })
	(*FAIL)
    }x;

    <123> <> <>
    <12> <> <3>
    <12> <> <>
    <12> <3> <>
    <1> <> <23>
    <1> <> <2>
    <1> <> <>
    <1> <2> <3>
    <1> <2> <>
    <1> <23> <>
    <> <> <123>
    <> <> <12>
    <> <> <1>
    <> <> <>
    <> <1> <23>
    <> <1> <2>
    <> <1> <>
    <> <12> <3>
    <> <12> <>
    <> <123> <>
    <23> <> <>
    <2> <> <3>
    <2> <> <>
    <2> <3> <>
    <> <> <23>
    <> <> <2>
    <> <> <>
    <> <2> <3>
    <> <2> <>
    <> <23> <>
    <3> <> <>
    <> <> <3>
    <> <> <>
    <> <3> <>
    <> <> <>

--tom
0
tchrist
11/7/2010 1:09:43 AM
On 7 November 2010 02:09, Tom Christiansen <tchrist@perl.com> wrote:
>>So, what would be the return value of
>
>> =A0 =A0"123" =3D~ /\d+\w*/E
>
>>or
>
>> =A0 =A0"123" =3D~ /\d*\d*?\d*/E
>
>>?
>
> Is there some reason that isn't just:
>
> =A0 =A0"123" =3D~ m{
> =A0 =A0 =A0 =A0(\d+)(\w*)
> =A0 =A0 =A0 =A0(?{ print "<$1> <$2>\n" })
> =A0 =A0 =A0 =A0(*FAIL)
> =A0 =A0}m
>
> =A0 =A0<123> <>
> =A0 =A0<12> <3>
> =A0 =A0<12> <>
> =A0 =A0<1> <23>
> =A0 =A0<1> <2>
> =A0 =A0<1> <>
> =A0 =A0<23> <>
> =A0 =A0<2> <3>
> =A0 =A0<2> <>
> =A0 =A0<3> <>

Yes. I meant to mention that as well.

Yves


--=20
perl -Mre=3Ddebug -e "/just|another|perl|hacker/"
0
demerphq
11/9/2010 8:05:10 AM
On Sat, Nov 06, 2010 at 07:09:43PM -0600, Tom Christiansen wrote:
> >So, what would be the return value of
> 
> >    "123" =~ /\d+\w*/E
> 
> >or
> 
> >    "123" =~ /\d*\d*?\d*/E
> 
> >?
> 
> Is there some reason that isn't just:
> 
>     "123" =~ m{
> 	(\d+)(\w*)
> 	(?{ print "<$1> <$2>\n" })
> 	(*FAIL)
>     }m
> 
>     <123> <>
>     <12> <3>
>     <12> <>
>     <1> <23>
>     <1> <2>
>     <1> <>
>     <23> <>
>     <2> <3>
>     <2> <>
>     <3> <>
> 
> and
> 
>     "123" =~ m{
> 	(\d*)(\d*?)(\d*)
> 	(?{ print "<$1> <$2> <$3>\n" })
> 	(*FAIL)
>     }x;
> 
>     <123> <> <>
>     <12> <> <3>
>     <12> <> <>
>     <12> <3> <>
>     <1> <> <23>
>     <1> <> <2>
>     <1> <> <>
>     <1> <2> <3>
>     <1> <2> <>
>     <1> <23> <>
>     <> <> <123>
>     <> <> <12>
>     <> <> <1>
>     <> <> <>
>     <> <1> <23>
>     <> <1> <2>
>     <> <1> <>
>     <> <12> <3>
>     <> <12> <>
>     <> <123> <>
>     <23> <> <>
>     <2> <> <3>
>     <2> <> <>
>     <2> <3> <>
>     <> <> <23>
>     <> <> <2>
>     <> <> <>
>     <> <2> <3>
>     <> <2> <>
>     <> <23> <>
>     <3> <> <>
>     <> <> <3>
>     <> <> <>
>     <> <3> <>
>     <> <> <>
> 



Yes, that's what I think. But my question was to Ruud, in the context of
his suggestion Perl should magically turn /\d+\d*/ into /\d+/. 



Abigail
0
abigail
11/9/2010 9:48:44 AM
* Ed Avis <eda@waniasset.com> [2010-11-03 14:35]:
> Can I coax perl's regexp engine into finding all possible
> matches and gathering them into an array?

Yes.

    my $str = "axxxabaxxxa";

    my ( $i, $j, @r ) = ( 0, 0 );

    $j++, $i = 0 while ( $r[@r] = $str )
        =~ s{ xx x? (??{ $i++ < $j ? "(?!)" : "" }) }{ | }x;

    pop @r;

}:->

Will you take the Faustian bargain? (O(n²) runtime instead of
O(n), for want of regex engine match state continuations.)

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>
0
pagaltzis
11/9/2010 12:48:02 PM
On Tue, Nov 09, 2010 at 01:48:02PM +0100, Aristotle Pagaltzis wrote:
> * Ed Avis <eda@waniasset.com> [2010-11-03 14:35]:
> > Can I coax perl's regexp engine into finding all possible
> > matches and gathering them into an array?
> 
> Yes.
> 
>     my $str = "axxxabaxxxa";
> 
>     my ( $i, $j, @r ) = ( 0, 0 );
> 
>     $j++, $i = 0 while ( $r[@r] = $str )
>         =~ s{ xx x? (??{ $i++ < $j ? "(?!)" : "" }) }{ | }x;
> 
>     pop @r;
> 
> }:->
> 
> Will you take the Faustian bargain? (O(n�) runtime instead of
> O(n), for want of regex engine match state continuations.)


What's 'n' in this case? Certainly not the length of the string;
I doubt the regexp engine is O (n^p) for any p, where n is the
length of the string:

   $_ = "1"  x (2 * $p);
   $r = "1*" x $p;
   /$r(?:x|y)/;

This has relative runtime for $p = 1 .. 10: 3, 15, 84, 495, 3003, 18564,
116280, 735471, 4686825, 30045015.


Abigail
0
abigail
11/9/2010 1:37:34 PM
Hi Abigail,

* Abigail <abigail@abigail.be> [2010-11-09 14:35]:
> On Tue, Nov 09, 2010 at 01:48:02PM +0100, Aristotle Pagaltzis wrote:
> > * Ed Avis <eda@waniasset.com> [2010-11-03 14:35]:
> > > Can I coax perl's regexp engine into finding all possible
> > > matches and gathering them into an array?
> >
> > Yes.
> >
> >     my $str = "axxxabaxxxa";
> >
> >     my ( $i, $j, @r ) = ( 0, 0 );
> >
> >     $j++, $i = 0 while ( $r[@r] = $str )
> >         =~ s{ xx x? (??{ $i++ < $j ? "(?!)" : "" }) }{ | }x;
> >
> >     pop @r;
> >
> > }:->
> >
> > Will you take the Faustian bargain? (O(n²) runtime instead of
> > O(n), for want of regex engine match state continuations.)
>
> What's 'n' in this case? Certainly not the length of the
> string; I doubt the regexp engine is O (n^p) for any p, where
> n is the length of the string:
>
>    $_ = "1"  x (2 * $p);
>    $r = "1*" x $p;
>    /$r(?:x|y)/;
>
> This has relative runtime for $p = 1 .. 10: 3, 15, 84, 495,
> 3003, 18564, 116280, 735471, 4686825, 30045015.

the n is number of possible matches.

The first possible match is tried n times, but is only allowed to
succeed on the very first pass. Then it gets failed n-1 times.
The last possible match succeeds once, then is failed once, then
the loop finally terminates.

So n(n-1) forced failures are necessary to n successful matches,
i.e.  n + n(n-1) which is exactly n².

For (mostly) unambiguous rewrite rules this won’t be cause for
much trouble, but with degenerate ones it will bog down quickly.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
0
pagaltzis
11/9/2010 6:52:06 PM
Reply:

Similar Artilces:

Can we perform substitution to the matched pattern inside a regular expression so that the modified pattern gets returned instead of earlier matched one ?
--00c09f8e5d3265388a048572153c Content-Type: text/plain; charset=ISO-8859-1 Hello everybody, Can we perform substitution to the matched pattern inside a regular expression so that the modified pattern gets returned instead of earlier matched one ? As a reference, in the following code below, I want to perform the substitution of "~" character with "_" character to the value of "\3" inside a regular expression so that $3 ultimately becomes "are___you___fine?" instead of "are~~~you~~~fine?". I tried checking with the perl docs b...

match not matching
Can somebody help me understand this? Given this loop, and the logged output following ... my $found; for( @$products ) {; $found = $$_ =~ m|$project|; $dump = Data::Dumper->Dump([$_, $project, $$_, $found]); $logger->trace(qq(dump=$dump)); } I can't explain why $found is not true on the 3rd pass. Does this have something to do with the way I'm dereferencing the blessed object? SVN.Hooks - dump=$VAR1 = bless( do{\(my $o = 'FS1100-S1')}, 'RPC::XML::string' ); $VAR2 = 'STABLE/FS1100-S3/RSLOGIX_5000/'; $VAR3 = 'FS1100-S1'...

Regular expression matching
I have the need to generate and match a large number (Typically 50 - 200) of regular expressions. Each regular expression should be written in a subset of the perl regular expression syntax, or another suitable syntax. I have not committed to a specific subset yet. The patterns are not allowed to contain recursive patterns, or interpolated expressions. The input patterns should be processes into new patterns according to the following rules: Each pattern will be wrapped in a capturing paranthesis pair. Nested paranthesises should be turned into non capturing paranthesises. Th...

Regular expression match
I am trying to find a way to allow access to a URL through the verification of an LDAP attribute. I don't care what value the attribute has as it is an aux class extension to the user object. I just want to know if the attribute exists. If I can't find that, then I'll settle for the LDAP Attribute has an assigned value. I can't seem to find a way. I'm thinking I should be able to Regular Expression: Match the LDAP Attribute for any value. I don't know if I have the Reg-Ex Match syntax correct. I am using \d\D for any 0-9 and a-z character. This s...

regular expression match?
Hello, I am fairly new to perl and am having problems matching a specific character. I want to be able to match the "_" character at the end of a string. For example, asume I have an array of strings with the following data: $array{0} = "clk_n" $array{1} = "test_in_6_" $array{2} = "clk_out" I want to create a regular expression which will match $array{1} and not the others in the array. This is what I have so far. foreach (@array) { if(m/$\w+\_/) { print "Matched, $_\n"; } } This isn't giving the resul...

[perl6/specs] 8efa3a: Correct Match.start/Match.end -> Match.from/Match....
----==_mimepart_51421864a530e_53794451388849a Date: Thu, 14 Mar 2013 11:35:16 -0700 Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-ID: <51421864aceb8_537944513888528@hookshot-fe6-pe1-prd.aws.github.net.mail> Branch: refs/heads/master Home: https://github.com/perl6/specs Commit: 8efa3ac10649401dbbcb6488975bc65d6b501326 https://github.com/perl6/specs/commit/8efa3ac10649401dbbcb6488975bc65d6b501326 Author: Rob Hoelz <rob@hoelz.ro> Date: 2013-03-14 (Thu, 14 Mar 2013) Changed paths: M S28-spe...

Regular Expression: Need little bit help with multiple word match expressions
Guys, I am trying to write few regular expressions and need little bit of assistance. Example Text: Finally, there is a selection of patterns. Both Words "Finally" and "selection" must exist. Even if I switch the words around the match needs to happen. Otherwise the match must fail. Please advise, Thanks Shrini Try this: finally.*selection|selection.*finally Regards Mike [MVP - ASP/ASP.NET]My site...

Complex Regular Expression Matching
Hello, I have a regular expression that is used to take data from a table. while($innerTable =~ m/\<TR\> \<TD\> \<A HREF=".*?" target="dictionary" TITLE=".*?"\> (.*?)\<\/A\> \<\/TD\> \<TD\> (.*?) \<\/TD\> \<TD\> (.*?) \<\/TD\> \<TD\>(\s*\<A TITLE=".*?" HREF=".*?" TARGET="dictionary"\> (.*?)\<\/A\>=[0-9]+\s*(\<\/A\>)*\s*)*\<\/TD\>/gs) { print "$1 $2 $3 $5\n"; } This matches the pattern fine and is pr...

Regular Expression to match patterns
Hi, My experience in Regex coding is very less. I have been assigned to clean up some data to make it meaningful. I got this data from SQL server: 10/8/2007 5:03:06 PM thakurab  *************RECIVED MAIL   PQR Start Time: 10/08/07 16:29:09   New PQR State: Assigned  Respond Due Date: 10/09/07 00:29:09  The pattern in this case is pretty simple. After date/time is the name and the code has to split and divide this data in a new line. Thus my renewed data will be: 10/8/2007 5:03:06 PM thakurab  *************RECIVED MAIL   PQR Start Time: 10/08/...

Generating and matching regular expressions.
I have the need to generate a large number of regular expressions from a set of specifications. The specification consists of regular expressions in another syntax. They must be transformed into perl regular expressions according to the following rules: All nested parenthesizes should be non-capturing. The pattern matches only as a prefix of the string being matched. The entire pattern is capturing. In addition, I need to separate the string being matched into two parts: The matching prefix, and rest of string. Thankful for any ideas on how to do this. All patterns used in match...

Help matching with Regular expression
How do I match a line ending in ,@@) using a regular expression? Thank you Jim wrote: > > How do I match a line ending in > ,@@) > using a regular expression? my $line = "1234567890,@@)\n"; if ($line =~ m/,@@\)$/) { print "OK\n"; } HTH, Rob Jim wrote: > How do I match a line ending in > ,@@) > using a regular expression? TIMTOWTDI: /,\@\@\)$/ -- Ruud Rob Dixon wrote: > Jim wrote: >> How do I match a line ending in >> ,@@) >> using a regular expression? > > my $line = "...

regular expression match question
------_=_NextPart_001_01C59468.04B6A917 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable A script like this: line1: $string3 =3D "bacdeabcdefghijklabcdeabcdefghijkl"; line2: $string4 =3D "xxyyzzbatttvv"; line3: print "\$1 =3D $1 \@{$-[0],$+[0]}, \$& =3D $&\n" if($string3 =3D~ /(a|b)*/); line4: print "\$1 =3D $1 \@{$-[0],$+[0]}, \$& =3D $&\n" if($string4 =3D~ //); Run and gett result: $1 =3D a @{0,2}, $& =3D ba $1 =3D @{0,0}, $& =3D=20 If I chan...

regular expression match problem
------_=_NextPart_001_01C5D9B7.6C691794 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable This is a segment of code to do string search: my $email =3D "\pineyan"; my $name =3D "\\pine"; if($email =3D~ /($name)/) { print "Found my name: $name!\n"; } and I got the following error when running: Can't find unicode character property definition via main->i or i.pl at unicode/Is/i.pl line 0 How can I get this match work? Sincerely Pine ------_=_NextPart_001_01C5D9B7.6C6...

howto simplfy this regex if ($string =~/^$match | $match | $match$/g){
------=_NextPart_000_004C_01C8ADE9.8884D110 Content-Type: text/plain; charset="gb2312" Content-Transfer-Encoding: quoted-printable Hi, How do I simplify the regex below so that it matches only the number 1, = henceforth it should return false if I match $string with $match. use strict; use warnings; my $string =3D "10 11 12 13 40"; my $match =3D 1; if ($string =3D~/^$match | $match | $match$/g){ print "match"; }else{ print "unmatch"; }; ------=_NextPart_000_004C_01C8ADE9.8884D110-- On Sun, May 4, 2008 at 1:19 AM, &l...

Web resources about - Regular expression substitution matching all possible matches - perl.perl5.porters

Import substitution industrialization - Wikipedia, the free encyclopedia
ISI is based on the premise that a country should attempt to reduce its foreign dependency through the local production of industrialized products. ...

Substitutions on the App Store on iTunes
Get Substitutions on the App Store. See screenshots and ratings, and read customer reviews.

Sensory substitution for the blind: a walk in the garden wearing The vOICe - YouTube
The vOICe gives blind people live detailed visual information about their environment: the live view of a head-mounted camera is scanned from ...

NRL 2016 interchange rule change: The players that will struggle and excel with reduced substitutions ...
THE NRL is planning to drop the number of interchanges and next season we&#8217;ll get the first taste, with substitutions reduced from 10 to ...

Healthy food substitutions: What cereal should I eat?
NOT every diet change has to feel like throwing your kitchen, eating habits, and sanity upside down. The following food swaps can make you instantly ...

Socceroos 1-0 Iraq: Holger Osieck books World Cup trip with perfect substitutions
Socceroos coach Holger Osieck says qualifying for the World Cup is a beautiful feeling and vindication of his methods.

Going Virtual: Technology Not Seen as Substitution for Touch
Online communication has become ubiquitous as more agencies and clients scale back on travel, but is something lost when business is done virtually? ...

Cooking Reference App Substitutions Goes 4.0 For iPad, iCloud And iOS 7
You may never find the need to find a substitute for Substitutions , seeing as the app has just received what is perhaps its biggest update yet. ...

Video: There's no substitution for BMW's E30 Mmmmm3
Filed under: Classics , Coupe , Performance , Videos , BMW , Luxury If you told someone that you had a friend with a four-cylinder BMW who "would ...

Facebook Will Reap Substitution of TV by Digital Ads, Says Cantor
Cantor Fitzgerald‘s Youssef Squali today reiterated his Buy rating on shares of Facebook (FB), and raising his price target to $100 from $92, ...

Resources last updated: 1/15/2016 10:48:03 AM