pulling out "a","an", "the" from beginning of strings

I need to pull out articles "a", "an", and "the" from the beginning of 
title strings so that they sort properly in MySQL.  What is the best way 
to accomplish that if I have a single $scalar with the whole title in it?

Thanks,
Tim

-- 
Tim McGeary
tim.mcgeary@lehigh.edu


0
tmm8
8/24/2004 2:04:47 PM

On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> I need to pull out articles "a", "an", and "the" from the beginning of
> title strings so that they sort properly in MySQL.  What is the best way
> to accomplish that if I have a single $scalar with the whole title in it?

I would go with substitutions:

$scalar =~ s/^(?:a|an|the)//i;

> Thanks,
> Tim
>
> --
> Tim McGeary
> tim.mcgeary@lehigh.edu
--
José Alves de Castro <cog@cpan.org>
  http://natura.di.uminho.pt/~jac


0
jcastro
8/24/2004 2:10:36 PM
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> 
>>I need to pull out articles "a", "an", and "the" from the beginning of 
>>title strings so that they sort properly in MySQL.  What is the best way 
>>to accomplish that if I have a single $scalar with the whole title in it?
> 
> 
> I would go with substitutions:
> 
> $scalar =~ s/^(?:a|an|the)//i;

So that I am understanding this process, what does each part mean?  I 
assume that the ^ means beginning of the variable... is that correct? 
What about "(?:" ?

tyia,
Tim

0
tmm8
8/24/2004 2:16:58 PM
On Tue, 2004-08-24 at 15:16, Tim McGeary wrote:
> Jose Alves de Castro wrote:
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> >
> >>I need to pull out articles "a", "an", and "the" from the beginning of
> >>title strings so that they sort properly in MySQL.  What is the best way
> >>to accomplish that if I have a single $scalar with the whole title in it?
> >
> >
> > I would go with substitutions:
> >
> > $scalar =~ s/^(?:a|an|the)//i;
>
> So that I am understanding this process, what does each part mean?  I
> assume that the ^ means beginning of the variable... is that correct?
> What about "(?:" ?

The ^ means the beginning of the string in $scalar, indeed.

As for the rest, I decided to group "a", "an" and "the" with brackets,
or otherwise the regex would have been /^a|^an|^the/

Regarding the ?: , that's just so variable $1 doesn't end up with
whatever was removed, as there was no need for that.

Search for "Non-capturing groupings" under perldoc perlretut, if you
need more information
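
To make the difference between the two kinds of grouping concrete, here is a minimal sketch. The sample title is made up for illustration, and the code relies on the fact that a match in list context returns the captured substrings, or the list (1) when the pattern succeeds but has no capturing groups.

```perl
use strict;
use warnings;

# Hypothetical sample title, chosen only for illustration.
my $title = "The Widget Primer";

# Capturing group: the matched article is saved in $1.
my $captured;
if ($title =~ /^(a|an|the)/i) {
    $captured = $1;
}

# Non-capturing group: same alternation, but nothing is saved.
# In list context a successful match with no captures returns (1).
my @matches = ($title =~ /^(?:a|an|the)/i);

print "captured article: $captured\n";
print "non-capturing match returned: @matches\n";
```

When the matched text is simply thrown away, as in the substitution above, the non-capturing form avoids setting $1 for no reason.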

> tyia,
> Tim

HTH, :-)

jac

--
José Alves de Castro <cog@cpan.org>
  http://natura.di.uminho.pt/~jac


0
jcastro
8/24/2004 2:18:35 PM
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:16, Tim McGeary wrote:
> 
>>Jose Alves de Castro wrote:
>>
>>>On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
>>>
>>>
>>>>I need to pull out articles "a", "an", and "the" from the beginning of 
>>>>title strings so that they sort properly in MySQL.  What is the best way 
>>>>to accomplish that if I have a single $scalar with the whole title in it?
>>>
>>>
>>>I would go with substitutions:
>>>
>>>$scalar =~ s/^(?:a|an|the)//i;
>>
>>So that I am understanding this process, what does each part mean?  I 
>>assume that the ^ means beginning of the variable... is that correct? 
>>What about "(?:" ?
> 
> 
> The ^ means the beginning of the string in $scalar, indeed.
> 
> As for the rest, I decided to group "a", "an" and "the" with brackets,
> or otherwise the regex would have been /^a|^an|^the/
> 
> Regarding the ?: , that's just so variable $1 doesn't end up with
> whatever was removed, as there was no need for that.
> 
> Search for "Non-capturing groupings" under perldoc perlretut, if you
> need more information

Great!  Thank you very much!  :)

Tim

0
tmm8
8/24/2004 2:23:55 PM
On Tue, 2004-08-24 at 15:39, Chris Devers wrote:
> On Tue, 24 Aug 2004, Jose Alves de Castro wrote:
>
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> >> I need to pull out articles "a", "an", and "the" from the beginning of
> >> title strings so that they sort properly in MySQL.  What is the best way
> >> to accomplish that if I have a single $scalar with the whole title in it?
> >
> > I would go with substitutions:
> >
> > $scalar =~ s/^(?:a|an|the)//i;
>
> Why not save the data for later by moving the article to the end?
>
>      $scalar =~ s/^(a|an|the)\s+(.*)/$2, $1/i;
>
> That way, "A Tale of Two Cities" should become "Tale of Two Cities, A",
> and if you have to reconstitute the original title later, you haven't
> thrown anything away...

I second this :-)

> --
> Chris Devers
--
José Alves de Castro <cog@cpan.org>
  http://natura.di.uminho.pt/~jac


0
jcastro
8/24/2004 2:36:33 PM
On Tue, 24 Aug 2004, Jose Alves de Castro wrote:

> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
>> I need to pull out articles "a", "an", and "the" from the beginning of
>> title strings so that they sort properly in MySQL.  What is the best way
>> to accomplish that if I have a single $scalar with the whole title in it?
>
> I would go with substitutions:
>
> $scalar =~ s/^(?:a|an|the)//i;

Why not save the data for later by moving the article to the end?

     $scalar =~ s/^(a|an|the)\s+(.*)/$2, $1/i;

That way, "A Tale of Two Cities" should become "Tale of Two Cities, A", 
and if you have to reconstitute the original title later, you haven't 
thrown anything away...
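
A minimal sketch of the move-the-article idea follows. The helper name move_article is hypothetical and not from the thread. Note that the article group must be a capturing group here: the replacement uses both $1 (the article) and $2 (the rest of the title), and a (?:...) group would leave only one capture.

```perl
use strict;
use warnings;

# move_article is a hypothetical helper name used only for this sketch.
sub move_article {
    my ($title) = @_;
    # $1 = leading article, $2 = rest of the title.
    # \s+ requires whitespace after the article, so words that merely
    # start with "a", "an" or "the" are left alone.
    $title =~ s/^(a|an|the)\s+(.*)/$2, $1/i;
    return $title;
}

print move_article("A Tale of Two Cities"), "\n";
print move_article("The Widget Primer"), "\n";
print move_article("Analyzing Widgets"), "\n";
```

Because the article keeps its original capitalization, the original title can be rebuilt later by moving the text after the last ", " back to the front.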




-- 
Chris Devers
0
cdevers
8/24/2004 2:39:29 PM
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > I need to pull out articles "a", "an", and "the" from the beginning
> > of title strings so that they sort properly in MySQL.  What is the
> > best way to accomplish that if I have a single $scalar with the
> > whole title in it? 
> 
> I would go with substitutions:
> 
> $scalar =~ s/^(?:a|an|the)//i;

Two problems:

1. This doesn't remove just the whole words; it removes parts of words as
well. i.e. "Analyzing Widgets" would become "alyzing Widgets"

2. It doesn't remove whitespace after the word, so "The Widget Primer"
becomes " Widget Primer", which won't sort with the w's, due to the leading
blank.

Perhaps:

   $scalar =~ s/^(a|an|the)\s*\b//i;

would work better.
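
To compare the two substitutions side by side, here is a small sketch. The helper names strip_naive and strip_revised and the sample titles are assumptions made for illustration; the patterns themselves are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper names wrapping the two patterns from the thread.
sub strip_naive {
    my ($title) = @_;
    $title =~ s/^(?:a|an|the)//i;      # original suggestion
    return $title;
}

sub strip_revised {
    my ($title) = @_;
    $title =~ s/^(a|an|the)\s*\b//i;   # whole words only, eats the whitespace
    return $title;
}

# Made-up sample titles.
for my $title ("The Widget Primer", "Analyzing Widgets", "An Apple a Day") {
    printf "%-20s naive=[%s] revised=[%s]\n",
        $title, strip_naive($title), strip_revised($title);
}
```

The brackets in the output make a leftover leading blank easy to spot.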
0
Bob_Showalter
8/24/2004 3:19:38 PM
On Tue, 2004-08-24 at 16:19, Bob Showalter wrote:
> Jose Alves de Castro wrote:
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > I need to pull out articles "a", "an", and "the" from the beginning
> > > of title strings so that they sort properly in MySQL.  What is the
> > > best way to accomplish that if I have a single $scalar with the
> > > whole title in it?
> >
> > I would go with substitutions:
> >
> > $scalar =~ s/^(?:a|an|the)//i;
>
> Two problems:
>
> 1. This doesn't remove just the whole words; it removes parts of words as
> well. i.e. "Analyzing Widgets" would become "alyzing Widgets"
>
> 2. It doesn't remove whitespace after the word, so "The Widget Primer"
> becomes " Widget Primer", which won't sort with the w's, due to the leading
> blank.
>
> Perhaps:
>
>    $scalar =~ s/^(a|an|the)\s*\b//i;
>
> would work better.

You're absolutely right. I think this is a sign that I need to go out,
eat and drink something, breathe some fresh air, etc.

--
José Alves de Castro <cog@cpan.org>
  http://natura.di.uminho.pt/~jac


0
jcastro
8/24/2004 3:20:48 PM
Bob Showalter wrote:
> Jose Alves de Castro wrote:
> 
>>On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
>>
>>>I need to pull out articles "a", "an", and "the" from the beginning
>>>of title strings so that they sort properly in MySQL.  What is the
>>>best way to accomplish that if I have a single $scalar with the
>>>whole title in it? 
>>
>>I would go with substitutions:
>>
>>$scalar =~ s/^(?:a|an|the)//i;
> 
> Two problems:
> 
> 1. This doesn't remove just the whole words; it removes parts of words as
> well. i.e. "Analyzing Widgets" would become "alyzing Widgets"

Actually it would become "nalyzing Widgets" because 'a' is the first 
alternative.  :-)


John
-- 
use Perl;
program
fulfillment
0
krahnj
8/24/2004 9:17:01 PM
Errin Larsen wrote:
> Hey,
> 
> Ok, looking through this ... I'm confused.
> 
> << SNIP >>
> 
> > > 
> > > Perhaps:
> > > 
> > >    $scalar =~ s/^(a|an|the)\s*\b//i;
> > > 
> > > would work better.
> 
> <<SNIP>>
> 
> Is this capturing into $1 the a|an|the (yes)

Yes, but that's only a side effect. I'm not doing anything with $1.

> and the rest of the title
> into $2 (no?).

No.

>  After doing so, will it reverse the two ( i.e.
> s/^(a|an|the)\s+(.*)\b/$2, $1/i )?  

No.

> Also, what is the "\b"?

A word boundary assertion. See perldoc perlre.

>  it seems
> that the trailing "i" is for ignoring case; is that correct?

Yes.

It's not concerned with capturing anything; it's just matching a pattern and
then replacing the text matched with an empty string. The parens are used to
delimit the alternation a|an|the.

What I'm trying to match is:

   ^           beginning of the string, followed by
   (a|an|the)  one of these sequences, followed by
   \s*         any amount of whitespace, followed by
   \b          a word boundary (see perldoc perlre)

The \s* is there so the whitespace following the leading word ("a", "an", or
"the") will be removed along with the word. The \b ensures that the end of
what we match is either at the start of a new word or at the end of the
string.

If I left off the \b, it would match the "a" in "acme", since \s* can match
the zero-length string between the "a" and the "c". With \b in there, the
match fails, because \b will not match at the "c", since it's not a word
boundary.

An alternative to \s*\b would be \s+ (i.e. match at least one whitespace
char). However, this won't match a single word title like "the", because \s+
doesn't match at the end of the string, while \s*\b does. (How such a title
should be handled is up to the OP; if it should be left alone, then \s+
would be appropriate.)
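
The difference between \s*\b and \s+ shows up only on edge cases, so a short sketch may help. The helper names strip_with_boundary and strip_with_space are hypothetical, and the sample titles are made up.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two variants discussed above.
sub strip_with_boundary {
    my ($title) = @_;
    $title =~ s/^(a|an|the)\s*\b//i;   # optional whitespace, then a word boundary
    return $title;
}

sub strip_with_space {
    my ($title) = @_;
    $title =~ s/^(a|an|the)\s+//i;     # at least one whitespace character
    return $title;
}

for my $title ("the", "acme", "The Widget Primer") {
    printf "[%s] boundary=[%s] space=[%s]\n",
        $title, strip_with_boundary($title), strip_with_space($title);
}
```

With the single-word title "the", the \s*\b variant empties the string while the \s+ variant leaves it alone; both leave "acme" untouched.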

HTH
0
Bob_Showalter
8/25/2004 12:34:24 PM
John W. Krahn wrote:
> Bob Showalter wrote:
> > Jose Alves de Castro wrote:
> > 
> > > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > 
> > > > I need to pull out articles "a", "an", and "the" from the
> > > > beginning of title strings so that they sort properly in MySQL.
> > > > What is the best way to accomplish that if I have a single
> > > > $scalar with the whole title in it?
> > > 
> > > I would go with substitutions:
> > > 
> > > $scalar =~ s/^(?:a|an|the)//i;
> > 
> > Two problems:
> > 
> > 1. This doesn't remove just the whole words; it removes parts of
> > words as well. i.e. "Analyzing Widgets" would become "alyzing
> > Widgets" 
> 
> Actually it would become "nalyzing Widgets" because 'a' is the first
> alternative.  :-)

Smarty pants :~) 

My brain said "longest, leftmost", but the short-circuiting behavior is
clearly documented in perldoc perlre. If the alternation is written as
(an?|the), the "an" is matched.
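
The ordering behavior can be checked directly. This is a minimal sketch; the variable names are made up, and the third pattern (with "an" listed before "a") is an extra variation not taken from the thread.

```perl
use strict;
use warnings;

my $title = "Analyzing Widgets";

# Alternatives are tried left to right; the first one that lets the whole
# pattern succeed wins, even if a later alternative would match more text.
my ($first_wins) = $title =~ /^(a|an|the)/i;

# A greedy optional "n" makes the single alternative take "An" when it can.
my ($optional_n) = $title =~ /^(an?|the)/i;

# Listing the longer alternative first has the same effect here.
my ($reordered) = $title =~ /^(an|a|the)/i;

print "a|an|the matched: $first_wins\n";
print "an?|the  matched: $optional_n\n";
print "an|a|the matched: $reordered\n";
```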
0
Bob_Showalter
8/25/2004 1:01:01 PM