How to handle unicode strings in utf8 and pre-utf8 pragma perls

Hello

I'd be grateful if someone could help me with this, as I know very little
about Unicode.

I currently have Unicode data stored as bytes, or escaped depending on the
perl version, e.g.:

  my @day_names;

  if ( $] >= 5.006 )
  {
    @day_names =
    (
      "\x{0414}\x{04af}\x{0439}\x{0448}\x{04e9}\x{043c}\x{0431}\x{04af}",
      "\x{0428}\x{0435}\x{0439}\x{0448}\x{0435}\x{043c}\x{0431}\x{0438}",
      "\x{0428}\x{0430}\x{0440}\x{0448}\x{0435}\x{043c}\x{0431}\x{0438}",
      "\x{0411}\x{0435}\x{0439}\x{0448}\x{0435}\x{043c}\x{0431}\x{0438}",
      "\x{0416}\x{0443}\x{043c}\x{0430}",
      "\x{0418}\x{0448}\x{0435}\x{043c}\x{0431}\x{0438}",
      "\x{0416}\x{0435}\x{043a}\x{0448}\x{0435}\x{043c}\x{0431}\x{0438}"
    );
  }
  else
  {
    @day_names =
    (
      'Дүйшөмбү',
      'Шейшемби',
      'Шаршемби',
      'Бейшемби',
      'Жума',
      'Ишемби',
      'Жекшемби'
    );
  }

What I would really like to do is avoid this duplication by using byte
representations only and flagging them as Unicode if perl 5.006 or better
is used.

Conceptually something like:

  use utf8 if $] >= 5.006;    # Yes, I know this won't even compile in
                              # reality :)

  my @day_names =
  (
    'Дүйшөмбү',
    'Шейшемби',
    'Шаршемби',
    'Бейшемби',
    'Жума',
    'Ишемби',
    'Жекшемби'
  );

Of course that won't work, but that's the kind of thing I'm aiming for.

So, a couple of questions:

1) Does what I'm trying to do make sense?
2) Is there an easy way of doing it?

Any help would be really appreciated.

Cheers,
Rich
-- 
Richard Evans
scriptyrich@yahoo.co.uk
scriptyrich
5/31/2003 1:33:28 AM

I can't help you on the important questions, but

On Sat, May 31, 2003 at 01:33:28AM +0000, Richard Evans wrote:

> Conceptually something like:
> 
>   use utf8 if $] >= 5.006;    # Yes, I know this won't even compile in
>                               # reality :)

use if $] >= 5.006, utf8;

On CPAN as http://search.cpan.org/author/ILYAZ/if-0.01000001/
In the core since 5.8.0
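
A slightly fuller sketch of the same idea, under a couple of assumptions:
the source file itself is saved as UTF-8, and the module name is quoted
so that the line also compiles under "use strict". The two names are
taken from the list in the original post; $len is only there for
illustration.

```perl
use strict;

# Load the utf8 pragma only where it exists (perl 5.6 and later).
# Quoting the module name avoids a bareword error under "use strict".
use if $] >= 5.006, 'utf8';

my @day_names = ( 'Жума', 'Ишемби' );

# With the pragma active each name is a character string, so length()
# counts characters; without it, length() counts the UTF-8 bytes.
my $len = length $day_names[0];
print "length of first name: $len\n";
```

On a perl older than 5.8.0 this still needs if.pm installed from CPAN.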

Nicholas Clark
nick
5/31/2003 7:59:35 AM
If I understand Nicholas Clark's suggestion, it would mean that for any 
perl version prior to 5.8.0, the script won't compile unless "if.pm" 
has been installed from CPAN.

The fact that "if.pm" exists and is usable on older perl5 versions is
really good news, but it still might be a hurdle for some users who
depend on remote web-server sys-admins (or other uncontrollable forces)
for perl support...

In any case, one work-around for handling utf8 text in a version-neutral 
way would be to store this text in a file, not hard-coded into the perl 
script; then decide how to read the file, depending on the version; e.g.

 open( DAYS, "day_names.utf8" );
 binmode( DAYS, ":utf8" ) if ( $] >= 5.008 );
 @day_names = <DAYS>;
 close DAYS;

Depending on what you do with the data elsewhere in your script, I'm not
sure whether 5.6 will treat the data as utf8 characters when read from 
a file like this (5.6 does not support "binmode FH, ':utf8'"), but
there's a good chance that it will work.
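
The same approach can be wrapped up with error checking; this is only a
sketch, and the sub name read_day_names is made up for illustration. It
assumes one name per line in a UTF-8 encoded file.

```perl
use strict;

# Hypothetical helper: read one name per line from a UTF-8 text file.
sub read_day_names {
    my ($path) = @_;
    local *DAYS;    # keep the bareword handle private to this call
    open( DAYS, "< $path" ) or die "Cannot open $path: $!";

    # PerlIO layers arrived with 5.8; older perls read the raw bytes.
    binmode( DAYS, ':utf8' ) if $] >= 5.008;

    my @names = <DAYS>;
    close DAYS;
    chomp @names;    # strip the newline left on each line
    return @names;
}
```

Usage would be something like: my @day_names = read_day_names("day_names.utf8");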

You can also attach this text content at the end of your script, in a 
__DATA__ segment, and set DATA as the file handle in the code sample 
shown above (rather than DAYS).

Of course even using __DATA__, it can get tedious and hard to maintain
if you have a lot of little string constants scattered throughout.

	Dave G.

(P.S.: for some reason, three of the characters in your first string
didn't map to proper Cyrillic code points for me: \u04e9 and the two
occurrences of \u04af -- I don't know the language, but were those 
typos?)

graff
5/31/2003 9:48:52 AM
David Graff wrote:

> If I understand Nicholas Clark's suggestion, it would mean that for any
> perl version prior to 5.8.0, the script won't compile unless "if.pm"
> has been installed from CPAN.
> 
> The fact that "if.pm" exists and is usable on older perl5 versions is
> really good news, but it still might be a hurdle for some users who
> depend on remote web-server sys-admins (or other uncontrollable forces)
> for perl support...

But as the modules I'm writing are not core perl modules, they'd have to be
installed anyway - I guess that's a problem whatever way I do it.

I've got to say if.pm looks like a brilliantly simple way of handling my
problem.

> In any case, one work-around for handling utf8 text in a version-neutral
> way would be to store this text in a file, not hard-coded into the perl
> script; then decide how to read the file, depending on the version; e.g.
> 
>  open( DAYS, "day_names.utf8" );
>  binmode( DAYS, ":utf8" ) if ( $] >= 5.008 );
>  @day_names = <DAYS>;
>  close DAYS;
> 
> Depending on what you do with the data elsewhere in your script, I'm not
> sure whether 5.6 will treat the data as utf8 characters when read from
> a file like this (5.6 does not support "binmode FH, ':utf8'"), but
> there's a good chance that it will work.
> 
> You can also attach this text content at the end of your script, in a
> __DATA__ segment, and set DATA as the file handle in the code sample
> shown above (rather than DAYS).
> 
> Of course even using __DATA__, it can get tedious and hard to maintain
> if you have a lot of little string constants scattered throughout.

Thanks - these are useful ideas which I'll use in some other modules I'm
doing, but if.pm just feels right for what I'm trying to do ATM.

> (P.S.: for some reason, three of the characters in your first string
> didn't map to proper Cyrillic code points for me: \u04e9 and the two
> occurrences of \u04af -- I don't know the language, but were those
> typos?)

Ah, I picked the example at random - I'm using data from the OpenI18N/ICU
locales, and looking at the Kirghiz locale using the IBM ICU
LocaleExplorer:

  http://oss.software.ibm.com/cgi-bin/icu/lx/en/utf-8/?_=ky

I see the same result - it also says:

"Note: You're viewing an experimental locale. This locale is not part of the
official ICU installation! Please do not file bugs against this locale"

At the top, so who knows!
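
The two code points can also be looked up from perl itself. A minimal
sketch, assuming a perl recent enough to provide charnames::viacode
(5.8, as far as I know):

```perl
use strict;
use charnames ();    # load the name tables without enabling \N{...}

# Print the Unicode character name for each code point in question.
for my $code ( 0x04e9, 0x04af ) {
    my $name = charnames::viacode($code);
    printf "U+%04X %s\n", $code, defined $name ? $name : '(no name found)';
}
```

If the names come back as Cyrillic letters, the data is probably fine
and the display font is the likelier culprit.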

I hate having to use languages that I don't understand and, based off
feedback so far, there are problems with the ICU data as it stands. 

But I suppose a "comprehensive" set of locale date modules consisting of
English and basic French wouldn't be quite so useful ;->

Thanks for the feedback,
-- 
Richard Evans
scriptyrich@yahoo.co.uk
scriptyrich
5/31/2003 1:46:44 PM
Nicholas Clark wrote:

> I can't help you on the important questions, but
> 
> On Sat, May 31, 2003 at 01:33:28AM +0000, Richard Evans wrote:
> 
>> Conceptually something like:
>> 
>>   use utf8 if $] >= 5.006;    # Yes, I know this won't even compile in
>>                               # reality :)
> 
> use if $] >= 5.006, utf8;
> 
> On CPAN as http://search.cpan.org/author/ILYAZ/if-0.01000001/
> In the core since 5.8.0
> 
> Nicholas Clark

Good point - I forgot all about if.pm, and it looks like that might be the
perfect solution.

Thanks for your help,
Rich
-- 
Richard Evans
scriptyrich@yahoo.co.uk
scriptyrich
5/31/2003 1:57:37 PM