? should interpolating a utf8-encoded string preserve utf8ness?

Consider

my $s = 's';
utf8::upgrade($s);
my $b = ":$s:";

$b isn't in utf8.  Should it be?  I suppose one can argue that it shouldn't
matter externally.
public
12/13/2010 5:24:25 PM
perl.perl5.porters

karl williamson wrote:
>Subject: ? should interpolating a utf8-encoded string preserve utf8ness?

Interpolation should have the freedom to do whatever is more convenient.
If the programmer cares about the ultimate encoding of the string,
ey should explicitly upgrade or downgrade the resulting string.

-zefram
zefram
12/13/2010 5:27:15 PM
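
Zefram's suggestion of normalizing the result explicitly can be sketched as follows. This is a minimal illustration, not code from the thread: the helper names `as_upgraded` and `as_downgraded` are hypothetical, and only the core `utf8::upgrade`, `utf8::downgrade`, and `utf8::is_utf8` functions are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy stored in Perl's internal UTF-8 form.
sub as_upgraded {
    my ($str) = @_;
    utf8::upgrade($str);
    return $str;
}

# Hypothetical helper: return a copy in the byte (non-UTF8) representation.
# utf8::downgrade dies if the string holds characters above 0xFF,
# unless a true second argument is passed.
sub as_downgraded {
    my ($str) = @_;
    utf8::downgrade($str);
    return $str;
}

my $s = 's';
utf8::upgrade($s);
my $b = ":$s:";    # internal representation is whatever interpolation chose

my $up   = as_upgraded($b);
my $down = as_downgraded($b);

printf "up:   utf8 flag %s\n", utf8::is_utf8($up)   ? 'on' : 'off';
printf "down: utf8 flag %s\n", utf8::is_utf8($down) ? 'on' : 'off';
print  "same characters either way\n" if $up eq $down;
```

Both copies compare equal as strings; only the internal storage differs, which is why code that cares about the representation has to ask for it explicitly.
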
On Mon, Dec 13, 2010 at 10:24:25AM -0700, karl williamson wrote:
> Consider
>
> my $s = 's';
> utf8::upgrade($s);
> my $b = ":$s:";
>
> $b isn't in utf8.  Should it?  I suppose one can argue that it shouldn't  
> matter externally.


Ideally, for the average Perl programmer it should not matter.
Unfortunately, it does. I suggest that the documentation says it's
undefined, but what is done is the least surprising for the user. I
think having $b in utf8 format is the least surprising, but I've no data,
not even anecdotal, to back that up.



Abigail
abigail
12/13/2010 5:43:32 PM
On Mon, 13 Dec 2010, karl williamson wrote:
> 
> Consider
> 
> my $s = 's';
> utf8::upgrade($s);
> my $b = ":$s:";
> 
> $b isn't in utf8.  Should it?  I suppose one can argue that it shouldn't
> matter externally.

It isn't?

$ perl -MDevel::Peek -e '$a="s";Dump $a;utf8::upgrade($a);Dump $a;$b = ":$a:";Dump $a;Dump $b'
SV = PV(0x801838) at 0x825570
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x201c00 "s"\0
  CUR = 1
  LEN = 16
SV = PV(0x801838) at 0x825570
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x201c00 "s"\0 [UTF8 "s"]
  CUR = 1
  LEN = 16
SV = PV(0x801838) at 0x825570
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x201c00 "s"\0 [UTF8 "s"]
  CUR = 1
  LEN = 16
SV = PV(0x8018d0) at 0x8255c0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x2049c0 ":s:"\0 [UTF8 ":s:"]	
  CUR = 3
  LEN = 16

Tested with 5.12 and blead.  What am I missing?

Cheers,
-Jan


jand
12/13/2010 7:48:39 PM
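
The same check can be made without Devel::Peek by asking `utf8::is_utf8` directly. A minimal sketch, assuming only core functions; the result for the interpolated string depends on the perl being run (the thread reports the flag set on 5.12 and blead), so no expected output is shown.

```perl
use strict;
use warnings;

my $s = 's';
printf "before upgrade: %s\n", utf8::is_utf8($s) ? 'UTF8' : 'bytes';

utf8::upgrade($s);    # changes the internal representation, not the characters
printf "after upgrade:  %s\n", utf8::is_utf8($s) ? 'UTF8' : 'bytes';

my $b = ":$s:";       # interpolate the upgraded string
printf "interpolated:   %s\n", utf8::is_utf8($b) ? 'UTF8' : 'bytes';
```
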
On Mon, Dec 13, 2010 at 12:24 PM, karl williamson <public@khwilliamson.com> wrote:

> Consider
>
> my $s = 's';
> utf8::upgrade($s);
> my $b = ":$s:";
>
> $b isn't in utf8.  Should it?


If you "fix" this, you should fix "use utf8;" to only produce UTF8=1 strings
from literals.

Personally, I don't think either should be changed. I can understand
avoiding the addition of optimising downgrades where there weren't any
before to avoid breaking others' bad code, but there's no reason to remove
existing optimisations.

That said, 5.12.2 does set UTF8=1 on $b, so this may be moot.

$ perl -MDevel::Peek -e'my $s = "s"; utf8::upgrade($s); my $b = ":$s:"; Dump($b)'
SV = PV(0x816a0d8) at 0x817bc78
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x81842f0 ":s:"\0 [UTF8 ":s:"]
  CUR = 3
  LEN = 4

- Eric

ikegami
12/13/2010 7:53:40 PM
Jan Dubois wrote:
> On Mon, 13 Dec 2010, karl williamson wrote:
>> Consider
>>
>> my $s = 's';
>> utf8::upgrade($s);
>> my $b = ":$s:";
>>
>> $b isn't in utf8.  Should it?  I suppose one can argue that it shouldn't
>> matter externally.
> 
> It isn't?
> 
> $ perl -MDevel::Peek -e '$a="s";Dump $a;utf8::upgrade($a);Dump $a;$b = ":$a:";Dump $a;Dump $b'
> SV = PV(0x801838) at 0x825570
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK)
>   PV = 0x201c00 "s"\0
>   CUR = 1
>   LEN = 16
> SV = PV(0x801838) at 0x825570
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x201c00 "s"\0 [UTF8 "s"]
>   CUR = 1
>   LEN = 16
> SV = PV(0x801838) at 0x825570
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x201c00 "s"\0 [UTF8 "s"]
>   CUR = 1
>   LEN = 16
> SV = PV(0x8018d0) at 0x8255c0
>   REFCNT = 1
>   FLAGS = (PADMY,POK,pPOK,UTF8)
>   PV = 0x2049c0 ":s:"\0 [UTF8 ":s:"]	
>   CUR = 3
>   LEN = 16
> 
> Tested with 5.12 and blead.  What am I missing?
> 
> Cheers,
> -Jan
> 
> 

My apologies for oversimplifying without checking everything.
The non-utf8 string is actually generated via:
qq[":$b:" =~ /:[_$a]:/i]
where $b has the utf8 bit set, but contains only ASCII.  The ":$b:" part
of it doesn't have the utf8 bit set.  After staring at this for a while,
I can see why this would not preserve utf8ness, but it wasn't obvious to
me.  I'm sorry.
public
12/14/2010 12:15:40 AM