Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?

In other words, is there any character that will make ord() return over
256 when passed in as a byte string?

For example, note the differences in output between a unicode string and
a byte string regarding character 257: as a unicode string it is 257, as
a byte string it is 196.

$ perl -C6 -le 'print "Character 257 info:";
  print "\tunicode \\x{} notation: " . sprintf(q{\x{%x}}, 257);
  print "\tOutput as Unicode string \x{101}";
  print "\tunicode string \\x{} notation ord(): " . ord("\x{101}");
  print "\tbyte string grapheme ord(): " . ord "\xc4\x81";
  print "\tbyte string literal ord(): " . ord "ā";'
Character 257 info:
	unicode \x{} notation: \x{101}
	Output as Unicode string ā
	unicode string \x{} notation ord(): 257
	byte string grapheme ord(): 196
	byte string literal ord(): 196
$

The reason this is relevant is that on a given project I am using
byte strings only, for consistency, and some encoders (e.g.
Scalar::Quote::Q()) will change from byte-string-friendly
grapheme-cluster notation (e.g. \xE3\x8A\xB7) to Unicode-string
notation (e.g. \x{32B7}), and I want to be sure I always use data that
gets me the former rather than the latter :)

TIA!

--
Dan Muey
dan (1)
10/28/2010 7:54:33 PM

Dan Muey wrote on 28.10.2010 at 14:54 (-0500):

> Am I correct in thinking that the only way to get ord() to return a
> value over 256 is to send the character as a Unicode string instead of
> a byte string?

Yes.

> In other words, is there any character that will make ord() return
> over  256 when passed in as a byte string?

If you pass a character as a byte string, then it's a byte string of 8
bits per byte, and the maximum for a byte is 255.

> For example, note the differences in output between a unicode string
> and a byte string regarding character 257, as a unicode string it is
> 257, as a byte string it is 196.

Yes.

  perl -Mutf8 -lwe 'print ord "Я"'  # 1071
  perl        -lwe 'print ord "Я"'  #  208

> The reason this is relevant is that on a given project I am using
> byte-strings-only for consistency and some encoders (i.e.
> Scalar::Quote::Q() )will change from
> bytes-string-friendly-grapheme-cluster notation (e.g. \xE3\x8A\xB7)
> to unicode-string-notation (e.g. \x{32B7}) and I want to be sure I
> always use data that gets me  the former rather than the latter :)

Well, if you don't need character operations, it might work for you.
Make sure to track whether or not your data is already encoded, and also
to use the correct encoding.

-- 
Michael Ludwig
0

milu71
10/28/2010 10:27:16 PM
On Oct 28, 2010, at 5:27 PM, Michael Ludwig wrote:

> Dan Muey wrote on 28.10.2010 at 14:54 (-0500):
>
>> Am I correct in thinking that the only way to get ord() to return a
>> value over 256 is to send the character as a Unicode string instead of
>> a byte string?
>
> Yes.
>
>> In other words, is there any character that will make ord() return
>> over  256 when passed in as a byte string?
>
> If you pass a character as a byte string, then it's a byte string of 8
> bits per byte, and the maximum for a byte is 255.
>
>> For example, note the differences in output between a unicode string
>> and a byte string regarding character 257, as a unicode string it is
>> 257, as a byte string it is 196.
>
> Yes.
>
>  perl -Mutf8 -lwe 'print ord "Я"'  # 1071
>  perl        -lwe 'print ord "Я"'  #  208

Thanks for all of that, Michael; now I can rest easier! An educated
assumption is better than an anecdotal guess.

>> The reason this is relevant is that on a given project I am using
>> byte-strings-only for consistency and some encoders (i.e.
>> Scalar::Quote::Q() )will change from
>> bytes-string-friendly-grapheme-cluster notation (e.g. \xE3\x8A\xB7)
>> to unicode-string-notation (e.g. \x{32B7}) and I want to be sure I
>> always use data that gets me  the former rather than the latter :)
>
> Well, if you don't need character operations, it might work for you.
> Make sure to track whether or not your data is already encoded, and also
> to use the correct encoding.
>
> --
> Michael Ludwig

Yeah, it is a pretty strict environment where encoding is strictly
handled and always UTF-8, the code won't ever `use utf8`, and the
strings in question will only be output (i.e. no character operations).

Again thank you very much!

--
Dan Muey
dan
10/28/2010 10:59:14 PM
* Dan Muey <dan@cpanel.net> [2010-10-28 21:55]:
> For example, note the differences in output between a unicode
> string and a byte string regarding character 257, as a unicode
> string it is 257, as a byte string it is 196.

That is not what’s going on.

    $ perl -E'say ord "1234"'
    49

When you pass a multi-character string to `ord`, you get the code
point of the first character.

    $ perl -E'say chr 49'
    1

In your case you get 196. That is 0xC4, or the character Ä. It is
not the character ā (U+101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8
sequence that encodes the character 257. You are passing a string
containing a representation of those bytes as two characters to
`ord`, and `ord` is giving you the code point of the first
byte-as-character.

You are missing the rest of the bytes from the UTF-8 encoding.

You are losing data.

If you try this on more code points you will find that there are
*lots* of different characters that are reported as 196 – because
they get encoded as multi-byte sequences that all start with the
byte value 0xC4.

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>
pagaltzis
10/29/2010 7:30:37 AM
On Oct 29, 2010, at 2:30 AM, Aristotle Pagaltzis wrote:

> * Dan Muey <dan@cpanel.net> [2010-10-28 21:55]:
>> For example, note the differences in output between a unicode
>> string and a byte string regarding character 257, as a unicode
>> string it is 257, as a byte string it is 196.
>
> That is not what’s going on.
>
>    $ perl -E'say ord "1234"'
>    49
>
> When you pass a multi-character string to `ord`, you get the code
> point of the first character.

Thank you for clarifying what I was highlighting.

> You are missing the rest of the bytes from the UTF-8 encoding.
>
> You are losing data.

Thanks, I do understand that and appreciate you expounding it for me
further. Allow me to explain why this question came up:

I am using Scalar::Quote on byte strings and it uses ord() to determine
if it will use byte string grapheme notation (e.g. \xE3\x8A\xB7) or
unicode string notation (e.g. \x{32B7}).

multivac:~ dmuey$ perl -MScalar::Quote=Q -E 'say Q("Perl is the ㊷™");'
"Perl is the \xe3\x8a\xb7\xe2\x84\xa2"
multivac:~ dmuey$

multivac:~ dmuey$ perl -E 'say "Perl is the \xe3\x8a\xb7\xe2\x84\xa2";'
Perl is the ㊷™
multivac:~ dmuey$

It appears to do what I need assuming 2 things:
 a) the string is a byte string
     (e.g. perl -MScalar::Quote=Q -E 'say Q("Perl is the \x{32b7}\x{2122}");')
 b) we are not under "use utf8"
     (e.g. perl -MScalar::Quote=Q -E 'use utf8; say Q("Perl is the ㊷™");')

I just wanted to verify that its use of ord() in its logic wouldn't
unexpectedly result in me getting back \x{32B7} under some weird
circumstance I overlooked.

Thanks again, everyone. I really appreciate it!

--
Dan Muey
dan
10/29/2010 1:19:07 PM