CGI.pm url_encoding problem

Hi!

This is the code:

use CGI;
set_url_encoding('utf-8');

The problem is that "use CGI" automagically initializes the parameters 
*before* I set the encoding of them, so set_url_encoding will run too late.

Any idea?

Bye,
   Andras

andras
4/18/2005 9:16:25 AM
perl.perl6.compiler

Andras,

Well, once we have a proper "use", we should be able to set the encoding
at compile time. But until then, I see a few possible options:

- setting the url encoding forces a re-encoding of any parameters
already encoded.

This means extra work if you change the encoding, but it will only
happen once.

- moving the decoding process to be "on demand" when fetching the params

This would slow down the param() function, but would mean you only
decoded exactly what you needed and nothing more.

Either one is a simple change.
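
A minimal sketch of how the two options above could combine, written in
Python purely for illustration (it is not the Perl 6 CGI.pm code under
discussion). The class name LazyParams and its internals are
hypothetical; only the names set_url_encoding and param come from the
thread. The idea is to keep the percent-decoded bytes around, decode to
text only when a parameter is fetched, and throw away cached results
when the encoding changes.

```python
from urllib.parse import unquote_to_bytes


class LazyParams:
    """Hypothetical sketch: store raw bytes, decode to text on demand."""

    def __init__(self, query_string, encoding="iso-8859-1"):
        self.encoding = encoding
        self._raw = {}      # parameter name (bytes) -> raw value (bytes)
        self._cache = {}    # parameter name (str) -> decoded value (str)
        for pair in query_string.split("&"):
            if not pair:
                continue
            name, _, value = pair.partition("=")
            key = unquote_to_bytes(name.replace("+", " "))
            self._raw[key] = unquote_to_bytes(value.replace("+", " "))

    def set_url_encoding(self, encoding):
        # Changing the encoding only invalidates already-decoded values.
        self.encoding = encoding
        self._cache.clear()

    def param(self, name):
        # Decode on first access with whatever encoding is current.
        if name not in self._cache:
            raw = self._raw.get(name.encode(self.encoding))
            self._cache[name] = None if raw is None else raw.decode(self.encoding)
        return self._cache[name]


params = LazyParams("name=%C3%A1&x=1")
params.set_url_encoding("utf-8")   # arrives "late", but nothing is lost
print(params.param("name"))        # á
```

Because the raw bytes are kept, setting the encoding after construction
costs nothing unless a value was already fetched.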

- Stevan


On Apr 18, 2005, at 5:16 AM, BÁRTHÁZI András wrote:
> Hi!
>
> This is the code:
>
> use CGI;
> set_url_encoding('utf-8');
>
> The problem is that "use CGI" automagically initializes the parameters
> *before* I set the encoding of them, so set_url_encoding will run too
> late.
>
> Any idea?
>
> Bye,
>   Andras
>
>

stevan
4/18/2005 12:52:28 PM
>>>>> "BÁRTHÁZI" == BÁRTHÁZI András <andras@barthazi.hu> writes:

BÁRTHÁZI> use CGI;
BÁRTHÁZI> set_url_encoding('utf-8');

BÁRTHÁZI> The problem is that "use CGI" automagically initializes the parameters
BÁRTHÁZI> *before* I set the encoding of them, so set_url_encoding will run too
BÁRTHÁZI> late.

Did I miss the memo where anything outside the list of valid
URI characters needed to be hexified, hence there's no need
for such a URL encoding scheme?  Where is this memo?

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
merlyn
4/18/2005 1:10:39 PM
Stevan,

> Well once we have a proper "use", we should be able to set the encoding 
> at compile time. But until then, I see a few possible options:

I think it would be nice to find another solution.

> - setting the url encoding forces a re-encoding of any parameters 
> already encoded.
> 
> This means extra work if you change the encoding, but it will only 
> happen once.

It can't work (or only with a big overhead), because POST parameters
come from STDIN, which can be read only once. To do it, you would have
to store the whole input, which can be large.

> - moving the decoding process to be "on demand" when fetching the params
> 
> This would slow down the param() function, but would mean you only decoded 
> exactly what you needed and nothing more.

That sounds good, and I have another idea. What if the first param()
call triggered the decoding of all parameters? It adds no overhead,
because you have to do that work anyway to get a parameter, and it is
an improvement: if you never query a parameter (you include CGI.pm
just to print header(), etc.), then no processing or decoding happens
at all.
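
The idea above can be sketched as follows, again in Python for
illustration only rather than as the Perl 6 CGI.pm implementation. The
class DeferredCGI, its constructor arguments, and the in-memory stream
standing in for STDIN are all assumptions; param, header, and
set_url_encoding are the names used in the thread. Nothing is read from
the body until the first param() call, so the encoding can still be set
after construction.

```python
import io
from urllib.parse import parse_qsl


class DeferredCGI:
    """Hypothetical sketch: nothing is read or decoded until param() is called."""

    def __init__(self, body_stream, content_length, encoding="iso-8859-1"):
        self._stream = body_stream        # stands in for STDIN, readable once
        self._length = content_length
        self.encoding = encoding
        self._params = None               # None means "not parsed yet"

    def set_url_encoding(self, encoding):
        if self._params is not None:
            raise RuntimeError("parameters were already decoded")
        self.encoding = encoding

    def header(self):
        # Printing a header never touches the request body.
        return "Content-Type: text/html\r\n\r\n"

    def param(self, name):
        if self._params is None:
            # First call: read the body once and decode every parameter.
            raw = self._stream.read(self._length).decode("ascii")
            pairs = parse_qsl(raw, keep_blank_values=True, encoding=self.encoding)
            self._params = dict(pairs)
        return self._params.get(name)


body = io.BytesIO(b"q=%C3%A1")
cgi = DeferredCGI(body, content_length=8)
cgi.header()                           # body is still unread here
unread_after_header = body.tell() == 0
cgi.set_url_encoding("utf-8")          # still in time: nothing decoded yet
value = cgi.param("q")                 # triggers the one-time read and decode
```

A script that only prints headers never pays for parsing, and the
single read of the input stream is preserved.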

Bye,
   Andras
andras
4/18/2005 1:16:02 PM
Randal,

> BÁRTHÁZI> use CGI;
> BÁRTHÁZI> set_url_encoding('utf-8');
> 
> BÁRTHÁZI> The problem is that "use CGI" automagically initializes the parameters
> BÁRTHÁZI> *before* I set the encoding of them, so set_url_encoding will run too
> BÁRTHÁZI> late.
> 
> Did I miss the memo where anything outside the list of valid
> URI characters needed to be hexified, hence there's no need
> for such a URL encoding scheme?  Where is this memo?

Can you say that again in other words? Neither Stevan nor I
understand.

Bye,
   Andras
andras
4/18/2005 2:25:57 PM
>>>>> "BÁRTHÁZI" == BÁRTHÁZI András <andras@barthazi.hu> writes:

>> Did I miss the memo where anything outside the list of valid
>> URI characters needed to be hexified, hence there's no need
>> for such a URL encoding scheme?  Where is this memo?

BÁRTHÁZI> Can you say that again in other words? Neither Stevan nor I
BÁRTHÁZI> understand.

URLs are only 7 bit ASCII, according to the RFCs.  Did I miss a new RFC
where non-7-bit URLs are permitted?  If so, please point to that.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
merlyn
4/18/2005 2:29:12 PM
BÁRTHÁZI András wrote:

> Randal,
> 
>> BÁRTHÁZI> use CGI;
>> BÁRTHÁZI> set_url_encoding('utf-8');
>>
>> BÁRTHÁZI> The problem is that "use CGI" automagically initializes the
>> BÁRTHÁZI> parameters *before* I set the encoding of them, so
>> BÁRTHÁZI> set_url_encoding will run too late.
>>
>> Did I miss the memo where anything outside the list of valid
>> URI characters needed to be hexified, hence there's no need
>> for such a URL encoding scheme?  Where is this memo?
> 
> 
> Can you say that again in other words? Neither Stevan nor I
> understand.

I believe that the standard for URLs calls for always encoding in UTF-8
but that all non-ASCII bytes (bytes with the high bit set) are to be
further encoded using %xx hex notation.  So the URL is always
transmitted as an ASCII string, but is easily converted into a UTF-8
string simply by converting the %xx codes back into binary bytes.  Thus
firewalls and proxies need only deal with ASCII.


-- 
mark@biggar.org
mark.a.biggar@comcast.net
mark
4/18/2005 2:38:28 PM
Hi,

Randal L. Schwartz wrote:
>>>>>>"BÁRTHÁZI" == BÁRTHÁZI András <andras@barthazi.hu> writes:
> 
>>>Did I miss the memo where anything outside the list of valid
>>>URI characters needed to be hexified, hence there's no need
>>>for such a URL encoding scheme?  Where is this memo?
> 
> 
> BÁRTHÁZI> Can you say that again in other words? Neither Stevan nor I
> BÁRTHÁZI> understand.
> 
> URLs are only 7 bit ASCII, according to the RFCs.  Did I miss a new RFC
> where non-7-bit URLs are permitted?  If so, please point to that.

You are right, only 7-bit ASCII is allowed in URLs. But you can store
any character in a URL if you encode it with "URL encoding". For
example, "á" in UTF-8 is encoded as "%C3%A1".

RFC 1738 [1], section 2.2, covers this (though only for the iso-8859-1
encoding). Or you can read a short tutorial about it at Blooberry [2].
Don't tell me that you have never heard of this before. :)

Anyway, it's not just about URL encoding (the URL and the GET
parameters); POST parameters work the same way.
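
As a quick check of the "%C3%A1" example, here is a small Python sketch
(Python is used only for illustration; it is not part of the CGI.pm
under discussion). It percent-encodes the same character under two
charsets and builds a form-style body the way a POST submission would
carry it. The variable names are arbitrary.

```python
from urllib.parse import quote, urlencode

# The same character gives different escapes depending on the charset
# used to turn it into bytes before percent-encoding.
utf8_form = quote("á", encoding="utf-8")
latin1_form = quote("á", encoding="iso-8859-1")

# application/x-www-form-urlencoded bodies (POST) use the same escaping.
post_body = urlencode({"name": "á"}, encoding="utf-8")

print(utf8_form)     # %C3%A1
print(latin1_form)   # %E1
print(post_body)     # name=%C3%A1
```

Either way the result on the wire is plain 7-bit ASCII.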

Bye,
   Andras

[1] http://www.rfc-editor.org/rfc/rfc1738.txt
[2] http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
andras
4/18/2005 2:44:06 PM
Hi,

> I believe that the standard for URLs calls for always encoding in UTF-8
> but that all non-ASCII bytes (bytes with the high bit set) are to be
> further encoded using %xx hex notation.  So the URL is always
> transmitted as an ASCII string, but is easily converted into a UTF-8
> string simply by converting the %xx codes back into binary bytes.  Thus
> firewalls and proxies need only deal with ASCII.

You're right, except for one thing: when the standard was created,
there was no UTF-8 encoding, so it can't be the default. I think the
standard does not say how non-ASCII characters are encoded
(iso-8859-*, utf-8, or something else). And I am sure that browsers
send back non-ASCII characters in the same encoding as the page that
contained the form, so UTF-8 is not the default; there is no default.
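
To illustrate why the receiving side has to be told the encoding, here
is a small Python sketch (for illustration only, not the Perl 6 CGI.pm
API). The escape "%F5" is an arbitrary example chosen because the byte
means something different in each charset; the variable names are
hypothetical.

```python
from urllib.parse import unquote_to_bytes

escaped = "%F5"                       # what arrives on the wire
raw = unquote_to_bytes(escaped)       # a single byte, 0xF5

# The same byte is a different character in each single-byte charset.
as_latin1 = raw.decode("iso-8859-1")  # õ (o with tilde)
as_latin2 = raw.decode("iso-8859-2")  # ő (o with double acute)

# As UTF-8 the byte is not even valid on its own.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1, as_latin2, valid_utf8)
```

Nothing in the escaped string itself says which reading is intended,
which is why a call such as set_url_encoding is needed at all.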

Bye,
   Andras
andras
4/18/2005 2:50:54 PM