Bug#516129: perl-modules: CGI.pm unwanted UTF-8 conversion in URLs

Mon Feb 23 10:01:04 UTC 2009

Dear Niko,

> > > > Function url(-path-info=>1) does not work well if I have ISO-8859-2
> > > > accented chars in the URL. Utility function CGI::Util::escape()
> > > > unconditionally forces an ISO-8859-1 -> UTF-8 conversion:
> > > > 
> > > >   # force bytes while preserving backward compatibility -- dankogai
> > > >   $toencode = pack("C*", unpack("U0C*", $toencode));
> 
> > Unfortunately 3.38 does not work.
> 
> OK, thanks.
> 
> I must admit I'm a bit confused about the problem. Could you please
> give a simple test case (either a command-line version or a CGI script)
> with the current result and the one you'd expect?

See below.

> As far as I can see (looking at 3.29), url(-path-info=>1) will unescape()
> the PATH_INFO variable into 8-bit characters and then encode those manually
> into URL encoding with sprintf() as the last thing in the url() function.
> 
> I can't see CGI::Util::escape() being called here - are you calling
> that manually?

url() calls query_string(), which in turn calls escape().

> I do get your point about the idempotency of course:
> 
> % perl -MCGI::Util=escape,unescape -E 'say escape(unescape("%E4"))'  
> %C3%A4
> 
> but it's not clear to me what this breaks, particularly as those aren't
> public subroutines.

This is the scenario:

My CGI program produces an ISO-8859-2 encoded HTML page containing a form
that is submitted with the GET method.
The user enters some accented chars (e.g. "ä") in the form, then clicks the submit button.
The browser honors the page encoding and sends a latin2 encoded URL back to the server,
like http://www.example.com/sample.cgi?search=%E4 .
The CGI program unescapes the query string and stores it internally as {search=>"\xe4"}.
It then calls url() in order to place a self-pointing URL on the next HTML page.
Sub url() calls query_string(), which uses CGI::Util::escape() and produces this:
http://www.example.com/sample.cgi?search=%C3%A4 .
This is because escape() assumes that the HTML page is encoded in UTF-8.
If the user follows this link, the browser sends this wrong URL back.
After unescaping, the program stores {search=>"\xc3\xa4"} and prints
http://www.example.com/sample.cgi?search=%C3%83%C2%A4 in the next round,
and so on.
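
The rounds above can be imitated with a small stand-alone sketch. It does
not use CGI.pm at all: escape_as_utf8() and unescape_bytes() are hypothetical
stand-ins, written on the assumption that escape() reads each byte as an
ISO-8859-1 character and re-encodes it as UTF-8 before percent-encoding.

```perl
use strict;
use warnings;

# Hypothetical stand-in (not CGI::Util::escape itself): treat every byte as
# an ISO-8859-1 character, convert to UTF-8, then percent-encode.
sub escape_as_utf8 {
    my ($s) = @_;
    utf8::encode($s);    # characters -> UTF-8 bytes, in place
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

# Hypothetical stand-in: plain percent-decoding into bytes.
sub unescape_bytes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $value = "\xE4";      # what the browser sent as %E4
my @rounds;
for my $round (1 .. 3) {
    my $escaped = escape_as_utf8($value);
    push @rounds, $escaped;
    print "round $round: search=$escaped\n";
    $value = unescape_bytes($escaped);   # user follows the link
}
```

Under that assumption the value grows on every round: %C3%A4 first, then
%C3%83%C2%A4, as in the URLs above.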

I could not demonstrate this behavior off-line.
But I set up a short demo program that you can test with your browser
if necessary.

The script below shows no more than your one-liner above.

------------------8<---------------------8<---------------
#!/usr/bin/perl

use strict;
use CGI::Util;
use Dumpvalue; my $dumper=Dumpvalue->new(quoteHighBit=>1);

my $latin2_string = "a\341e\351i\355o\363\366\365u\372\374\373"; #aáeéiíoóöőuúüű
$dumper->dumpValue($latin2_string);

my $escaped_string = CGI::Util::escape($latin2_string);
$dumper->dumpValue($escaped_string);

my $unescaped_string = CGI::Util::unescape($escaped_string);
$dumper->dumpValue($unescaped_string);
------------------8<---------------------8<---------------

Ooops! Stop the press.
I've just noticed in the 3.29 source that CGI::Util::escape() has changed
substantially. It seems to be good for my purposes:
$toencode = pack("C*", unpack("C*", $toencode));
Note: this line can be omitted. :-)
(However, it may cause problems if someone wants to use UTF-8.)
Unfortunately the latest version (3.42) is confused again. :-(
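
A quick check of why that line can be omitted, as a minimal sketch that
assumes the input is a plain byte string (no characters above 255). The
variable names are only for illustration.

```perl
use strict;
use warnings;

# A latin2 byte string: every character is in the range 0..255.
my $latin2_bytes = "a\341e\351i\355";

# unpack("C*") lists the byte values, pack("C*") puts them back together,
# so for a byte string the result is the same string.
my $round_trip = pack("C*", unpack("C*", $latin2_bytes));

print $round_trip eq $latin2_bytes ? "unchanged\n" : "changed\n";
```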

Actually I defined my own MyCGI subclass that overrides CGI::query_string()
and CGI::Util::escape(). This works for me, but it is neither a simple nor
an elegant solution.
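
For illustration only, a byte-preserving escape might look roughly like the
sketch below. It is not the code of my subclass and not part of CGI.pm:
escape_bytes() and unescape_bytes() are hypothetical names, and the sketch
assumes the input is already a byte string in the page's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode each unsafe byte as it is,
# without any charset conversion.
sub escape_bytes {
    my ($s) = @_;
    return undef unless defined $s;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

# Hypothetical helper: plain percent-decoding into bytes.
sub unescape_bytes {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $once  = escape_bytes("\xE4");
my $twice = escape_bytes(unescape_bytes($once));
print "$once $twice\n";    # the value stays stable across rounds
```

With such a pair, unescaping and escaping again gives back the same URL,
whatever 8-bit encoding the page uses.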

Cheers

Gabor