[xml/sgml-pkgs] Bug#378411: Buffer overflow in XML::Parser::Expat triggered by utf8

Joris van Rantwijk rantwijk at science.uva.nl
Mon Aug 7 08:53:38 UTC 2006

On Sat, 2006-08-05 at 14:12 -0400, Joey Hess wrote:
> Would just calling Encode::decode_utf8 on the input string in Expat.pm
> be the simplest fix?

I'm not sure, but I think not.
First of all, in the case I reported, the parser reads directly from an
input stream. The data is then not touched by Expat.pm, but handled
internally in Expat.xs.

It seems to me that the reported overflow cannot be triggered in the
case where a string (as opposed to a stream) is passed to XML::Parser.
It also seems to me that XML parsing on a string will proceed correctly
regardless of whether the string is logical Unicode or raw UTF8, since
both kinds of strings are essentially the same at the level of Perl.

Secondly, the cause of the reported stream parsing problem is not that
Expat does not handle UTF8 data; it handles that fine. The problem is
that it *expects* raw UTF8 bytes but, in my case, gets logical Unicode
characters instead. It breaks on that because of an invalid assumption
in the buffer management code.
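
To illustrate the kind of assumption described above, here is a minimal
sketch in plain C. It is not the attached patch and not code from
Expat.xs; the names utf8_char_count and copy_chunk are hypothetical.
It only shows that a chunk of M characters can occupy more than M bytes
once encoded as UTF8, so a buffer sized by character count has to be
checked against the byte length before copying.

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a UTF-8 byte sequence by skipping
 * continuation bytes (those of the form 10xxxxxx). Assumes the input
 * is well-formed UTF-8. */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t chars = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Hypothetical helper: copy a chunk into a buffer of fixed capacity.
 * The check uses the BYTE length of the chunk, never the character
 * count. Returns the number of bytes copied, or -1 if the chunk does
 * not fit (nothing is written in that case). */
long copy_chunk(char *buf, size_t capacity,
                const char *chunk, size_t chunk_bytes)
{
    if (chunk_bytes > capacity)
        return -1;
    memcpy(buf, chunk, chunk_bytes);
    return (long)chunk_bytes;
}
```

For example, a chunk holding the four characters of "café" is five bytes
in UTF8, so a caller that asked for four characters and allocated four
bytes would overflow without the check in copy_chunk.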

I dived into Expat.xs again and believe I have a simple fix that stops
the buffer management from overflowing the heap. Due to Perl's identical
internal treatment of utf8 and Unicode, this should be all that is
necessary to enable correct parsing of Unicode streams.

My patch is attached.
Basic testing suggests that it works as intended. But I have very little
experience with Perl XS coding, so I would recommend that somebody
reviews this before it is applied anywhere.

Thanks for pushing this forward a bit; we should get it fixed.


PS. (and slightly off-topic) My personal opinion is that Perl has
utterly messed up Unicode handling. The documentation uses the terms
"Unicode" and "UTF8" as if they were interchangeable. In fact, and as we
see with this bug, there is a very important conceptual difference
between "a string containing N raw utf8 bytes" and "a string containing
M logical Unicode characters".

-------------- next part --------------
A non-text attachment was scrubbed...
Name: XML-Parser-2.34-unicodecrash.patch
Type: text/x-patch
Size: 1769 bytes
Desc: not available
Url : http://lists.alioth.debian.org/pipermail/debian-xml-sgml-pkgs/attachments/20060807/4cdf7e82/XML-Parser-2.34-unicodecrash.bin
