[xml/sgml-pkgs] Bug#649189: libxml2-utils: HTML parser accepts invalid element name (starting with full stop) to document tree

Zsban Ambrus ambrus at math.bme.hu
Fri Nov 18 17:24:48 UTC 2011


Package: libxml2-utils
Version: 2.7.8.dfsg-2+squeeze1
Severity: normal



Dear maintainer,

When the HTML parser of libxml2 (with the recover option) meets a tag whose
name starts with a full stop, it correctly detects that this is invalid
HTML, but nevertheless accepts the tag with that name into the document
tree.  This means that if you serialize the same document tree as XML, the
output is not well-formed XML.

Here's an example.

$ xmllint --html --xmlout - <<<'<.m>r'
-:1: HTML parser error : Tag .m invalid
<.m>r
   ^
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><.m>r
</.m></body></html>
$

The `<.m>' part is not well-formed XML, because XML element names cannot
start with a full stop.  You can see this if you try to parse the output
with an XML parser, e.g. with xmllint.
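
The same check can be sketched without xmllint.  This is only a minimal
illustration, assuming Python's standard xml.etree.ElementTree module as a
stand-in for any conforming XML parser; the helper name is_well_formed is
hypothetical, and the input is just the element part of the output above
(XML declaration and DOCTYPE left out).

```python
import xml.etree.ElementTree as ET

# Element part of the xmllint output above (declaration and DOCTYPE omitted).
serialized = "<html><body><.m>r\n</.m></body></html>"

# Same structure with a legal element name, for comparison.
corrected = "<html><body><m>r\n</m></body></html>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(serialized))  # False: '.' cannot start an element name
print(is_well_formed(corrected))   # True
```

The parser rejects the first string at the `<.m>' tag, which is the same
problem an XML consumer of the xmllint output would run into.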

In case you're interested, I noticed this bug when I tried to parse
some (invalid) HTML documents with the Perl module XML::LibXML (which
uses the libxml2 library as its backend) and output them as XML.


-- System Information:
Debian Release: 6.0.3
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.37 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=hu_HU (charmap=ISO-8859-2)
Shell: /bin/sh linked to /bin/bash

Versions of packages libxml2-utils depends on:
ii  libc6              2.11.2-10             Embedded GNU C Library: Shared lib
ii  libreadline6       6.1-3                 GNU readline and history libraries
ii  libxml2            2.7.8.dfsg-2+squeeze1 GNOME XML library

libxml2-utils recommends no packages.

libxml2-utils suggests no packages.

-- no debconf information





