[DRE-maint] Bug#534721: libhpricot-ruby1.8: Hpricot's XML parser fails to parse simple, valid XML

T Chan something-bz at sodium.serveirc.com
Fri Jun 26 17:16:08 UTC 2009


Package: libhpricot-ruby1.8
Version: 0.8-2
Severity: grave
Justification: renders package unusable


This bug also applies to libhpricot-ruby1.9.

Problems:
- Valid XML is rendered invalid.
- XML is no longer parseable.
- Invalid XML is not rejected by default, although the standard requires rejection. (minor)

Workaround:
  $ aptitude install libhpricot-ruby1.8=0.6-2

Discussion:

Closing tags are sometimes not parsed correctly, causing the parser to "helpfully" add closing tags. Whether this happens seems to be pseudorandom:
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<aaaa></aaaa>')"
  <aaaa></aaaa>
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>')"
  <zzzz></zzzz></zzzz>
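
To map which tag names trigger the problem, the two commands above can be generalised into a round-trip check. This is only a sketch: altered_tags is a hypothetical helper name, and the parser is passed in as a block so the helper itself does not depend on Hpricot.

```ruby
# Hypothetical helper (not part of Hpricot): returns the tag names whose
# empty element "<name></name>" does not survive a parse-and-print round trip.
# The block receives the input string and must return the serialised output.
def altered_tags(names, &roundtrip)
  names.reject do |name|
    input = "<#{name}></#{name}>"
    roundtrip.call(input) == input
  end
end

# With Hpricot installed, the block would presumably look like this
# (assuming to_s gives the same text that print shows above):
#   require 'hpricot'
#   puts altered_tags(%w[aaaa zzzz]) { |xml| Hpricot.XML(xml).to_s }
```

Running it over a larger list of names should show whether the failures follow any pattern in the tag name.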

The effect is similar to the (incorrect) behaviour when Hpricot detects malformed XML:
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
  <a></b></a>
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<a>b')"
  <a>b</a>

The unparsed closing tag appears to be treated as a child element, like <zzzz/>:
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>').search('/zzzz')"
  <zzzz></zzzz></zzzz>
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>').search('/zzzz/zzzz')"
  </zzzz>

This causes the nesting to break, rendering most XML completely unparseable:
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>')"
  <a><zzzz></zzzz><b></b></zzzz></a>
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>').search('/a/b')"
  (no output)
  $ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>').search('/a/zzzz/b')"
  <b></b>
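
The broken nesting is visible in the printed output alone, so it can be detected without trusting any parser. The following is a minimal sketch of a stack-based tag balance check; balanced_tags? is a hypothetical helper, and it only understands plain opening, closing and self-closing tags (no comments, CDATA or processing instructions).

```ruby
# Hypothetical helper: true when every closing tag matches the most recently
# opened tag and nothing is left open at the end of the string.
def balanced_tags?(xml)
  stack = []
  xml.scan(%r{<(/?)([A-Za-z_][\w.\-]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    if closing == "/"
      # A closing tag must match the innermost open element.
      return false unless stack.pop == name
    elsif self_closing != "/"
      # An opening tag (not self-closing) starts a new nesting level.
      stack.push(name)
    end
  end
  stack.empty?
end

puts balanced_tags?('<a><zzzz></zzzz><b></b></a>')          # the input above
puts balanced_tags?('<a><zzzz></zzzz><b></b></zzzz></a>')   # the 0.8-2 output above
```

The first call checks the original input and the second checks the output reported above, so the helper could serve as a quick regression check on whatever a given Hpricot version prints.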

This might be related to how Hpricot treats unrecognized closing tags.
  0.6-2 closes the correct tag, ignoring the contents of the closing tag (this is also invalid behaviour for an XML parser):
    $ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
    <a></a>
  0.8-2 is broken as above:
    $ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
    <a></b></a>

I suspect the problem is in hpricot_scan.so, but hpricot_scan.c is full of auto-generated code.

-- System Information:
Debian Release: squeeze/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'stable')
Architecture: i386 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages libhpricot-ruby1.8 depends on:
ii  libc6                        2.9-12      GNU C Library: Shared libraries
ii  libruby1.8                   1.8.7.174-1 Libraries necessary to run Ruby 1.

libhpricot-ruby1.8 recommends no packages.

libhpricot-ruby1.8 suggests no packages.

-- no debconf information