Bug#655335: HTML parsing now breaks on entities and mismatched tags

Josh Triplett josh at joshtriplett.org
Tue Jan 10 13:30:15 UTC 2012


Package: get-flash-videos
Version: 1.25~git2011.09.26-2
Severity: normal

At some point recently, get-flash-videos started failing whenever it
tries to parse HTML.  It complains about improperly paired tags, fails
to parse standard HTML entities like &nbsp; and &uarr;, and tries to
parse the && in JavaScript as an entity.

If it matters, this occurred when attempting to use get-flash-videos on
CollegeHumor URLs.  For example:

$ ./get-flash-videos 'http://www.collegehumor.com/video/3505939/font-conference'
Downloading http://www.collegehumor.com/video/3505939/font-conference
Using method 'collegehumor' for http://www.collegehumor.com/video/3505939/font-conference
Error: :39: parser error : Opening and ending tag mismatch: meta line 4 and head
</head>
       ^
:207: parser error : Entity 'uarr' not defined
        <div id="btn_upload" class="button"><a href="/submit">Submit &uarr;</a><
                                                                           ^
:222: parser error : Entity 'nbsp' not defined
                        <a href="javascript:void(0);" class="close" id="login_cancel">&nbsp;</a>
                                                                                            ^
:684: parser error : Entity 'copy' not defined
                                                                                                <p>&copy; 2012 Connected Ventures, LLC. All rights reserved. | Broug
                                                                                                         ^
:760: parser error : xmlParseEntityRef: no name
                                        if(e.target && e.target.nodeName == 'IFRAME') {
                                                     ^
:760: parser error : xmlParseEntityRef: no name
                                        if(e.target && e.target.nodeName == 'IFRAME') {
                                                      ^
:838: parser error : Opening and ending tag mismatch: head line 3 and html
</html>
       ^
:839: parser error : Premature end of data in tag html line 2

^
 (from FlashVideo::Site::Collegehumor::./get-flash-videos::1512)



I don't know whether get-flash-videos has changed how it invokes libxml,
or whether libxml's HTML parsing has broken.
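
For what it's worth, every error class above (undefined named entities,
a bare && treated as an entity reference, an unclosed <meta> causing a
tag mismatch) is what a strict XML parser produces when it is fed
ordinary HTML.  The sketch below only illustrates that general
behaviour.  It uses Python's standard library rather than libxml or
get-flash-videos, and SNIPPET, parse_as_xml, parse_as_html and
TextCollector are made-up names for this example, so it shows nothing
about which component actually changed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical markup imitating the constructs that fail in the report:
# an unclosed <meta>, named HTML entities, and "&&" inside a script.
SNIPPET = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><a href='/submit'>Submit &uarr;</a>&nbsp;"
    "<script>if (a && b) {}</script></body></html>"
)


def parse_as_xml(markup):
    """Parse with a strict XML parser; return the error text, or None."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None


class TextCollector(HTMLParser):
    """Lenient HTML parser that just gathers the text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_html(markup):
    """Parse with an HTML-aware parser; return the collected text."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print("XML mode: ", parse_as_xml(SNIPPET))
    print("HTML mode:", repr(parse_as_html(SNIPPET)))
```

In XML mode the snippet is rejected, while the HTML-aware parser
resolves the entities and leaves the script body alone.  If libxml
behaves the same way here, the output above would be consistent with
the page being parsed in XML mode rather than HTML mode, but that is
only a guess.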

- Josh Triplett

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 3.1.0-1-amd64 (SMP w/4 CPU cores)
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages get-flash-videos depends on:
ii  libdata-amf-perl       0.09-3
ii  libhtml-parser-perl    3.69-1+b1
ii  libtie-ixhash-perl     1.21-2
ii  liburi-perl            1.59-1
ii  libwww-mechanize-perl  1.71-1
ii  libwww-perl            6.03-1
ii  perl                   5.14.2-6
ii  rtmpdump               2.4+20111222.git4e06e21-1

Versions of packages get-flash-videos recommends:
ii  get-iplayer                 <none>
ii  libcrypt-rijndael-perl      1.08-1+b2
ii  liblwp-protocol-socks-perl  <none>
ii  libxml-simple-perl          2.18-3

Versions of packages get-flash-videos suggests:
ii  mplayer  2:1.0~rc4.dfsg1+svn33713-5

-- no debconf information




