Bug#633511: libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get and insufficient documentation

Vincent Lefevre vincent at vinc17.net
Mon Jul 11 01:30:45 UTC 2011


Package: libwww-perl
Version: 6.02-1
Severity: normal
Tags: upstream

This bug report is essentially what I filed upstream at

  https://rt.cpan.org/Public/Bug/Display.html?id=69393

with some additional information concerning Debian.

When a file declared as iso-8859-1 and served as text/html is also
a valid UTF-8 file, LWP::Simple::get from libwww-perl 6.02 regards
it as a UTF-8 encoded file. This is incorrect.

For instance, with lwp-dump being

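#!/usr/bin/env perl

use strict;
use warnings;
use Devel::Peek;
use LWP::Simple;

# Fetch one URL and dump the internal representation of the result.
@ARGV == 1 or die "Usage: $0 <URL>\n";
my $url = shift;
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
# Devel::Peek writes its dump to stderr.
Dump $file;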

and when running

  for i in 1a 1h 2a 2h
  do
    ./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
        2> perl-lwp-test$i.dump
  done

I get (see perl-lwp-test1h.dump in particular):

==> perl-lwp-test1a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
  CUR = 71
  LEN = 80

==> perl-lwp-test1h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 80

==> perl-lwp-test2a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

==> perl-lwp-test2h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
  CUR = 72
  LEN = 80

Note: my examples are not HTML files, but this doesn't matter. I first
thought the problem occurred for all text/* files (e.g. text/xml, that's
why I just wrote basic XML files), but in fact only text/html seems to
be affected.
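
To make the 1h result concrete, here is a minimal sketch that uses only
the core Encode module (no network access, no LWP). It decodes the same
six bytes once with the declared charset and once as UTF-8. The variable
names are mine and serve only as illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes from the test body: "post" followed by 0xC3 0xA9.
my $bytes = "post\xC3\xA9";

# Declared charset (iso-8859-1): every byte maps to exactly one character.
my $latin1 = decode('ISO-8859-1', $bytes);

# What the 1h dump shows instead: the two high bytes read as one UTF-8 character.
my $utf8 = decode('UTF-8', $bytes);

printf "as ISO-8859-1: %d chars, last = U+%04X\n",
    length($latin1), ord(substr($latin1, -1));   # 6 chars, last = U+00A9
printf "as UTF-8:      %d chars, last = U+%04X\n",
    length($utf8), ord(substr($utf8, -1));       # 5 chars, last = U+00E9
```

The ISO-8859-1 reading corresponds to the \x{c3}\x{a9} pair in the 1a dump,
the UTF-8 reading to the single \x{e9} in the 1h dump.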

How the bug should be fixed depends on the expected behavior. However,
LWP::Simple::get is not sufficiently documented, so the other cases
are potentially wrong too. Indeed, in lenny, I always get a sequence
of bytes (no UTF8 flag):

==> perl-lwp-test1a.dump <==
SV = PVIV(0x1b1ef38) at 0x1bec568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1c04130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test1h.dump <==
SV = PVIV(0x166af38) at 0x1738568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1750130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2a.dump <==
SV = PVIV(0x2150f38) at 0x221e568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x2236130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2h.dump <==
SV = PVIV(0x1752f38) at 0x1820568
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x1838130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

and in squeeze the output is the same, except for
perl-lwp-test1h.dump, which is already wrong:

==> perl-lwp-test1a.dump <==
SV = PV(0x23ce758) at 0x1e455f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x23ce5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test1h.dump <==
SV = PV(0x2afe758) at 0x25755f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x2d5f9f0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
  CUR = 69
  LEN = 72

==> perl-lwp-test2a.dump <==
SV = PV(0x2a5d758) at 0x24d45f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2a5d5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

==> perl-lwp-test2h.dump <==
SV = PV(0x28cd758) at 0x23445f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2b8e0c0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
  CUR = 69
  LEN = 72

A sequence of bytes is probably what one expects for files without
an HTTP charset (e.g. served as application/xml).

Also, what happens if a file is sent as text/html with UTF-8 charset,
but isn't a valid UTF-8 file?
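
I do not know which of the following LWP would choose, but the two
options that the core Encode module itself offers for malformed input
can be sketched as below. This shows Encode's behavior only, on a body
shaped like the 2a/2h test files. It is not a statement about what
LWP::Simple::get does.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Body fragment as in the 2a/2h files: a lone 0xC3 followed by "<",
# which is not a valid UTF-8 sequence.
my $bytes = "post\xC3\xA9... \xC3</root>";

# Lenient decoding (the default): malformed bytes become U+FFFD.
my $lenient = decode('UTF-8', $bytes);

# Strict decoding: FB_CROAK makes decode() die on malformed input.
# It may also modify its argument, so work on a copy.
my $copy   = $bytes;
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };

print "lenient: ",
    (index($lenient, "\x{FFFD}") >= 0 ? "contains U+FFFD" : "no U+FFFD"), "\n";
print "strict:  ", (defined $strict ? "decoded" : "rejected"), "\n";
```

Either choice would be defensible if it were documented; silently
falling back to another charset is a third possibility.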

The problem with the 1h file may come from HTTP::Message: if
LWP::Simple::get uses decoded_content, the charset defaults to
whatever content_charset() guesses. Charset guessing should strictly
follow the explicit rules from

  http://www.w3.org/TR/REC-html40/charset.html#spec-char-encoding

to avoid inconsistencies like here.
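
As a rough sketch of the first two priorities listed in that section
(the charset parameter of the HTTP Content-Type field, then a META
declaration in the document), something like the following could be
used. The function name declared_charset and its regular expressions
are hypothetical and simplified; this is not the code of HTTP::Message,
and the third rule (a charset attribute on a linking element) is left
out. The point is only that the function never inspects the byte
content to guess.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the declared charset in lower case,
# or undef when nothing is declared. It never guesses from the bytes.
sub declared_charset {
    my ($content_type, $body) = @_;

    # 1. "charset" parameter of the HTTP Content-Type header field.
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i) {
        return lc $1;
    }

    # 2. META declaration with http-equiv="Content-Type" in the document.
    if ($body =~ /<meta\b[^>]*\bhttp-equiv\s*=\s*["']?content-type\b
                  [^>]*\bcharset\s*=\s*["']?([\w.:-]+)/ix) {
        return lc $1;
    }

    return;    # nothing declared: the caller decides, no byte sniffing
}

my $meta = '<meta http-equiv="Content-Type" '
         . 'content="text/html; charset=ISO-8859-5">';

for my $case (
    [ 'text/html; charset=UTF-8', '' ],
    [ 'text/html',                $meta ],
    [ 'text/html',                "<root>post\xC3\xA9</root>" ],
) {
    my $cs = declared_charset(@$case);
    print defined $cs ? $cs : '(none declared)', "\n";
}
```

With such a rule, a body that merely happens to be valid UTF-8 is
treated exactly like one that is not.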

-- System Information:
Debian Release: wheezy/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.39-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash

Versions of packages libwww-perl depends on:
ii  ca-certificates               20110502   Common CA certificates
ii  libencode-locale-perl         1.02-1     utility to determine the locale en
ii  libfile-listing-perl          6.01-1     module to parse directory listings
ii  libhtml-parser-perl           3.68-1+b1  collection of modules that parse H
ii  libhtml-tagset-perl           3.20-2     Data tables pertaining to HTML
ii  libhtml-tree-perl             4.2-1      Perl module to represent and creat
ii  libhttp-cookies-perl          6.00-2     HTTP cookie jars
ii  libhttp-date-perl             6.00-1     module of date conversion routines
ii  libhttp-message-perl          6.01-1     perl interface to HTTP style messa
ii  libhttp-negotiate-perl        6.00-2     implementation of content negotiat
ii  liblwp-mediatypes-perl        6.01-1     module to guess media type for a f
ii  liblwp-protocol-https-perl    6.02-1     https driver for LWP::UserAgent
ii  libnet-http-perl              6.01-1     module providing low-level HTTP co
ii  liburi-perl                   1.58-1     module to manipulate and access UR
ii  libwww-robotrules-perl        6.01-1     database of robots.txt-derived per
ii  netbase                       4.46       Basic TCP/IP networking system
ii  perl                          5.12.4-1   Larry Wall's Practical Extraction 

Versions of packages libwww-perl recommends:
ii  libauthen-ntlm-perl           1.08-1     authentication module for NTLM
ii  libhtml-form-perl             6.00-1     module that represents an HTML for
pn  libhtml-format-perl           <none>     (no description available)
ii  libhttp-daemon-perl           6.00-1     simple http server class
ii  libmailtools-perl             2.08-1     Manipulate email in perl programs

libwww-perl suggests no packages.

-- no debconf information




