Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Vincent Lefevre
vincent at vinc17.net
Wed Oct 22 12:13:17 UTC 2014
Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks parse_file
As a consequence of this bug, html2xhtml doesn't work at all when
applied to a file. There are no problems when the HTML document is
provided on standard input, though. For instance, with test.html as:
<!DOCTYPE html>
<html><body><p>Test €</p></body></html>
I get:
$ html2xhtml test.html
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test €</p>
</body></html>
and with test.html as:
<!DOCTYPE html>
<html><body><p>Test é</p></body></html>
$ html2xhtml test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test �</p>
</body></html>
$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test é</p>
</body></html>
parse_file is used in the first command of each test (file name given
as an argument, as in my original bug report), and parse_string is used
in the second (document read from standard input). Thus it seems that
it is parse_file that is broken. Hence the retitle.
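
The comparison above can be scripted. The following is only a sketch,
assuming html2xhtml is on the PATH. The file names euro.html and
eacute.html are illustrative and do not come from the report. The test
documents are written as explicit UTF-8 bytes so that the result does
not depend on the terminal or editor encoding.

```shell
#!/bin/sh
# Sketch: recreate the two test documents and compare html2xhtml's output
# for a file argument versus standard input.
# euro.html and eacute.html are hypothetical names chosen for illustration.

dir=$(mktemp -d) || exit 1

# U+20AC (euro sign) is E2 82 AC in UTF-8; U+00E9 (e acute) is C3 A9.
printf '<!DOCTYPE html>\n<html><body><p>Test \342\202\254</p></body></html>\n' > "$dir/euro.html"
printf '<!DOCTYPE html>\n<html><body><p>Test \303\251</p></body></html>\n' > "$dir/eacute.html"

if command -v html2xhtml >/dev/null 2>&1; then
    for f in "$dir/euro.html" "$dir/eacute.html"; do
        # File argument versus standard input, as in the report.
        html2xhtml "$f"   > "$f.from-file"  2>&1
        html2xhtml < "$f" > "$f.from-stdin" 2>&1
        if cmp -s "$f.from-file" "$f.from-stdin"; then
            echo "$f: same output"
        else
            echo "$f: output differs between file argument and stdin"
        fi
    done
else
    echo "html2xhtml not found; test files left in $dir"
fi
```

On an affected version, both files would be expected to report a
difference, since the file-argument output is wrong in both cases shown
above.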
--
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)