Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser

Vincent Lefevre vincent at vinc17.net
Wed Oct 22 12:13:17 UTC 2014


Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks parse_file

As a consequence of this bug, html2xhtml doesn't work at all when
applied to a file. There is no problem when the HTML document is
provided on standard input, though. For instance, with test.html as:

<!DOCTYPE html>
<html><body><p>Test €</p></body></html>

I get:

$ html2xhtml test.html
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test €</p>
</body></html>

and with test.html as:

<!DOCTYPE html>
<html><body><p>Test é</p></body></html>

$ html2xhtml test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test �</p>
</body></html>

$ html2xhtml < test.html
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test é</p>
</body></html>

parse_file is used when the file name is given as an argument (as in
my original bug report), and parse_string is used when the document is
read from standard input. Thus it seems that it is parse_file that is
broken. Hence the retitle.
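
To rule out the editor or the locale as a factor, the two test files can
be written with explicit UTF-8 bytes before comparing both invocations.
This is only a minimal sketch, assuming a POSIX shell; the file names
euro.html and eacute.html and the temporary directory are chosen for
illustration and are not part of the original tests. The html2xhtml runs
are skipped if the command is not installed.

```shell
#!/bin/sh
# Recreate both test documents byte for byte, so that the result does not
# depend on the editor or terminal encoding.
dir=$(mktemp -d)

# U+20AC EURO SIGN is E2 82 AC in UTF-8 (octal 342 202 254).
printf '<!DOCTYPE html>\n<html><body><p>Test \342\202\254</p></body></html>\n' \
    > "$dir/euro.html"

# U+00E9 LATIN SMALL LETTER E WITH ACUTE is C3 A9 in UTF-8 (octal 303 251).
printf '<!DOCTYPE html>\n<html><body><p>Test \303\251</p></body></html>\n' \
    > "$dir/eacute.html"

# Compare the file-argument path with the standard-input path.
if command -v html2xhtml >/dev/null 2>&1; then
    for f in "$dir/euro.html" "$dir/eacute.html"; do
        echo "== $f given as argument =="
        html2xhtml "$f"
        echo "== $f given on standard input =="
        html2xhtml < "$f"
    done
fi
```

If the two outputs differ for the same file, the difference comes from how
the input is read and not from the document itself.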

-- 
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


