Bug#798727: Encode::Unicode decode() dies unnecessarily

Fri Sep 11 23:40:29 UTC 2015

Package: perl
Version: 5.20.2-2

The Encode::Unicode documentation states the following:

When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to what
the BOM says. If no BOM is found, the routine dies.

To reproduce:
---
use Encode qw/decode/;
decode("utf-16be", "Hello World"); # does not die
decode("utf-16le", "Hello World"); # does not die
decode("utf-16", "\xFE\xFFHello World"); # does not die
decode("utf-16", "Hello World"); # dies with "UTF-16:Unrecognised BOM"
---

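This behaviour conflicts with both the Unicode Standard and RFC 2781,
which specify big-endian byte order when no BOM is present.

Unicode Standard version 8.0: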

The UTF-16 encoding scheme may or may not begin with a BOM. However,
when there is no BOM, and in the absence of a higher-level protocol, the
byte order of the UTF-16 encoding scheme is big-endian.

RFC 2781:

If the first two octets of the text is not 0xFE followed by
0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
interpreted as being big-endian.
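
The fallback rule quoted above can be sketched in standalone C. This is
only an illustration under stated assumptions, not the code in Unicode.xs:
the names utf16_order, utf16_detect_order and utf16_read_unit are
hypothetical, and the sketch covers UTF-16 only (not UTF-32).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of Encode's Unicode.xs. */
typedef enum { UTF16_BE, UTF16_LE } utf16_order;

/*
 * Decide the byte order of a UTF-16 byte string.
 * On return, *bom_len is the number of leading bytes (0 or 2) that form
 * a BOM and should be skipped before decoding.
 */
static utf16_order utf16_detect_order(const unsigned char *buf, size_t len,
                                      size_t *bom_len)
{
    assert(bom_len != NULL);
    assert(buf != NULL || len == 0);

    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {   /* big-endian BOM */
            *bom_len = 2;
            return UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {   /* little-endian BOM */
            *bom_len = 2;
            return UTF16_LE;
        }
    }
    /* No BOM: fall back to big-endian instead of failing, and do not
     * skip any bytes, since the first two octets are ordinary data. */
    return UTF16_BE;
}

/* Read one 16-bit code unit from p in the given byte order. */
static uint16_t utf16_read_unit(const unsigned char *p, utf16_order order)
{
    return order == UTF16_BE ? (uint16_t)((p[0] << 8) | p[1])
                             : (uint16_t)((p[1] << 8) | p[0]);
}
```

Note that in the no-BOM case the sketch leaves bom_len at 0, so the two
octets that were inspected are still decoded as the first code unit.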

A simple fix is to do nothing in the no-BOM branch instead of croaking:

diff --git a/cpan/Encode/Unicode/Unicode.xs b/cpan/Encode/Unicode/Unicode.xs
index cf42ab8..7caf1c1 100644
--- a/cpan/Encode/Unicode/Unicode.xs
+++ b/cpan/Encode/Unicode/Unicode.xs
@@ -164,9 +164,18 @@ CODE:
                endian = 'V';
            }
            else {
-               croak("%"SVf":Unrecognised BOM %"UVxf,
-                     *hv_fetch((HV *)SvRV(obj),"Name",4,0),
-                     bom);
+               /* No BOM found, use big-endian fallback as specified in
+                * RFC2781 and the Unicode Standard version 8.0:
+                *
+                *  The UTF-16 encoding scheme may or may not begin with
+                *  a BOM. However, when there is no BOM, and in the
+                *  absence of a higher-level protocol, the byte order
+                *  of the UTF-16 encoding scheme is big-endian.
+                *
+                *  If the first two octets of the text is not 0xFE
+                *  followed by 0xFF, and is not 0xFF followed by 0xFE,
+                *  then the text SHOULD be interpreted as big-endian.
+                */
            }
        }
 #if 1

CPAN bug report: https://rt.cpan.org/Ticket/Display.html?id=107043