Bug#1020574: perl-doc: encoding issue / spelling mistake with "perldoc perlfaq4"

Russ Allbery rra at debian.org
Fri Sep 23 18:13:50 BST 2022


Vincent Lefevre <vincent at vinc17.net> writes:

> "perldoc perlfaq4" gives in UTF-8 locales

> [...]
>     The trick to this problem is avoiding accidental autovivification. If
>     you want to check three keys deep, you might na<EF>vely try this:

> where <EF> is actually the EF byte as shown by the "less" pager.

This is an interesting bug.  I'm going to have to dig in a bit to figure
out what's going on here.  The POD source has naE<0xEF>vely, and for some
reason the output is ISO 8859-1 instead of UTF-8.
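
To make the byte-level difference concrete, here is a minimal sketch in
Python (chosen only for illustration; it is not what perldoc or Pod::Text
does internally).  It assumes nothing beyond the fact that the character
in question is U+00EF, and the variable names are made up for this example.

```python
# Illustration only: how U+00EF (i with diaeresis) is encoded in each charset.
word = "na\u00efvely"

latin1_bytes = word.encode("iso-8859-1")  # the accented letter is the single byte 0xEF
utf8_bytes = word.encode("utf-8")         # the accented letter is the two bytes 0xC3 0xAF

print(latin1_bytes.hex(" "))  # 6e 61 ef 76 65 6c 79
print(utf8_bytes.hex(" "))    # 6e 61 c3 af 76 65 6c 79

# A lone 0xEF followed by an ASCII byte is not valid UTF-8, which is
# consistent with a pager in a UTF-8 locale showing it as a raw <EF>.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(valid_as_utf8)  # False
```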

The underlying formatting module is Pod::Text, and it defaults to using
the same output character set as the input character set, which in this
case is not specified.  I think there may be an old default in play.

Pod::Man breaks here in a different way because it interprets the
diaeresis as a German umlaut and assumes you can just stick an e after it
if you don't have umlauts available.  My understanding is that this German
umlaut conversion is only correct for ä, ö, and ü, not for ï (which I
don't believe is a character in German, at least from some quick
searching).  I think this may be a very long-standing bug, although
there's a deeper problem that one cannot assume German umlaut rules.  It
depends very much on the source language.
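
As a rough illustration of why the fallback has to depend on the source
language, here is a hypothetical Python sketch.  It is not Pod::Man's
actual code; the function name ascii_fallback, the german flag, and the
mapping table are all assumptions made up for this example.

```python
import unicodedata

# German convention: only these letters become "vowel + e".
GERMAN_UMLAUTS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def ascii_fallback(text, german=False):
    """Hypothetical fallback: drop combining marks, except that German
    umlauts become 'vowel + e' when german=True."""
    out = []
    for ch in text:
        if german and ch in GERMAN_UMLAUTS:
            out.append(GERMAN_UMLAUTS[ch])
            continue
        # Decompose into base letter + combining marks, then drop the marks.
        # Characters without a decomposition pass through unchanged.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(ascii_fallback("naïvely"))              # naively
print(ascii_fallback("Müller", german=True))  # Mueller
print(ascii_fallback("Zoë", german=True))     # Zoe
```

Even with the flag set, the e is appended only for the three German
umlaut vowels, so an English or French ï or ë simply loses its diaeresis.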

> This should be encoded in UTF-8. However, this is a spelling mistake:
> contrary to French, there is no ï in English (at least, my dictionaries
> cannot find such a variant): naively.

naïvely (and naïve) are correct alternate spellings in English.  English
historically uses a diaeresis to indicate that two adjacent vowels form
separate syllables rather than a diphthong.  This is one of the few
"native" accent marks in the English language, which otherwise only uses
accent marks in loan words and tends to drop them.

It's common in modern English writing to drop the diaeresis, in part
because US English keyboards tend to make typing them difficult, so both
usages are now accepted, but there is a school of thought that the version
with the diaeresis is more correct.  The New Yorker famously insists on
diaereses in its house style, even going so far as to use coöperate when
every other publication has switched to cooperate:

https://www.merriam-webster.com/words-at-play/mary-norris-diaeresis

The other place you'll sometimes see diaereses in English is with proper
names such as Chloë or Zoë.

-- 
Russ Allbery (rra at debian.org)              <https://www.eyrie.org/~eagle/>



