[Pkg-haskell-maintainers] Bug#748125: System.Posix.Directory.readDirStream can return strings that S.P.Files.getFileStatus cannot use

Robert Bihlmeyer r.bihlmeyer at arrowecs.at
Thu May 15 08:34:55 UTC 2014


Hi,

Joachim Breitner <nomeata at debian.org> writes:

> On Wednesday, 14.05.2014, at 17:00 +0200, Robert Bihlmeyer wrote:
>> I don't think running a program with LC_CTYPE=*.UTF-8 means that all
>> filenames that it encounters have to be valid UTF-8.
>
> The problem is: what else should they be? The "String" type represents
> Unicode characters, so using it for a file name requires the name's
> bytes to be decoded somehow.

I agree, there is no straightforward solution.

Interestingly, most of the invalid UTF-8 I tried survived the round trip
through String. Outputting such a String does not work, but I wouldn't
expect it to. getFileStatus, however, accepts the String and stats the
right file (this can be verified with "strace -fe stat", for example).
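
For reference, a minimal sketch of this kind of round-trip test. This is
not the original dirtest.hs; the helper name statAll is made up for
illustration. It reads each entry name with readDirStream and records
whether getFileStatus accepts the same String. It prints with "print"
rather than putStrLn so that undecodable names are shown escaped instead
of failing on output.

```haskell
import Control.Exception (IOException, bracket, try)
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)
import System.Posix.Files (FileStatus, getFileStatus)

-- Hypothetical helper (not the original dirtest.hs): list a directory and
-- record, for each entry, whether getFileStatus accepts the returned name.
statAll :: FilePath -> IO [(FilePath, Bool)]
statAll dir = bracket (openDirStream dir) closeDirStream (loop [])
  where
    loop acc ds = do
      name <- readDirStream ds -- an empty name signals the end of the stream
      if null name
        then return (reverse acc)
        else do
          r <- try (getFileStatus (dir ++ "/" ++ name))
                 :: IO (Either IOException FileStatus)
          loop ((name, either (const False) (const True) r) : acc) ds

main :: IO ()
main = statAll "." >>= mapM_ print
```

An entry paired with False is a name that did not survive the round trip
(or a dangling symlink, which this sketch does not distinguish).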

So far I have found exactly one class of byte sequences that does not
work: illegal (sub-optimal, i.e. overlong) encodings of ASCII characters.
The attached tar contains a filename with the two bytes C0 and B7
followed by '.txt'. C0 B7 is an invalid encoding of 0x37, i.e. '7'.
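
To make the arithmetic concrete, here is a small sketch of a lenient
two-byte decoder that only strips the UTF-8 marker bits and does not
reject overlong forms. The name naiveDecode2 is made up for illustration;
this is not GHC's decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical lenient decoder for a 2-byte sequence 110xxxxx 10yyyyyy:
-- keep the 5 payload bits of the lead byte and the 6 payload bits of the
-- continuation byte, without checking for overlong forms.
naiveDecode2 :: Word8 -> Word8 -> Char
naiveDecode2 b1 b2 =
  chr (((fromIntegral b1 .&. 0x1F) `shiftL` 6) .|. (fromIntegral b2 .&. 0x3F))

main :: IO ()
main = do
  print (naiveDecode2 0xC0 0xB7) -- '7'    (overlong form of 0x37)
  print (naiveDecode2 0xC3 0xB6) -- '\246' (U+00F6, the regular encoding)
```

Once the result is stored as the plain character '7', re-encoding it
yields the single byte 37, so the original two bytes are lost.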

It looks like GHC accepts the invalid encoding and stores the result as
the normal character '7'. The error points in this direction:

  dirtest.hs: 7.txt: getFileStatus: does not exist (No such file or directory)

By contrast, a sub-optimal (overlong) encoding of 'ö' (U+00F6) as
E0 83 B6 works fine, as do the numerous other illegal combinations of
high-bit-set bytes I tried.

So my assumption is that there is special casing if the result of UTF-8
decoding is an ASCII character.

> I guess the solution, which you have found already, for uses where
> arbitrary filenames need to work is to use a type that is meant for
> that, i.e. ByteString.
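
A minimal sketch of that approach, assuming a version of the unix package
that ships the System.Posix.*.ByteString modules. The helper name listRaw
is made up for illustration. Names are kept as raw bytes end to end, so no
decoding step can alter them.

```haskell
import Control.Exception (bracket)
import Control.Monad (forM_)
import qualified Data.ByteString.Char8 as B
import System.Posix.Directory.ByteString
  (closeDirStream, openDirStream, readDirStream)
import System.Posix.Files.ByteString (fileSize, getSymbolicLinkStatus)

-- Hypothetical helper: return the entry names of a directory as raw bytes.
listRaw :: B.ByteString -> IO [B.ByteString]
listRaw dir = bracket (openDirStream dir) closeDirStream (loop [])
  where
    loop acc ds = do
      name <- readDirStream ds -- an empty name signals the end of the stream
      if B.null name then return (reverse acc) else loop (name : acc) ds

main :: IO ()
main = do
  let dir = B.pack "."
  names <- listRaw dir
  forM_ names $ \n -> do
    -- lstat rather than stat, so a dangling symlink does not abort the run
    st <- getSymbolicLinkStatus (B.concat [dir, B.pack "/", n])
    print (n, fileSize st)
```

getFileStatus from the same module takes the raw path in the same way.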

Maybe deprecating the interfaces that assume UTF-8-clean filenames is
the solution. Unfortunately, one still can't assume that all the world
is UTF-8.

But most illegal sequences are round-trippable -- e.g. the E0 83 B6 from above
is not re-encoded/corrected to C3 B6. Therefore, my question is whether
the ASCII special case could be removed.
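
For illustration, a sketch of the check a strict decoder could apply so
that both cases are treated alike. It follows the UTF-8 well-formedness
rules: lead bytes C0 and C1 never occur, and E0 must be followed by
A0..BF. The name isOverlong is made up, and this is not a claim about how
GHC implements or would fix its decoder.

```haskell
import Data.Word (Word8)

-- Hypothetical check: does the byte sequence start with an overlong
-- 2-byte or 3-byte UTF-8 form?
isOverlong :: [Word8] -> Bool
isOverlong (b1 : _)
  | b1 == 0xC0 || b1 == 0xC1 = True -- 2-byte form of a 7-bit value
isOverlong (0xE0 : b2 : _) = b2 >= 0x80 && b2 <= 0x9F -- 3-byte form of a value below U+0800
isOverlong _ = False

main :: IO ()
main = do
  print (isOverlong [0xC0, 0xB7])       -- True
  print (isOverlong [0xE0, 0x83, 0xB6]) -- True
  print (isOverlong [0xC3, 0xB6])       -- False
```

A decoder that rejects C0 B7 in the same way as E0 83 B6 could then pass
both through whatever mechanism already preserves the other invalid bytes.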

br,
-- 
Robert Bihlmeyer    ASSIST    Arrow ECS Internet Security AG
<r.bihlmeyer at arrowecs.at>   A-1100 Wien, Wienerbergstraße 11
Tel: +43 1 370 94 40                Fax: +43 1 370 94 40-333


