Bug#898022: diffoscope: Traceback when comparing paths with invalid unicode characters

Mattia Rizzolo mattia at debian.org
Thu May 10 16:36:22 BST 2018


Control: tag -1 patch

On Sun, May 06, 2018 at 01:38:58AM +0100, Chris Lamb wrote:
> This is via <https://github.com/lamby/trydiffoscope/issues/35>, but I
> think the bug is in diffoscope itself.

It is, although one could say it's a bug in argparse.


> However, I can't seem to minimally reproduce with file by itself:
> 
>   import magic
>   filename = b'\xf0\x28\x8c\x28'
>   with open(filename, 'w'):
>       pass
>   m = magic.open(magic.NONE)
>   m.load()
>   m.file(filename)

That's because argparse decodes the arguments. You can get the same
traceback by using this instead of the last command above:

|>>> m.file(filename.decode('utf-8', errors='surrogateescape'))
|Traceback (most recent call last):
|  File "<stdin>", line 1, in <module>
|  File "/usr/lib/python3/dist-packages/magic/compat.py", line 148, in file
|    return Magic.__tostr(_file(self._magic_t, Magic.__tobytes(filename)))
|  File "/usr/lib/python3/dist-packages/magic/compat.py", line 138, in __tobytes
|    return bytes(b, 'utf-8')
|UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 0: surrogates not allowed
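
The same failure can be shown without libmagic at all. The snippet below is
only a minimal stdlib sketch: the names raw, text and strict_encode_fails are
made up for illustration, and the decode step merely approximates what happens
to a command-line argument before it reaches the call above.

```python
# Same invalid UTF-8 byte sequence as the filename in the quoted example.
raw = b'\xf0\x28\x8c\x28'

# Undecodable bytes are turned into lone surrogates (see PEP 383).
text = raw.decode('utf-8', errors='surrogateescape')


def strict_encode_fails(s):
    """Return True if a strict UTF-8 encode raises, as bytes(b, 'utf-8') does."""
    try:
        bytes(s, 'utf-8')
    except UnicodeEncodeError:
        return True
    return False


# The strict encode is what fails in the traceback above.
failed = strict_encode_fails(text)

# Encoding with the same error handler gives back the original bytes.
restored = text.encode('utf-8', errors='surrogateescape')
```

Encoding with surrogateescape reverses the decode exactly, so the bytes that
end up at the C library are the ones that are actually on disk.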

What do you think about using:
|>>> m.file(f.encode('utf-8', errors='surrogateescape'))
in that place, where f is the already-decoded (str) path?

I.e., the following patch would fix this bug for me.
See also:
https://www.python.org/dev/peps/pep-0383/
https://bugs.python.org/issue21416

|diff --git a/diffoscope/comparators/utils/file.py b/diffoscope/comparators/utils/file.py
|index 4fd49ac..0638ef4 100644
|--- a/diffoscope/comparators/utils/file.py
|+++ b/diffoscope/comparators/utils/file.py
|@@ -68,7 +68,7 @@ class File(object, metaclass=abc.ABCMeta):
|             if not hasattr(self, '_mimedb'):
|                 self._mimedb = magic.open(magic.NONE)
|                 self._mimedb.load()
|-            return self._mimedb.file(path)
|+            return self._mimedb.file(path.encode('utf-8', errors='surrogateescape'))
|
|         @classmethod
|         def guess_encoding(self, path):

Do you think this would be fine?
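
If path could ever already be bytes at that point (an assumption, not
something established above), calling .encode() on it would raise
AttributeError. A slightly more defensive variant might look like the sketch
below; to_magic_path is a hypothetical helper name and is not part of
diffoscope or python-magic.

```python
def to_magic_path(path):
    """Return a bytes path suitable for passing to a bytes-accepting API.

    Hypothetical helper: bytes are passed through unchanged, str is encoded
    with surrogateescape so undecodable filename bytes are restored.
    """
    if isinstance(path, bytes):
        return path
    return path.encode('utf-8', errors='surrogateescape')


# Illustration with the filename from the report.
raw = b'\xf0\x28\x8c\x28'
decoded = raw.decode('utf-8', errors='surrogateescape')
from_str = to_magic_path(decoded)
from_bytes = to_magic_path(raw)
```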

-- 
regards,
                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540      .''`.
more about me:  https://mapreri.org                             : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-