classification
Title: print fails on unicode '\udce5' surrogates not allowed
Type: behavior Stage: committed/rejected
Components: Unicode Versions: Python 3.1
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Atle.Pedersen, ezio.melotti, pitrou
Priority: normal Keywords:

Created on 2012-01-05 20:12 by Atle.Pedersen, last changed 2012-01-08 21:41 by Atle.Pedersen. This issue is now closed.

Messages (4)
msg150684 - Author: Atle Pedersen (Atle.Pedersen) Date: 2012-01-05 20:12
I've made a short program to traverse file tree and print file names.

import os

for root, dirs, files in os.walk(path):
    for f in files:
        # Hex code point of each character in the file name
        hexa = ' '.join("%02X" % ord(x) for x in f)
        print('file is', hexa, f)

This fails with the following file:

file is 67 72 DCE5 6B 61 6C 6C 65 6E 2E 6A 70 67 2E 68 74 6D 6C Traceback (most recent call last):
  File "/home/atle/bin/findpictures.py", line 16, in <module>
    print('file is',hexa,f)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce5' in position 2: surrogates not allowed

I don't really understand the issue, but this works with Python 2, and fails using 3.1.4 (gentoo: dev-lang/python-3.1.4-r3)

Same code using Python 2.7.2 gives:
('file is', '67 72 E5 6B 61 6C 6C 65 6E 2E 6A 70 67 2E 68 74 6D 6C', 'gr\xe5kallen.jpg.html')
msg150685 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-01-05 20:23
On Python 3, os.walk() uses the surrogateescape error handler.  If the filename is in e.g. iso-8859-* and the filesystem encoding is UTF-8, decoding '\xe5' will then result in '\udce5', and '\udce5' can't then be printed because it's a lone surrogate.

See also http://docs.python.org/dev/library/os.html#file-names-command-line-arguments-and-environment-variables
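The decoding step described above can be reproduced without touching the filesystem. This is a minimal sketch, assuming the file name on disk was encoded as ISO-8859-1 (the report does not state the encoding; the byte string `raw` is reconstructed from the hex dump in the first message):

```python
# Bytes of the file name as assumed to be stored on disk (0xE5 is "å" in ISO-8859-1).
raw = b"gr\xe5kallen.jpg.html"

# Decoding as UTF-8 with surrogateescape maps the undecodable byte 0xE5
# to the lone surrogate U+DCE5 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# A strict UTF-8 encode rejects the lone surrogate; print() to a UTF-8
# stream with strict error handling performs an equivalent encode.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The round trip is the point of PEP 383: the string is not printable, but it still identifies the file.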
msg150686 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 20:23
The file tree contains a file which has an undecodable character in it. It ends up mangled as specified in PEP 383.
Printing such filenames is not directly supported (since they have invalid characters in them), but you can work around it in several ways, for example by escaping all non-ASCII chars: `print(ascii(f))`.

(note that opening the file will still work fine; only outputting the filename without special care will fail)
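A few ways to produce a printable form of such a name are sketched below. Only the `ascii()` option comes from the message above; the other two are common alternatives, and the ISO-8859-1 guess in the second one is an assumption about this particular file, not something the tracker confirms:

```python
# File name as os.walk() would return it in the report (lone surrogate at index 2).
name = "gr\udce5kallen.jpg.html"

# Option 1 (from the message above): escape every non-ASCII character.
escaped = ascii(name)

# Option 2: recover the raw bytes, then decode with the encoding the file
# name was probably written in (ISO-8859-1 is an assumption here).
guessed = name.encode("utf-8", "surrogateescape").decode("iso-8859-1")

# Option 3: substitute "?" for anything that cannot be encoded.
lossy = name.encode("utf-8", "replace").decode("utf-8")
```

Options 1 and 3 always succeed; option 2 gives the nicest output but only when the encoding guess is right. In every case the original `name` should still be the one passed to `open()`.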

Python 2 is different since it doesn't attempt to decode filenames at all, it just treats them as opaque bytes.
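Python 3 can be asked for the same opaque-bytes behaviour: the `os` functions return `bytes` names when given a `bytes` path. A minimal sketch follows; the helper name `list_files_as_bytes` is hypothetical:

```python
import os

def list_files_as_bytes(top):
    """Yield the paths of files under top as bytes, with no decoding step."""
    # Passing a bytes path makes os.walk() return bytes names, so no
    # surrogate escapes are ever introduced.
    for root, dirs, files in os.walk(os.fsencode(top)):
        for f in files:
            yield os.path.join(root, f)
```

The resulting `bytes` paths can be passed straight to `open()`. Displaying them still requires choosing an encoding, or printing their `repr()`.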
msg150910 - Author: Atle Pedersen (Atle.Pedersen) Date: 2012-01-08 21:41
Just wanted to say thanks for the very fast and informative response.

I respect your decision to close the bug as invalid. But my five cents is that it still feels like a bug, something that shouldn't happen. Especially since it's part of a very basic function, and very unpredictable for inexperienced Python programmers.

I do understand your headache. I've had my share of character set issues in my time.

But thanks again for the quick reply, and suggested workarounds, which will work well for me and my situation.
History
Date User Action Args
2012-01-08 21:41:17 Atle.Pedersen set messages: + msg150910
2012-01-05 20:23:42 pitrou set nosy: + pitrou; messages: + msg150686
2012-01-05 20:23:12 ezio.melotti set status: open -> closed; resolution: not a bug; messages: + msg150685; stage: committed/rejected
2012-01-05 20:12:52 Atle.Pedersen create