Issue 13717: print fails on unicode '\udce5' surrogates not allowed


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57926

classification

Title:	print fails on unicode '\udce5' surrogates not allowed
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.1

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	Atle.Pedersen, ezio.melotti, pitrou
Priority:	normal	Keywords:

Created on 2012-01-05 20:12 by Atle.Pedersen, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg150684	Author: Atle Pedersen (Atle.Pedersen)	Date: 2012-01-05 20:12
I've made a short program to traverse file tree and print file names.

    for root, dirs, files in os.walk(path):
        for f in files:
            hex = ' '.join(["%02X"%ord(x) for x in f])
            print('file is',hex,f)

This fails with the following file:

    file is 67 72 DCE5 6B 61 6C 6C 65 6E 2E 6A 70 67 2E 68 74 6D 6C
    Traceback (most recent call last):
      File "/home/atle/bin/findpictures.py", line 16, in <module>
        print('file is',hexa,f)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce5' in position 2: surrogates not allowed

I don't really understand the issue, but this works with Python 2, and fails using 3.1.4 (gentoo: dev-lang/python-3.1.4-r3)

Same code using Python 2.7.2 gives:

    ('file is', '67 72 E5 6B 61 6C 6C 65 6E 2E 6A 70 67 2E 68 74 6D 6C', 'gr\xe5kallen.jpg.html')
msg150685	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-01-05 20:23
On Python 3, os.walk() uses the surrogateescape error handler to decode filenames. If the filename is encoded in e.g. iso-8859-* and the filesystem encoding is UTF-8, the byte '\xe5' can't be decoded as UTF-8 and is mapped to '\udce5' instead. '\udce5' can't then be printed because it's a lone surrogate. See also http://docs.python.org/dev/library/os.html#file-names-command-line-arguments-and-environment-variables
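The decoding step described above can be reproduced without touching the filesystem. This is a minimal sketch, assuming the file name from the report was stored as ISO-8859-1 bytes and the filesystem encoding is UTF-8; the variable names are only illustrative.

```python
# Assumption: these are the raw bytes of the reported file name on disk (ISO-8859-1).
raw = b"gr\xe5kallen.jpg.html"

# Decoding as UTF-8 with surrogateescape maps the undecodable byte 0xE5
# to the lone surrogate U+DCE5 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")
hex_codes = " ".join("%02X" % ord(ch) for ch in name)

# A strict UTF-8 encode (what print() needs on a UTF-8 terminal) rejects it.
try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason

# Encoding with the same handler restores the original bytes, which is why
# opening the file by this name still works.
round_trip = name.encode("utf-8", "surrogateescape")

print(hex_codes)
print(strict_error)
print(round_trip == raw)
```

The hex dump matches the one in the original report, with DCE5 in third position.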
msg150686	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-05 20:23
The file tree contains a file which has an undecodable character in it. It ends up mangled as specified in PEP 383. Printing such filenames is not directly supported (since they have invalid characters in them), but you can work around it in several ways, for example by escaping all non-ASCII chars: `print(ascii(f))`. (Note that opening the file will still work fine; only outputting the filename without special care will fail.) Python 2 is different since it doesn't attempt to decode filenames at all; it just treats them as opaque bytes.
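A few of the possible workarounds, sketched side by side. `ascii()` is the one suggested above; the others are alternatives built on standard codec error handlers. The helper name `printable` is hypothetical, and the ISO-8859-1 guess in the last line is an assumption about how the file was originally named.

```python
def printable(name, errors="backslashreplace"):
    """Return a copy of name that can be encoded as strict UTF-8 (hypothetical helper)."""
    return name.encode("utf-8", errors).decode("utf-8")

# The file name as os.walk() would return it in the report.
name = "gr\udce5kallen.jpg.html"

# 1. Escape everything non-ASCII; the result is a quoted repr-style string.
escaped = ascii(name)

# 2. Escape only what UTF-8 cannot encode, leaving other characters intact.
backslashed = printable(name)

# 3. Recover the raw bytes, then decode again with U+FFFD for the bad byte.
raw = name.encode("utf-8", "surrogateescape")
replaced = raw.decode("utf-8", "replace")

# 4. Recover the raw bytes and decode with a guessed encoding (assumption: ISO-8859-1).
guessed = raw.decode("iso-8859-1")

print(escaped)
print(backslashed)
```

Options 1 and 2 are lossless and safe for logs; options 3 and 4 read better for people but either drop information or depend on guessing the right encoding.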
msg150910	Author: Atle Pedersen (Atle.Pedersen)	Date: 2012-01-08 21:41
Just wanted to say thanks for the very fast response and the informative answers. I respect your decision to close the bug as invalid. But my five cents is that it still feels like a bug, something that shouldn't happen, especially since it's part of a very basic function and very unpredictable for inexperienced Python programmers. I do understand your headache. I've had my share of character set issues in my time. But thanks again for the quick reply and the suggested workarounds, which will work well for me and my situation.

History
Date	User	Action	Args
2022-04-11 14:57:25	admin	set	github: 57926
2012-01-08 21:41:17	Atle.Pedersen	set	messages: + msg150910
2012-01-05 20:23:42	pitrou	set	nosy: + pitrou messages: + msg150686
2012-01-05 20:23:12	ezio.melotti	set	status: open -> closed resolution: not a bug messages: + msg150685 stage: resolved
2012-01-05 20:12:52	Atle.Pedersen	create