Issue 25880: codecs should raise specific UnicodeDecodeError/UnicodeEncodeError rather than just UnicodeError

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70068

classification

Title:	codecs should raise specific UnicodeDecodeError/UnicodeEncodeError rather than just UnicodeError
Type:		Stage:	resolved
Components:	Unicode	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	SilentGhost, ezio.melotti, lemburg, loewis, r.david.murray, serhiy.storchaka, spaceone, vstinner
Priority:	normal	Keywords:

Created on 2015-12-16 08:03 by spaceone, last changed 2022-04-11 14:58 by admin.

Messages (12)
msg256514 - (view)	Author: SpaceOne (spaceone) *	Date: 2015-12-16 08:03
Python 3.4.2 (default, Oct 8 2014, 10:45:20)
>>> u'..'.encode('idna')
Traceback (most recent call last):
  File "/usr/lib/python3.4/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

→ I was expecting this either not to raise at all or to raise UnicodeEncodeError.

>>> b'..'.decode('idna')
'..'

→ Why doesn't this raise then, too?

The error message is also messed up, which wasn't the case in python 2.7. It could be cleaned up.
msg256519 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-16 14:01
The error message is accurate. That string has empty label segments in it, which RFC 5890 defines as an error on encoding. There is no such error defined for decoding, so that doesn't raise an error. I don't see anything wrong with the error message, it includes the same one as raised in python2. Perhaps you are confused by the error chaining introduced in Python3? The second part of the traceback is coming from the encoding machinery, while the first part lets you know where in the encoder the error was raised. In this case having both doesn't provide much additional information, but if one was debugging a codec or the error were coming from inside an application, it would.
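The two-part traceback described above is what Python 3's `raise ... from ...` produces. The following is only a minimal sketch of that mechanism: `inner_encode`, `outer_encode` and the codec name 'demo' are hypothetical stand-ins for the codec and the encoding machinery, not the actual CPython code.

```python
def inner_encode(label):
    # Hypothetical stand-in for the codec itself: rejects an empty label.
    if not label:
        raise UnicodeError("label empty or too long")
    return label.encode("ascii")


def outer_encode(label):
    # Hypothetical stand-in for the encoding machinery: re-raises with
    # extra context and keeps the original exception as __cause__.
    try:
        return inner_encode(label)
    except UnicodeError as exc:
        raise UnicodeError(
            "encoding with 'demo' codec failed (%s: %s)"
            % (type(exc).__name__, exc)
        ) from exc


try:
    outer_encode("")
except UnicodeError as err:
    caught = err

print(caught)                  # the outer, wrapping error
print(repr(caught.__cause__))  # the inner error raised by the "codec"
```

An uncaught exception of this shape prints both tracebacks, joined by "The above exception was the direct cause of the following exception".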
msg256520 - (view)	Author: SpaceOne (spaceone) *	Date: 2015-12-16 14:05
But why is the error UnicodeError instead of UnicodeEncodeError?
msg256521 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-16 14:21
Why does it matter? If you want to suggest changing it, you could propose a patch. Maybe in reading the code you'll find out why it is the way it is now. I haven't looked at that code in a while myself, so I don't remember if there is a reason or not :)
msg256594 - (view)	Author: SpaceOne (spaceone) *	Date: 2015-12-17 10:17
It makes error handling really hard. Here is a patch: https://github.com/python/cpython/compare/master...spaceone:idna?expand=1
msg256605 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-17 13:39
Can you explain why it makes error handling hard? I'm still not seeing the use case. I've always viewed UnicodeEncodeError vs UnicodeDecodeError as "extra" information for the consumer of the error message, not something that matters in code (I just catch UnicodeError). I'm not objecting to the change, but it might be nice to know why Martin chose plain UnicodeError, if he's got the time to answer.
msg256606 - (view)	Author: SpaceOne (spaceone) *	Date: 2015-12-17 13:42
Because everywhere I use this, I have to write:

try:
    user_input.encode(encoding)
except UnicodeDecodeError:
    raise
except (UnicodeError, UnicodeEncodeError):
    do_my_error_handling()

instead of:

try:
    user_input.encode(encoding)
except UnicodeEncodeError:
    do_my_error_handling()
msg256608 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2015-12-17 16:26
I think what David was trying to say is that you could do

try:
    user_input.encode(encoding)
except UnicodeError:
    do_my_error_handling()

since UnicodeError is a super class of UnicodeDecodeError and UnicodeEncodeError.
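The class relationship mentioned here can be checked directly. A small sketch, where `classify` is a hypothetical helper written only for this illustration; it shows that a handler for UnicodeError catches both specific subclasses unless they are listed first:

```python
# Both specific errors derive from UnicodeError.
print(issubclass(UnicodeEncodeError, UnicodeError))  # True
print(issubclass(UnicodeDecodeError, UnicodeError))  # True


def classify(func):
    """Run func() and report which except clause handled the failure."""
    try:
        func()
    except UnicodeEncodeError:
        return "encode"
    except UnicodeDecodeError:
        return "decode"
    except UnicodeError:
        # Reached only for a bare UnicodeError, because the more
        # specific subclasses are handled above.
        return "bare"
    return "ok"


print(classify(lambda: "\xe9".encode("ascii")))   # encode
print(classify(lambda: b"\xff".decode("utf-8")))  # decode
```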
msg256609 - (view)	Author: SpaceOne (spaceone) *	Date: 2015-12-17 16:35
I know that UnicodeEncodeError is a subclass of UnicodeError. The problem here is that UnicodeError would also catch UnicodeDecodeError. This is especially disturbing if you catch errors of a whole function. If you e.g. use python2.7 you might want to catch only UnicodeEncodeError if you encode something and don't want to catch UnicodeDecodeError.

>>> b'\xff'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

(Read that code carefully!!! It's not something which should ever be done but might happen in the world)

Especially if you are writing python2+3 compatible applications.
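Until codecs raise the specific classes themselves, one possible workaround is to normalise the bare error at the call site. This is only a sketch: `encode_strict` is a hypothetical helper, not part of the standard library, and the start/end positions it reports simply span the whole input because the bare UnicodeError carries no position information.

```python
def encode_strict(text, encoding):
    """Encode text; turn a bare UnicodeError into UnicodeEncodeError."""
    try:
        return text.encode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already specific: let it propagate unchanged.
        raise
    except UnicodeError as exc:
        # Assumption: no position is known, so report the whole string.
        raise UnicodeEncodeError(
            encoding, text, 0, len(text), str(exc)
        ) from exc


try:
    encode_strict("..", "idna")
except UnicodeEncodeError as err:
    print("handled:", err.reason)
```

Callers can then write a single `except UnicodeEncodeError` clause, as in the shorter form requested above.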
msg256700 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 18:51
The bare UnicodeError is also raised by the following codecs: utf_16, utf_32, punycode, undefined, and the East Asian multibyte codecs, and by the undocumented and unused function urllib.urlparse.to_bytes(). I think it would be nice to be more specific if possible.
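To see which class a given codec actually raises on a particular interpreter, a small probe can help. `exact_error` is a hypothetical helper for this illustration; the result for 'idna' and 'undefined' may be either the bare or the specific class depending on the Python version, so no output is asserted here for them.

```python
def exact_error(text, encoding):
    """Return the exact exception class name raised by text.encode()."""
    try:
        text.encode(encoding)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# Inputs chosen so that each codec rejects them when encoding.
for sample, name in [("..", "idna"), ("abc", "undefined"), ("\xe9", "ascii")]:
    print(name, "->", exact_error(sample, name))
```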
msg256701 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-12-18 19:25
I wonder if we originally only had UnicodeError and it got split later but these codecs were never updated. The codecs date back to the start of unicode support in python2, I think. Adding MAL, he's likely to have an opinion on this ;) Oh, right. The more likely possibility is that there was (in python2) no way to know if the operation was (from the user's POV) encoding or decoding when the codec was called. In python3 we do know, when the codec is called via encode/decode, but the codecs are still generic in principle. So yeah, we need MAL's opinion. (Or, I could be completely confused, since I always found encode/decode confusing in python2 :)
msg256704 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2015-12-18 19:45
On 18.12.2015 20:25, R. David Murray wrote:
> I wonder if we originally only had UnicodeError and it got split later but these codecs were never updated. The codecs date back to the start of unicode support in python2, I think.

UnicodeDecodeError and UnicodeEncodeError were added in Python 2.3 as part of the more flexible error handlers.

> Adding MAL, he's likely to have an opinion on this ;)
>
> Oh, right. The more likely possibility is that there was (in python2) no way to know if the operation was (from the user's POV) encoding or decoding when the codec was called. In python3 we do know, when the codec is called via encode/decode, but the codecs are still generic in principle. So yeah, we need MAL's opinion. (Or, I could be completely confused, since I always found encode/decode confusing in python2 :)

There's a clear direction with codecs:
- encode: transform to the encoded data
- decode: transform back from the encoded data

Take e.g. the hex codec. It encodes data into hex format and decodes from hex format back into the original data.

The IDNA codec transforms Unicode domains into the IDNA format (.encode()) and back to Unicode again (.decode()). It was added in Python 2.3 as well, so I guess it was just an overlap/oversight that it was not adapted to the new error classes.
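The encode/decode direction described above can be illustrated with both codecs mentioned. A short sketch for Python 3, where the hex codec is reached through `codecs.encode()` and `codecs.decode()`; the domain name is an arbitrary example chosen for this illustration.

```python
import codecs

# hex codec: encode turns data into hex format, decode turns it back.
hexed = codecs.encode(b"data", "hex")
print(hexed)                        # b'64617461'
print(codecs.decode(hexed, "hex"))  # b'data'

# idna codec: encode turns a Unicode domain into its IDNA (ASCII) form,
# decode turns it back into Unicode.
ace = "b\u00fccher.example".encode("idna")
print(ace)                 # b'xn--bcher-kva.example'
print(ace.decode("idna"))  # bücher.example
```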

History
Date	User	Action	Args
2022-04-11 14:58:25	admin	set	github: 70068
2021-11-27 00:00:23	iritkatriel	set	title: u'..'.encode('idna') → UnicodeError: label empty or too long -> codecs should raise specific UnicodeDecodeError/UnicodeEncodeError rather than just UnicodeError versions: + Python 3.11, - Python 2.7, Python 3.4
2015-12-18 19:45:26	lemburg	set	messages: + msg256704
2015-12-18 19:25:11	r.david.murray	set	nosy: + lemburg messages: + msg256701
2015-12-18 18:51:49	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg256700
2015-12-17 16:35:29	spaceone	set	messages: + msg256609
2015-12-17 16:26:53	SilentGhost	set	nosy: + SilentGhost messages: + msg256608
2015-12-17 13:42:33	spaceone	set	messages: + msg256606
2015-12-17 13:39:31	r.david.murray	set	nosy: + loewis messages: + msg256605
2015-12-17 10:17:55	spaceone	set	status: closed -> open messages: + msg256594
2015-12-16 14:22:02	r.david.murray	set	status: open -> closed
2015-12-16 14:21:57	r.david.murray	set	messages: + msg256521
2015-12-16 14:05:02	spaceone	set	status: closed -> open resolution: not a bug -> messages: + msg256520
2015-12-16 14:01:39	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg256519 resolution: not a bug stage: resolved
2015-12-16 08:03:52	spaceone	create