Message 356610 - Python tracker



Author	pgimeno
Recipients	pgimeno
Date	2019-11-14.14:30:24
Message-id	<1573741824.98.0.130776319775.issue38800@roundup.psfhosted.org>

Content

An error handler must return a tuple consisting of a substitution string and the position at which to resume encoding. The UTF-8 codec ignores the resume position and always resumes immediately after the character that caused the error.

To reproduce, use this code:

import codecs
# Replace the bad character with b'x' and ask to resume one position past err.end.
codecs.register_error('err', lambda err: (b'x', err.end + 1))
assert u'\uDD00yz'.encode('utf8', errors='err') == b'xz'

The above code fails the assertion because the result is b'xyz'.

It works OK for some other codecs. I have not tried to make an exhaustive list of which ones work and which ones don't, so this problem might apply to other codecs as well.
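To compare codecs side by side, something like the following sketch might help. It runs the same handler against a few codecs and reports whether each one honours the resume position. The handler name 'skip_one_extra', the helper probe(), and the choice of codecs are assumptions made for illustration, not part of the original report, and the output depends on the Python version in use.

```python
import codecs


def skip_one_extra(err):
    # Substitute b'x' for the failing range and ask the codec to resume
    # one position past its end, which should skip the 'y'.
    return (b'x', err.end + 1)


# Hypothetical handler name, chosen so it does not clash with 'err' above.
codecs.register_error('skip_one_extra', skip_one_extra)


def probe(codec_names=('ascii', 'latin-1', 'utf-8')):
    """Encode the same test string with each codec and collect the results."""
    results = {}
    for name in codec_names:
        results[name] = '\udd00yz'.encode(name, errors='skip_one_extra')
    return results


if __name__ == '__main__':
    for name, encoded in probe().items():
        if encoded == b'xz':
            verdict = 'honours resume position'
        else:
            verdict = 'ignores resume position'
        print(f'{name}: {encoded!r} ({verdict})')
```

A codec that honours the returned position yields b'xz'; one that ignores it yields b'xyz'.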

History
Date	User	Action	Args
2019-11-14 14:30:25	pgimeno	set	recipients: + pgimeno
2019-11-14 14:30:24	pgimeno	set	messageid: <1573741824.98.0.130776319775.issue38800@roundup.psfhosted.org>
2019-11-14 14:30:24	pgimeno	link	issue38800 messages
2019-11-14 14:30:24	pgimeno	create