
classification
Title: Resume position for UTF-8 codec error handler not working
Type: behavior Stage: resolved
Components: Versions:
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: pgimeno, serhiy.storchaka
Priority: normal Keywords:

Created on 2019-11-14 14:30 by pgimeno, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (4)
msg356610 - Author: Pedro Gimeno (pgimeno) Date: 2019-11-14 14:30
A codec error handler must return a tuple consisting of a substitution string and the position at which to resume. In the case of the UTF-8 codec, the resume position is ignored, and encoding always resumes immediately after the character that caused the error.

To reproduce, use this code:

import codecs
codecs.register_error('err', lambda err: (b'x', err.end + 1))
assert repr(u'\uDD00yz'.encode('utf8', errors='err')) == b'xz'

The above code fails the assertion because the result is b'xyz'.

It works OK for some other codecs. I have not tried to make an exhaustive list of which ones work and which ones don't, therefore this problem might apply to others.
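
For reference, the following is a minimal sketch of the same reproduction with the assertion written as a direct bytes comparison, assuming a recent Python 3 interpreter (the thread reports the expected result on 3.7). The handler name `skip_one_extra` is hypothetical and used only for illustration.

```python
import codecs


def skip_one_extra(err):
    # err is a UnicodeEncodeError; err.end is the index just past the
    # unencodable character. Returning err.end + 1 asks the codec to
    # skip one additional character before resuming.
    return (b'x', err.end + 1)


# 'skip_one_extra' is a hypothetical handler name chosen for this sketch.
codecs.register_error('skip_one_extra', skip_one_extra)

# The lone surrogate is replaced by b'x', 'y' is skipped, 'z' is encoded.
result = '\udd00yz'.encode('utf-8', errors='skip_one_extra')
print(result)
```

If the codec honours the resume position, `result` should be `b'xz'`; a result of `b'xyz'` would indicate that the position was ignored.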
msg356611 - Author: Pedro Gimeno (pgimeno) Date: 2019-11-14 14:32
I forgot the quotes in the assertion; it should have been "b'xz'".
msg356620 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-14 18:13
It works for me (after fixing the assertion).

What Python version do you use? In Python 2 u'\uDD00' is encodable to UTF-8, so the error handler is not called. u'\uDD00yz'.encode('utf8') gives '\xed\xb4\x80yz'.
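
The difference can be checked from the Python 3 side with a short sketch like the one below. It assumes Python 3, where a lone surrogate is rejected by the strict UTF-8 encoder, and uses the standard `surrogatepass` error handler to obtain the same bytes that Python 2 produces by default.

```python
text = '\udd00yz'

# In Python 3 the strict UTF-8 encoder rejects a lone surrogate, so a
# registered error handler would be invoked at this point.
try:
    text.encode('utf-8')
    strict_failed = False
    span = None
except UnicodeEncodeError as exc:
    strict_failed = True
    span = (exc.start, exc.end)  # range of the offending character(s)

# 'surrogatepass' encodes the surrogate as if it were an ordinary code
# point, matching the Python 2 output quoted above.
passed_through = text.encode('utf-8', errors='surrogatepass')
print(strict_failed, span, passed_through)
```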
msg356626 - Author: Pedro Gimeno (pgimeno) Date: 2019-11-14 20:45
Python 3.5 from Debian stretch (oldstable). You're right, I can't reproduce it in 3.7 from Buster. Sorry for the bogus report.
History
Date                 User              Action  Args
2022-04-11 14:59:23  admin             set     github: 82981
2019-11-15 08:43:39  serhiy.storchaka  set     status: open -> closed; resolution: out of date; stage: resolved
2019-11-14 20:45:21  pgimeno           set     messages: + msg356626
2019-11-14 18:13:23  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg356620
2019-11-14 14:32:42  pgimeno           set     messages: + msg356611
2019-11-14 14:30:24  pgimeno           create