Title: Report surrogate characters range in utf8_encoder
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.7, Python 3.6
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: python-dev, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2016-10-30 07:45 by xiang.zhang, last changed 2016-10-30 16:28 by serhiy.storchaka. This issue is now closed.

File name Uploaded
utf8_encoder.patch xiang.zhang, 2016-10-30 07:45
utf8_encoder_v2.patch xiang.zhang, 2016-10-30 08:55
Messages (3)
msg279712 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-10-30 07:45
In utf8_encoder, when a codec error handler returns a string with non-ASCII characters, a UnicodeEncodeError is raised, but the start and end positions it reports are not accurate. This looks like an oversight left over from earlier changes. Previously, utf8_encoder recognized only one surrogate character at a time. Since 2b5357b38366, it tries to recognize as many consecutive surrogates as possible at once, so the error should report that whole range. The patch also includes some cleanup.
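A minimal sketch of how the reported range can be observed from Python, not taken from the patch itself. The handler name `demo_nonascii` and the helper `probe` are hypothetical, chosen only for illustration. The handler returns a non-ASCII replacement string, which the UTF-8 encoder rejects with a UnicodeEncodeError; the sketch records the range the handler was given and the range the final exception carries, so the two can be compared.

```python
import codecs

# Ranges passed to the handler, recorded for inspection.
_seen = []


def _demo_nonascii(exc):
    # Hypothetical handler: remember the range we were asked to handle,
    # then return a non-ASCII replacement, which the UTF-8 encoder rejects.
    _seen.append((exc.start, exc.end))
    return ("\u20ac", exc.end)


codecs.register_error("demo_nonascii", _demo_nonascii)


def probe(text):
    """Return (ranges seen by the handler, range reported by the error)."""
    _seen.clear()
    try:
        text.encode("utf-8", "demo_nonascii")
    except UnicodeEncodeError as err:
        return list(_seen), (err.start, err.end)
    return list(_seen), None


if __name__ == "__main__":
    # Three consecutive lone surrogates surrounded by ASCII characters.
    seen, reported = probe("a\udc80\udc81\udc82b")
    print("handler saw:", seen)
    print("error reported:", reported)
```

On an interpreter that includes the fix, the range reported by the exception is expected to match the range handed to the handler, covering the whole run of surrogates rather than a single character.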
msg279728 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-30 16:26
New changeset 542065b03c10 by Serhiy Storchaka in branch '3.6':
Issue #28561: Clean up UTF-8 encoder: remove dead code, update comments, etc.

New changeset ee3670d9bda6 by Serhiy Storchaka in branch 'default':
Issue #28561: Clean up UTF-8 encoder: remove dead code, update comments, etc.
msg279729 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-30 16:28
Thanks, Xiang. Yes, this is all a follow-up to issue25267.
