classification
Title: Better start and end position for unicodeerror in unicode_encode_ucs1
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: haypo, python-dev, serhiy.storchaka, xiang.zhang
Priority: low Keywords: patch

Created on 2016-11-22 14:29 by xiang.zhang, last changed 2017-03-31 16:36 by dstufft. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_encode_ucs1_error_pos.patch xiang.zhang, 2016-11-22 14:29 review
unicode_encode_ucs1_error_pos2.patch serhiy.storchaka, 2016-11-22 15:52 review
Pull Requests
URL Status Linked Edit
PR 552 closed dstufft, 2017-03-31 16:36
Messages (10)
msg281482 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-22 14:29
unicode_encode_ucs1 now recognizes as many characters as it can one time instead of one character a time. But the unicodeerror positions still only count 1(the second time). A similar problem reported in #28561.
msg281489 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 15:01
If I understood correctly, the patch fix the ASCII encoder to handle correctly error handlers which return non-ASCII text replacement strings. Right?

I am not aware of such error handler, so I guess that it's a more a theorical fix?

I really hate the code (in each encoder) which handles non-ASCII replacement strings. The code in the charmap encoder is just a mess: it uses a reentrant call to the encoder... I never understood this crazy behaviour. I guess that nobody relies on the behaviour. I hesitate to simply raise an error instead of using different rules depending on the code. Ah yes, by the way, each codec behaves differently on non-ASCII replacement strings...
msg281490 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-22 15:52
LGTM. But it is too late for beta 4. I'll commit the patch either after releasing 3.6.0 or in the 3.7 branch only.

And while we are here I noticed that handling non-ASCII replacement string could be simpler.
msg281517 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 21:01
> LGTM. But it is too late for beta 4. I'll commit the patch either after releasing 3.6.0 or in the 3.7 branch only.

Right now, I suggest to only commit into 3.7. Such minor bug can wait for Python 3.6.1.


> And while we are here I noticed that handling non-ASCII replacement string could be simpler

I also suggest to first commit unicode_encode_ucs1_error_pos.patch and then commit the other part of unicode_encode_ucs1_error_pos2.patch in a separated commit. I will be easy to backport the fix to the 3.6 branch later.

Serhiy: Xiang became a core developer, are you ok if he push himself  unicode_encode_ucs1_error_pos.patch to default tomorrow, and later you rebase your patch on top of that?

I'm not super confident because the fix doesn't come with an unit test, but it's ok if Serhiy reviewed it :-)
msg281519 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-22 21:12
This is not a bug itself. It seems to me that at worst case the current code is less efficient with non-standard error handler than it can be. I would commit the path to the 3.6 branch before beta 4 as it is nice and simple additional to already added optimization. But it is too late now, at last beta.

Xiang can commit his patch to 3.7. No need to backport it to 3.6 (if I didn't miss something).
msg281520 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 21:16
> No need to backport it to 3.6 (if I didn't miss something).

Sorry, I misunderstood the issue. It's an enhancement, so for 3.7 only.

Right, 3.6 is now almost frozen, only major bug fixes blocking the release are accepted now (in short). Regular bugfixes should wait for 3.6.1.
msg281557 - (view) Author: Roundup Robot (python-dev) Date: 2016-11-23 12:19
New changeset 3d660ed2a60e by Xiang Zhang in branch 'default':
Issue #28774: Fix start/end pos in unicode_encode_ucs1().
https://hg.python.org/cpython/rev/3d660ed2a60e
msg281558 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-23 12:22
Thanks Serhiy and Victor. Finished my first commit. :-)

Now assign back to Serhiy and pos2 LGTM.
msg281560 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-23 13:07
Congratulations, Xiang!
msg281562 - (view) Author: Roundup Robot (python-dev) Date: 2016-11-23 13:17
New changeset 3addf93f4111 by Serhiy Storchaka in branch 'default':
Issue #28774: Simplified encoding a str result of an error handler in ASCII
https://hg.python.org/cpython/rev/3addf93f4111
History
Date User Action Args
2017-03-31 16:36:29dstufftsetpull_requests: + pull_request1024
2016-11-23 13:19:32serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: commit review -> resolved
2016-11-23 13:17:31python-devsetmessages: + msg281562
2016-11-23 13:07:24serhiy.storchakasetmessages: + msg281560
2016-11-23 12:22:11xiang.zhangsetassignee: xiang.zhang -> serhiy.storchaka
messages: + msg281558
2016-11-23 12:19:27python-devsetnosy: + python-dev
messages: + msg281557
2016-11-22 21:16:05hayposetmessages: + msg281520
2016-11-22 21:12:32serhiy.storchakasetassignee: serhiy.storchaka -> xiang.zhang
messages: + msg281519
stage: patch review -> commit review
2016-11-22 21:01:33hayposetmessages: + msg281517
2016-11-22 15:52:10serhiy.storchakasetfiles: + unicode_encode_ucs1_error_pos2.patch

messages: + msg281490
2016-11-22 15:01:34hayposetversions: - Python 3.6
2016-11-22 15:01:00hayposetmessages: + msg281489
2016-11-22 14:54:16serhiy.storchakasetpriority: normal -> low
assignee: serhiy.storchaka
2016-11-22 14:29:22xiang.zhangcreate