Issue 28774: Better start and end position for unicodeerror in unicode_encode_ucs1

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72960

classification

Title:	Better start and end position for unicodeerror in unicode_encode_ucs1
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	python-dev, serhiy.storchaka, vstinner, xiang.zhang
Priority:	low	Keywords:	patch

Created on 2016-11-22 14:29 by xiang.zhang, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode_encode_ucs1_error_pos.patch	xiang.zhang, 2016-11-22 14:29		review
unicode_encode_ucs1_error_pos2.patch	serhiy.storchaka, 2016-11-22 15:52		review

Pull Requests
URL	Status	Linked	Edit
PR 552	closed	dstufft, 2017-03-31 16:36

Messages (10)
msg281482 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2016-11-22 14:29
unicode_encode_ucs1 now recognizes as many characters as it can one time instead of one character a time. But the unicodeerror positions still only count 1(the second time). A similar problem reported in #28561.
msg281489 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-11-22 15:01
If I understood correctly, the patch fix the ASCII encoder to handle correctly error handlers which return non-ASCII text replacement strings. Right? I am not aware of such error handler, so I guess that it's a more a theorical fix? I really hate the code (in each encoder) which handles non-ASCII replacement strings. The code in the charmap encoder is just a mess: it uses a reentrant call to the encoder... I never understood this crazy behaviour. I guess that nobody relies on the behaviour. I hesitate to simply raise an error instead of using different rules depending on the code. Ah yes, by the way, each codec behaves differently on non-ASCII replacement strings...
msg281490 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-22 15:52
LGTM. But it is too late for beta 4. I'll commit the patch either after releasing 3.6.0 or in the 3.7 branch only. And while we are here I noticed that handling non-ASCII replacement string could be simpler.
msg281517 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-11-22 21:01
> LGTM. But it is too late for beta 4. I'll commit the patch either after releasing 3.6.0 or in the 3.7 branch only. Right now, I suggest to only commit into 3.7. Such minor bug can wait for Python 3.6.1. > And while we are here I noticed that handling non-ASCII replacement string could be simpler I also suggest to first commit unicode_encode_ucs1_error_pos.patch and then commit the other part of unicode_encode_ucs1_error_pos2.patch in a separated commit. I will be easy to backport the fix to the 3.6 branch later. Serhiy: Xiang became a core developer, are you ok if he push himself unicode_encode_ucs1_error_pos.patch to default tomorrow, and later you rebase your patch on top of that? I'm not super confident because the fix doesn't come with an unit test, but it's ok if Serhiy reviewed it :-)
msg281519 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-22 21:12
This is not a bug itself. It seems to me that at worst case the current code is less efficient with non-standard error handler than it can be. I would commit the path to the 3.6 branch before beta 4 as it is nice and simple additional to already added optimization. But it is too late now, at last beta. Xiang can commit his patch to 3.7. No need to backport it to 3.6 (if I didn't miss something).
msg281520 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-11-22 21:16
> No need to backport it to 3.6 (if I didn't miss something). Sorry, I misunderstood the issue. It's an enhancement, so for 3.7 only. Right, 3.6 is now almost frozen, only major bug fixes blocking the release are accepted now (in short). Regular bugfixes should wait for 3.6.1.
msg281557 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-11-23 12:19
New changeset 3d660ed2a60e by Xiang Zhang in branch 'default': Issue #28774: Fix start/end pos in unicode_encode_ucs1(). https://hg.python.org/cpython/rev/3d660ed2a60e
msg281558 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2016-11-23 12:22
Thanks Serhiy and Victor. Finished my first commit. :-) Now assign back to Serhiy and pos2 LGTM.
msg281560 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-11-23 13:07
Congratulations, Xiang!
msg281562 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-11-23 13:17
New changeset 3addf93f4111 by Serhiy Storchaka in branch 'default': Issue #28774: Simplified encoding a str result of an error handler in ASCII https://hg.python.org/cpython/rev/3addf93f4111

History
Date	User	Action	Args
2022-04-11 14:58:39	admin	set	github: 72960
2017-03-31 16:36:29	dstufft	set	pull_requests: + pull_request1024
2016-11-23 13:19:32	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: commit review -> resolved
2016-11-23 13:17:31	python-dev	set	messages: + msg281562
2016-11-23 13:07:24	serhiy.storchaka	set	messages: + msg281560
2016-11-23 12:22:11	xiang.zhang	set	assignee: xiang.zhang -> serhiy.storchaka messages: + msg281558
2016-11-23 12:19:27	python-dev	set	nosy: + python-dev messages: + msg281557
2016-11-22 21:16:05	vstinner	set	messages: + msg281520
2016-11-22 21:12:32	serhiy.storchaka	set	assignee: serhiy.storchaka -> xiang.zhang messages: + msg281519 stage: patch review -> commit review
2016-11-22 21:01:33	vstinner	set	messages: + msg281517
2016-11-22 15:52:10	serhiy.storchaka	set	files: + unicode_encode_ucs1_error_pos2.patch messages: + msg281490
2016-11-22 15:01:34	vstinner	set	versions: - Python 3.6
2016-11-22 15:01:00	vstinner	set	messages: + msg281489
2016-11-22 14:54:16	serhiy.storchaka	set	priority: normal -> low assignee: serhiy.storchaka
2016-11-22 14:29:22	xiang.zhang	create