We really need a new API for error handlers, one that uses Python objects instead of Py_UNICODE* strings, and code point indexes instead of UTF-16 unit indexes (indexes into the Py_UNICODE* buffer). It is also inefficient to convert the whole string to Py_UNICODE at the first encode/decode error.
I added private APIs; we may make them public later:
* _PyUnicode_AsASCIIString()
* _PyUnicode_AsLatin1String()
* _PyUnicode_AsUTF8String()
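For comparison, the Python-level error handler API already works the way the new C API should: the handler receives the exception object, whose start/end attributes are plain indexes into exc.object, and returns a (replacement, resume position) pair. A minimal sketch (the handler name "marker" is invented for illustration):

```python
import codecs

# Sketch: a custom decode error handler registered at the Python level.
# exc.object is the bytes being decoded; exc.start/exc.end delimit the
# undecodable range. The handler returns (replacement, resume_position).
def replace_with_marker(exc):
    if isinstance(exc, UnicodeDecodeError):
        return ("\ufffd", exc.end)  # insert U+FFFD, skip the bad bytes
    raise exc

codecs.register_error("marker", replace_with_marker)

print(b"ab\xffcd".decode("ascii", errors="marker"))  # ab\ufffdcd
```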
--
Martin answered me by mail:
Would you like to work on this? Some thoughts:
- encoding error handlers are easier than decoding, since the encoding
error API uses Py_UNICODE* for almost no good reason (except to pass
substrings into the exception object, which is better done with
PyUnicode_Substring). Decoding has the issue that the error handler
may produce a replacement string which then needs to be inserted into
the output.
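The encoding side can also be illustrated from Python, where the handler already receives an exception object and can use ordinary slicing where the C code would call PyUnicode_Substring. A sketch of a custom encode error handler (the name "xml" is invented; CPython's built-in "xmlcharrefreplace" handler does the same job):

```python
import codecs

# Sketch: a custom *encode* error handler. A str replacement is
# re-encoded by the codec; a bytes replacement is inserted verbatim.
def xml_escape(exc):
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]   # the unencodable substring
        repl = "".join("&#%d;" % ord(ch) for ch in bad)
        return (repl, exc.end)
    raise exc

codecs.register_error("xml", xml_escape)

print("caf\xe9".encode("ascii", errors="xml"))  # b'caf&#233;'
```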
- for decoding, I suggest duplicating the error handling utility
function into a version that operates on Unicode objects only. Then
port one codec at a time, and ultimately remove the then-unused
Py_UNICODE function.
- inserting an error handler result into a string may widen the
string. I can see two approaches:
a) write decoders in Py_UCS4. This is perhaps best for the rarely-used
codecs, such as UTF-7.
b) write the codecs so that they do incremental widening. Start off
with a Py_UCS1 buffer, and check each decoded character whether it
is out of range. When you get an error handler result, check
maxchar and widen the result accordingly.
c) in principle, there is a third approach: run over the string once,
collect all error handler results. Then allocate the output string,
decode again, pasting the replacement strings into the output
interleaved with regular decoded chars. This seems too complicated
to implement.
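Approach (b) above can be modeled in a few lines of Python, assuming a hypothetical one-byte-at-a-time codec; the real CPython implementation would resize and convert a C buffer (Py_UCS1 to Py_UCS2 to Py_UCS4) rather than Python arrays:

```python
from array import array

# Toy model of incremental widening: start with a 1-byte buffer and
# widen it whenever a decoded code point is out of range for the
# current width. decode_byte is a hypothetical byte-to-code-point codec.
def decode_with_widening(data, decode_byte):
    kind = 1                      # current buffer width in bytes: 1, 2 or 4
    buf = array('B')              # UCS1-like buffer
    for byte in data:
        cp = decode_byte(byte)
        while cp > (1 << (8 * kind)) - 1:
            kind = 2 if kind == 1 else 4
            buf = array('H' if kind == 2 else 'I', buf)  # convert buffer
        buf.append(cp)
    return "".join(map(chr, buf))

# toy "codec": byte 0xFF decodes to U+20AC, everything else to itself
result = decode_with_widening(b"a\xffb", lambda b: 0x20AC if b == 0xFF else b)
print(result)  # a€b — the buffer widened to 2 bytes mid-decode
```

The same check would apply to an error handler result: compute its maximum character, widen the buffer if needed, then append it.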