Issue 42157: Cleanup the unicodedata module

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86323

classification

Title:	Cleanup the unicodedata module
Type:		Stage:	resolved
Components:	Library (Lib), Unicode	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, lemburg, vstinner
Priority:	normal	Keywords:	patch

Created on 2020-10-26 17:05 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 22990	merged	vstinner, 2020-10-26 17:57
PR 22991	merged	vstinner, 2020-10-26 18:21
PR 22994	merged	vstinner, 2020-10-26 22:30

Messages (8)
msg379673 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-26 17:05
Mohamed Koubaa and I are trying to convert the unicodedata module to the multi-phase initialization API (PEP 489) and to convert the UCD static type to a heap type in bpo-1635741.

The unicodedata extension module has some special cases:

* It has a C API exposed in Python as the "unicodedata.ucnhash_CAPI" PyCapsule object.
* In C, the unicodedata_functions array is used to define module functions AND unicodedata.UCD methods. It is unusual to do that, and it makes the conversion more tricky.
* Most C functions have a "self" parameter which is used to choose between the current version of the Unicode database and the version 3.2.0 ("unicodedata.ucd_3_2_0").

There is also a unicodedata.UCD type which cannot be instantiated in Python. It is only used to create the unicodedata.ucd_3_2_0 instance.

In the commit 47e1afd2a1793b5818a16c41307a4ce976331649, I moved the private _PyUnicode_Name_CAPI structure to the internal C API.

In the commit ddc0dd001a4224274ba6f83568b45a1dd88c6fc6, Mohamed added a ucd_type parameter to the UCD_Check() macro. I asked him to do that.

In the commit e6b8c5263a7fcf5b95d0fd4c900e5949eeb6630d, I added a "global module state" and a "state" parameter to most functions. This change prepares the code base to pass a UCD type instance to functions, so that there can be more than one UCD type once it is converted to a heap type: one type per module instance.

The technical problem is that unicodedata_functions is used for both module functions and UCD methods. Duplicating unicodedata_functions requires duplicating a lot of code and comments.

Sadly, it does not seem easily possible to retrieve the "module state" ("state" variable) in functions, since unicodedata_functions is reused for module functions and UCD methods.

Using "defining_class" in Argument Clinic would require duplicating all unicodedata_functions functions: one flavor for module functions, one flavor for the UCD type. It would also require duplicating all docstrings, which increases the maintenance burden and introduces a risk of inconsistencies.

Maybe we could introduce a new UCD instance which would be mapped to the current Unicode Character Database version, with module functions being bound methods of this instance. But it sounds overkill to me.

By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2.

I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

For now, we can convert unicodedata to the multi-phase initialization API (PEP 489) and convert the UCD static type to a heap type by avoiding references to the UCD type. Rather than checking if self is an instance of UCD_Type, we can check if it is not a module (PyModule_Check). This is exactly what Mohamed proposed in the first place, but I misunderstood the whole issue and gave him bad advice.
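The shared function table described in this message is visible from Python: the same names work as module functions (current database) and as methods of unicodedata.ucd_3_2_0 (Unicode 3.2.0). The sketch below shows that, and adds a rough Python analogue of the proposed "self is not a module" check. The helper uses_legacy_database() is hypothetical and exists only for illustration; the real check is done in C with PyModule_Check.

```python
import types
import unicodedata

# Module-level functions answer from the current Unicode Character Database.
current_version = unicodedata.unidata_version

# The same function names exist as methods on the single UCD instance,
# which answers from the Unicode 3.2.0 database instead.
legacy = unicodedata.ucd_3_2_0
legacy_version = legacy.unidata_version

# U+0F3A has a different "mirrored" property in Unicode 3.2.0
# than in later versions of the database.
mirrored_now = unicodedata.mirrored("\u0f3a")
mirrored_legacy = legacy.mirrored("\u0f3a")


def uses_legacy_database(self_obj):
    """Hypothetical analogue of the C-level check discussed above.

    Instead of asking "is self a UCD instance?", ask "is self NOT a module?".
    Anything that is not the module object is treated as the 3.2.0 database.
    """
    return not isinstance(self_obj, types.ModuleType)


print(current_version, legacy_version)
print(mirrored_now, mirrored_legacy)
print(uses_legacy_database(unicodedata), uses_legacy_database(legacy))
```

The point of the "not a module" test is that it needs no reference to the UCD type itself, which is what makes the heap type conversion simpler.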
msg379674 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2020-10-26 17:39
On 26.10.2020 18:05, STINNER Victor wrote:
> By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

The version 3.2.0 is needed for IDNA compatibility:

IDNA 2003: https://tools.ietf.org/html/rfc3490 requires Unicode 3.2 mappings
IDNA 2008: https://tools.ietf.org/html/rfc5890 et al. requires Unicode 5.2+ mappings

Python only supports IDNA 2003 AFAIK, and the ucd_3_2_0 tag was added by Martin von Löwis to support it even after moving forward to more recent Unicode versions.

IDNA 2008 seems to have mechanisms to also work for Unicode versions later than 5.2, but I don't know the details. See this TR for details on how IDNA compatibility is handled: http://www.unicode.org/reports/tr46/

All that said, it may actually be better to deprecate IDNA 2003 support first and direct people to https://pypi.org/project/idna/, or to incorporate that package into the stdlib instead of IDNA 2003. The special tag can then be dropped.
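For context, here is a minimal sketch of the IDNA 2003 support mentioned above, using the stdlib "idna" codec. The domain name is an arbitrary example and does not come from this issue.

```python
# The stdlib "idna" codec implements IDNA 2003 (RFC 3490), which is the
# consumer of the Unicode 3.2.0 data discussed in this message.
domain = "bücher.example"  # arbitrary example name

encoded = domain.encode("idna")    # ToASCII: nameprep + Punycode per label
decoded = encoded.decode("idna")   # ToUnicode: back to the Unicode form

print(encoded)
print(decoded)
```

Code that needs IDNA 2008 behaviour would have to use a third-party package such as the one linked above; the stdlib codec does not provide it.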
msg379678 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-26 18:19
New changeset 920cb647ba23feab7987d0dac1bd63bfc2ffc4c0 by Victor Stinner in branch 'master': bpo-42157: unicodedata avoids references to UCD_Type (GH-22990) https://github.com/python/cpython/commit/920cb647ba23feab7987d0dac1bd63bfc2ffc4c0
msg379693 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-26 22:19
New changeset c8c4200b65b2159bbb13cee10d67dfb3676fef26 by Victor Stinner in branch 'master': bpo-42157: Convert unicodedata.UCD to heap type (GH-22991) https://github.com/python/cpython/commit/c8c4200b65b2159bbb13cee10d67dfb3676fef26
msg379699 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-26 22:51
> By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

Oh, it is used by the IDNA encoding (encodings.idna module) and the stringprep module (which is used by the encodings.idna module).
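The dependency can be observed from Python. The sketch below assumes that U+0221 is a suitable probe: it is listed as unassigned in RFC 3454 table A.1 (which is defined against Unicode 3.2) but is an assigned letter in later Unicode versions.

```python
import stringprep
import unicodedata

# U+0221 LATIN SMALL LETTER D WITH CURL: unassigned in Unicode 3.2,
# assigned as a lowercase letter in later versions.
ch = "\u0221"

# stringprep answers according to the Unicode 3.2 tables of RFC 3454.
unassigned_for_stringprep = stringprep.in_table_a1(ch)

# The same question asked of both databases exposed by unicodedata.
category_now = unicodedata.category(ch)
category_legacy = unicodedata.ucd_3_2_0.category(ch)

print(unassigned_for_stringprep, category_now, category_legacy)
```

Removing ucd_3_2_0 would therefore change what stringprep, and in turn encodings.idna, can report for such code points.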
msg379719 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-27 02:59
> The version 3.2.0 is needed for IDNA compatibility (...)

Oh, I missed your comment. I also discovered it by trying to remove it :-)

So I think that the last thing to do for this issue is to remove unicodedata.ucnhash_CAPI: PR 22994.
msg379727 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-27 03:36
New changeset 84f7382215b9e024a5590454726b6ae4b0ca70a0 by Victor Stinner in branch 'master': bpo-42157: Rename unicodedata.ucnhash_CAPI (GH-22994) https://github.com/python/cpython/commit/84f7382215b9e024a5590454726b6ae4b0ca70a0
msg379728 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-27 03:40
I kept unicodedata.ucd_3_2_0 and added a comment to explain why it's still relevant in 2020. I'm done with tasks listed in this issue, so I close it.

History
Date	User	Action	Args
2022-04-11 14:59:37	admin	set	github: 86323
2020-12-18 00:42:47	vstinner	link	issue15712 superseder
2020-10-27 03:40:56	vstinner	set	status: open -> closed resolution: fixed messages: + msg379728 stage: patch review -> resolved
2020-10-27 03:36:30	vstinner	set	messages: + msg379727
2020-10-27 02:59:19	vstinner	set	nosy: + ezio.melotti messages: + msg379719 components: + Unicode
2020-10-26 22:51:06	vstinner	set	messages: + msg379699
2020-10-26 22:30:53	vstinner	set	pull_requests: + pull_request21908
2020-10-26 22:19:31	vstinner	set	messages: + msg379693
2020-10-26 18:21:34	vstinner	set	pull_requests: + pull_request21906
2020-10-26 18:19:48	vstinner	set	messages: + msg379678
2020-10-26 17:57:16	vstinner	set	keywords: + patch stage: patch review pull_requests: + pull_request21905
2020-10-26 17:39:07	lemburg	set	nosy: + lemburg messages: + msg379674
2020-10-26 17:05:20	vstinner	create