Message 311634 - Python tracker


Author	benjamin.peterson
Recipients	benjamin.peterson, ezio.melotti, vstinner
Date	2018-02-05.02:19:58
Message-id	<1517797199.65.0.467229070634.issue32771@psf.upfronthosting.co.za>
In-reply-to

Content
Both Objects/unicodeobject.c and Modules/unicodedata.c rely on large generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, Modules/unicodedata_db.h). This separation made sense in Python 2, where Unicode was a less important part of the language than in Python 3 (recall that Python 2's configure script has --without-unicode!). However, in Python 3, Unicode is a core language concept and literally baked into the syntax of the language. I therefore propose moving all of unicodedata's tables and algorithms into the interpreter core proper and converting Modules/unicodedata.c into a facade. This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization. Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing a language-tailored str.lower() requires having the canonical combining class of a character available. This data currently lives only in unicodedata.
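
To make the two dependencies mentioned above concrete, here is a rough Python-level sketch of what the core currently has to reach into unicodedata for: NFKC normalization of identifiers, and the canonical combining class that a tailored lowercasing would consult. The helper names (normalize_identifier, is_above_mark, bound_names) are made up for illustration and are not part of CPython; the check for combining class 230 ("Above") is only one assumed piece of such a tailoring, not a full implementation.

```python
import unicodedata


def normalize_identifier(name):
    """Return the NFKC form of a name, as the parser does for identifiers."""
    return unicodedata.normalize("NFKC", name)


def is_above_mark(ch):
    """Hypothetical helper: True if ch has canonical combining class 230 (Above).

    Context-sensitive casing rules need this kind of lookup; today the data
    is only reachable through the unicodedata module.
    """
    return unicodedata.combining(ch) == 230


def bound_names(source):
    """Hypothetical helper: execute source and list the names it binds."""
    namespace = {}
    exec(source, namespace)
    return sorted(k for k in namespace if k != "__builtins__")


if __name__ == "__main__":
    # U+FB01 LATIN SMALL LIGATURE FI is folded to "fi" by NFKC.
    print(normalize_identifier("\ufb01le"))
    # The compiler applies the same normalization to identifiers in source.
    print(bound_names("\ufb01le = 1"))
    # U+0307 COMBINING DOT ABOVE versus U+0323 COMBINING DOT BELOW.
    print(is_above_mark("\u0307"), is_above_mark("\u0323"))
```

With the tables in the core, lookups like these would presumably not require importing an extension module at all.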

History
Date	User	Action	Args
2018-02-05 02:19:59	benjamin.peterson	set	recipients: + benjamin.peterson, vstinner, ezio.melotti
2018-02-05 02:19:59	benjamin.peterson	set	messageid: <1517797199.65.0.467229070634.issue32771@psf.upfronthosting.co.za>
2018-02-05 02:19:59	benjamin.peterson	link	issue32771 messages
2018-02-05 02:19:58	benjamin.peterson	create