classification
Title: merge the underlying data stores of unicodedata and the str type
Type: enhancement Stage: needs patch
Components: Interpreter Core, Unicode Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, mcepl, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2018-02-05 02:19 by benjamin.peterson, last changed 2018-04-13 16:57 by mcepl.

Messages (2)
msg311634 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-02-05 02:19
Both Objects/unicodeobject.c and Modules/unicodedata.c rely on large generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, Modules/unicodedata_db.h). This separation made sense in Python 2, where Unicode was a less important part of the language than it is in Python 3 (recall that Python 2's configure script has --without-unicode!). However, in Python 3, Unicode is a core language concept and literally baked into the syntax of the language. I therefore propose moving all of unicodedata's tables and algorithms into the interpreter core proper and converting Modules/unicodedata.c into a facade. This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization. Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing a language-tailored str.lower() requires having the canonical combining class of a character available. This data currently lives only in unicodedata.
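To make the two examples above concrete, here is a minimal sketch using only the existing unicodedata module. It shows the kind of data (NFKC normalization and canonical combining classes) that today is reachable only through that module. The helper name combining_classes and the sample strings are made up for illustration; they are not part of the proposal.

```python
import unicodedata


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


# "i" followed by COMBINING DOT ABOVE (U+0307): the sort of sequence that
# language-tailored case mapping has to inspect. The base letter has
# class 0; the combining mark has a non-zero class.
sample = "i\u0307"
classes = combining_classes(sample)

# NFKC normalization, the form applied to identifiers, folds the
# LATIN SMALL LIGATURE FI (U+FB01) into the two letters "fi".
nfkc = unicodedata.normalize("NFKC", "\ufb01le")

print(classes)  # [0, 230]
print(nfkc)     # file
```

Neither value is available from a str method today, which is why core code has to import unicodedata to get at them.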
msg311655 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-05 09:13
+1. And perhaps a new C API for direct access to the Unicode DB should be provided.
History
Date User Action Args
2018-04-13 16:57:27 mcepl set nosy: + mcepl
2018-02-05 09:13:41 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg311655; components: + Interpreter Core
2018-02-05 02:19:59 benjamin.peterson create