This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: merge the underlying data stores of unicodedata and the str type
Type: enhancement
Stage: needs patch
Components: Interpreter Core, Unicode
Versions: Python 3.9

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Greg Price, benjamin.peterson, ezio.melotti, mcepl, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2018-02-05 02:19 by benjamin.peterson, last changed 2022-04-11 14:58 by admin.

Messages (11)
msg311634 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-02-05 02:19
Both Objects/unicodeobject.c and Modules/unicodedatamodule.c rely on large generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, Modules/unicodedata_db.h). This separation made sense in Python 2, where Unicode was a less important part of the language than in Python 3 (recall that Python 2's configure script had --without-unicode!). In Python 3, however, Unicode is a core language concept, literally baked into the syntax of the language.

I therefore propose moving all of unicodedata's tables and algorithms into the interpreter core proper and converting Modules/unicodedata.c into a facade. This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization. Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing language-tailored str.lower() requires having the canonical combining class of a character available; this data currently lives only in unicodedata.
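
Concretely, the properties involved are reachable today only through the unicodedata module; nothing on str exposes them. An illustrative interactive session (all calls are existing unicodedata APIs):

$ python3
>>> import unicodedata
>>> # Canonical combining class of U+0307 COMBINING DOT ABOVE, one of the
>>> # properties the language-tailored case-mapping rules depend on.
>>> unicodedata.combining('\u0307')
230
>>> # NFKC normalization of identifiers is what ast.c imports the module for.
>>> unicodedata.normalize('NFKC', '\u017f')  # LATIN SMALL LETTER LONG S
's'
>>> # The str type itself has no accessor for either property.
>>> hasattr(str, 'combining'), hasattr(str, 'normalize')
(False, False)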
msg311655 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-05 09:13
+1. And perhaps a new C API for direct access to the Unicode DB should be provided.
msg349546 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:23
> This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization.

unicodedata is not needed by default. ast.c only imports unicodedata at the first non-ASCII identifier. If your application (and all its dependencies) only uses ASCII identifiers, unicodedata is never loaded. Loading it dynamically reduces the memory footprint.

Raw measure on my Fedora 30 laptop:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10236 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10396 kB

It uses 160 KiB of memory.
msg349547 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:25
Hum, I forgot to mention that the module is compiled as a dynamic library, at least on Fedora:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '/usr/lib64/python3.7/lib-dynload/unicodedata.cpython-37m-x86_64-linux-gnu.so'>

It's a big file: 1.1 MiB.
msg349614 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 20:20
> Loading it dynamically reduces the memory footprint.

Ah, this is a good question to ask!

First, FWIW on my Debian buster desktop I get a smaller figure for `import unicodedata`: only 64 kiB.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9888 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9952 kB

But whether 64 kiB or 160 kiB, it's much smaller than the 1.1 MiB of the whole module.  Which makes sense -- there's no need to bring the whole thing into memory when we only import it, or more generally to bring into memory the parts we aren't using.  I wouldn't expect that to change materially if the tables and algorithms were built in.

Here's another experiment: suppose we load everything that ast.c needs in order to handle non-ASCII identifiers.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9800 kB

>>> là = 3
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9864 kB

So that also comes to 64 kiB.

We wouldn't want to add 64 kiB to our memory use for no reason; but I think 64 or 160 kiB is well within the range that's an acceptable cost if it gets us a significant simplification or improvement to core functionality, like Unicode.
msg349616 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 20:22
Speaking of improving functionality:

> Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing language-tailored str.lower() requires having the canonical combining class of a character available; this data currently lives only in unicodedata.

Benjamin, can you say more about the behavior you have in mind here? I don't entirely follow. (Is there, or should there be, an issue for it?)
msg349629 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-14 02:24
The goal is to implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold. To do this, you need access to certain character properties that are available in unicodedata but not in the built-in database.
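
To make that concrete, here is an illustrative session (the tailored results in the comments come from SpecialCasing.txt; they are not what Python produces today):

$ python3
>>> # Default (untailored) full lowercase mapping of U+0130
>>> # LATIN CAPITAL LETTER I WITH DOT ABOVE, as str.lower() does now:
>>> [hex(ord(c)) for c in '\u0130'.lower()]
['0x69', '0x307']
>>> # Under the Turkish/Azeri tailoring it would instead map to plain 'i',
>>> # and 'I'.lower() would map to U+0131 'ı'. The conditional rules
>>> # (After_I, Not_Before_Dot, More_Above, ...) are stated in terms of the
>>> # canonical combining class, which only unicodedata exposes:
>>> import unicodedata
>>> unicodedata.combining('\u0307')
230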
msg349650 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-14 06:06
OK, I forked off the discussion of case-mapping as #37848. I think it's probably good to first sort out what we want, before returning to how to implement it (if it's agreed that changes are desired).

Are there other areas of functionality that would be good to add to the core and that require data currently available only in unicodedata?
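
For reference, an illustrative sample of data that is currently exposed only through the module (all existing unicodedata APIs):

$ python3
>>> import unicodedata
>>> unicodedata.name('\xdf')                  # character names / lookup()
'LATIN SMALL LETTER SHARP S'
>>> unicodedata.east_asian_width('\u6728')    # East Asian width of 木
'W'
>>> unicodedata.bidirectional('\u05d0')       # bidi class of א
'R'
>>> unicodedata.normalize('NFC', 'e\u0301')   # normalization forms
'é'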
msg349670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 10:22
Note: On Debian and Ubuntu, unicodedata is a built-in module; it's not built as a dynamic library. About the RSS memory, I'm not sure how Linux accounts for the Unicode databases before they are accessed. Is it like read-only memory that is loaded on demand when accessed?
msg349784 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-15 01:22
It's also possible we're missing some logical compression opportunities by artificially partitioning the Unicode databases. Encoded optimally, the combined databases could very well take up less space than their raw sum suggests.
msg349791 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-15 04:09
> About the RSS memory, I'm not sure how Linux accounts for the Unicode databases before they are accessed. Is it like read-only memory that is loaded on demand when accessed?

It stands for "resident set size", as in "resident in memory"; and it only counts pages of real physical memory. The intention is to count up pages that the process is somehow using.

Where the definition potentially gets fuzzy is if this process and another are sharing some memory.  I don't know much about how that kind of edge case is handled.  But one thing I think it's pretty consistently good at is not counting pages that you've nominally mapped from a file, but haven't actually forced to be loaded physically into memory by actually looking at them.

That is: say you ask for a file (or some range of it) to be mapped into memory for you.  This means it's now there in the address space, and if the process does a load instruction from any of those addresses, the kernel will ensure the load instruction works seamlessly.  But: most of it won't be eagerly read from disk or loaded physically into RAM.  Rather, the kernel's counting on that load instruction causing a page fault; and its page-fault handler will take care of reading from the disk and sticking the data physically into RAM.  So until you actually execute some loads from those addresses, the data in that mapping doesn't contribute to the genuine demand for scarce physical RAM on the machine; and it also isn't counted in the RSS number.


Here's a demo!  This 262392 kiB (269 MB) Git packfile is the biggest file lying around in my CPython directory:

$ du -k .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack
262392	.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack


Open it for read -- adds 100 kiB, not sure why:

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, mmap
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9968 kB
>>> fd = os.open('.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack', os.O_RDONLY)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10068 kB


Map it into our address space -- RSS doesn't budge:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> m
<mmap.mmap object at 0x7f185b5379c0>
>>> len(m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10068 kB


Cause the process to actually look at all the data (this takes about 10 seconds, too)...

>>> sum(len(l) for l in m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	  271576 kB

RSS goes way up, by 261508 kiB!  Oddly, that's slightly less (by ~1 MB) than the file's size.


But wait, there's more. Drop that mapping, and RSS goes right back down (OK, keeps 8 kiB extra):

>>> del m
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10076 kB

... and then map the exact same file again, and it's *still* down:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10076 kB

This last step is interesting because it's a certainty that the data is still physically in memory -- this is my desktop, with plenty of free RAM.  And it's even in our address space.  But because we haven't actually loaded from those addresses, it's still in memory only at the kernel's caching whim, and so apparently our process doesn't get "charged" or "blamed" for its presence there.


In the case of running an executable with a bunch of data in it, I expect that the bulk of the data (and of the code for that matter) winds up treated very much like the file contents we mmap'd in.  It's mapped but not eagerly physically loaded; so it doesn't contribute to the RSS number, nor to the genuine demand for scarce physical RAM on the machine.


That's a bit long :-), but hopefully informative.  In short, I think for us RSS should work well as a pretty faithful measure of the real memory consumption that we want to be frugal with.
History
Date                 User               Action  Args
2022-04-11 14:58:57  admin              set     github: 76952
2019-08-15 04:09:43  Greg Price         set     messages: + msg349791
2019-08-15 01:22:38  benjamin.peterson  set     messages: + msg349784
2019-08-14 10:22:48  vstinner           set     messages: + msg349670
2019-08-14 06:06:48  Greg Price         set     messages: + msg349650
2019-08-14 02:24:36  benjamin.peterson  set     messages: + msg349629
2019-08-13 20:22:31  Greg Price         set     messages: + msg349616; versions: + Python 3.9, - Python 3.8
2019-08-13 20:20:57  Greg Price         set     messages: + msg349614
2019-08-13 12:25:04  vstinner           set     messages: + msg349547
2019-08-13 12:23:29  vstinner           set     messages: + msg349546
2019-08-13 07:56:26  Greg Price         set     nosy: + Greg Price
2018-04-13 16:57:27  mcepl              set     nosy: + mcepl
2018-02-05 09:13:41  serhiy.storchaka   set     nosy: + serhiy.storchaka; messages: + msg311655; components: + Interpreter Core
2018-02-05 02:19:59  benjamin.peterson  create