classification
Title: merge the underlying data stores of unicodedata and the str type
Type: enhancement
Stage: needs patch
Components: Interpreter Core, Unicode
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Greg Price, benjamin.peterson, ezio.melotti, mcepl, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2018-02-05 02:19 by benjamin.peterson, last changed 2019-08-15 04:09 by Greg Price.

Messages (11)
msg311634 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-02-05 02:19
Both Objects/unicodeobject.c and Modules/unicodedatamodule.c rely on large generated databases (Objects/unicodetype_db.h, Modules/unicodename_db.h, Modules/unicodedata_db.h). This separation made sense in Python 2, where Unicode was a less important part of the language than in Python 3 (recall that Python 2's configure script had --without-unicode!). In Python 3, however, Unicode is a core language concept, literally baked into the syntax of the language. I therefore propose moving all of unicodedata's tables and algorithms into the interpreter core proper and converting Modules/unicodedata.c into a facade. This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization. Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing language-tailored str.lower() requires the canonical combining class of a character, and that data currently lives only in unicodedata.
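(A quick illustration of the data gap: the canonical combining class mentioned above is exposed by the stdlib unicodedata module but not by str itself.)

```python
import unicodedata

# Canonical combining class: nonzero for combining marks, 0 for starters.
# str has no way to get at this property; only unicodedata does.
print(unicodedata.combining("\u0301"))  # COMBINING ACUTE ACCENT -> 230
print(unicodedata.combining("a"))       # a starter -> 0
```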
msg311655 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-05 09:13
+1. And perhaps a new C API for direct access to the Unicode DB should be provided.
msg349546 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:23
> This will remove awkward maneuvers like ast.c importing unicodedata in order to perform normalization.

unicodedata is not needed by default. ast.c only imports unicodedata when it encounters the first non-ASCII identifier. If your application (and all its dependencies) only use ASCII identifiers, unicodedata is never loaded. Loading it dynamically reduces the memory footprint.
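(That lazy import can be observed directly. A sketch: exec/compile is used so the non-ASCII identifier is only compiled at runtime, after the first check; whether the first check prints False depends on nothing else in the process having imported unicodedata already.)

```python
import sys

# The running source here is ASCII-only, so unicodedata is normally
# not loaded yet at this point.
print("unicodedata" in sys.modules)

# Compiling a non-ASCII identifier makes the compiler import
# unicodedata in order to NFKC-normalize the name.
exec(compile("l\u00e0 = 3", "<demo>", "exec"))
print("unicodedata" in sys.modules)  # True
```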

Raw measure on my Fedora 30 laptop:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10236 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10396 kB

It uses 160 KiB of memory.
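(The repeated VmRSS grep in these measurements can be wrapped in a small helper. A sketch; Linux-only, since it reads /proc, and `rss_kib` is a hypothetical name, not anything in the stdlib.)

```python
import os

def rss_kib():
    # Resident set size of the current process in kiB, or None if
    # /proc/<pid>/status is unavailable (non-Linux systems).
    try:
        with open(f"/proc/{os.getpid()}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

before = rss_kib()
import unicodedata
after = rss_kib()
if before is not None and after is not None:
    print(f"import unicodedata added ~{after - before} kiB")
```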
msg349547 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-13 12:25
Hum, I forgot to mention that the module is compiled as a dynamically loaded library, at least on Fedora:

$ python3
Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '/usr/lib64/python3.7/lib-dynload/unicodedata.cpython-37m-x86_64-linux-gnu.so'>

It's a big file: 1.1 MiB.
msg349614 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 20:20
> Loading it dynamically reduces the memory footprint.

Ah, this is a good question to ask!

First, FWIW on my Debian buster desktop I get a smaller figure for `import unicodedata`: only 64 kiB.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9888 kB

>>> import unicodedata
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9952 kB

But whether 64 kiB or 160 kiB, it's much smaller than the 1.1 MiB of the whole module.  Which makes sense -- there's no need to bring the whole thing into memory when we only import it, or generally to bring into memory the parts we aren't using.  I wouldn't expect that to change materially if the tables and algorithms were built in.

Here's another experiment: suppose we load everything that ast.c needs in order to handle non-ASCII identifiers.

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9800 kB

>>> là = 3
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9864 kB

So that also comes to 64 kiB.

We wouldn't want to add 64 kiB to our memory use for no reason; but I think 64 or 160 kiB is well within the range that's an acceptable cost if it gets us a significant simplification or improvement to core functionality, like Unicode.
msg349616 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-13 20:22
Speaking of improving functionality:

> Having unicodedata readily accessible to the str type would also permit a higher-fidelity Unicode implementation. For example, implementing language-tailored str.lower() requires the canonical combining class of a character, and that data currently lives only in unicodedata.

Benjamin, can you say more about the behavior you have in mind here? I don't entirely follow. (Is or should there be an issue for it?)
msg349629 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-14 02:24
The goal is to implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold. To do this, you need access to certain character properties available in unicodedata but not the builtin database.
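(For a concrete instance of the gap: str.lower() applies the default, untailored mappings -- including the one unconditional multi-character mapping from SpecialCasing.txt -- but there is no way to request the Turkish-tailored mappings such as I -> U+0131 dotless i. A sketch of current behavior:)

```python
# Default (untailored) case mapping, which is all str.lower() does today.
# A Turkish-tailored lower() would map 'I' to U+0131 (dotless i) instead.
print("I".lower())  # 'i'

# The unconditional SpecialCasing.txt mapping *is* applied:
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to two code
# points, 'i' followed by U+0307 COMBINING DOT ABOVE.
print(len("\u0130".lower()))           # 2
print("\u0130".lower() == "i\u0307")   # True
```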
msg349650 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-14 06:06
OK, I forked off the discussion of case-mapping as #37848. I think it's probably good to first sort out what we want before returning to how to implement it (if it's agreed that changes are desired).

Are there other areas of functionality that would be good to add in the core, and require data that's currently only in unicodedata?
msg349670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 10:22
Note: On Debian and Ubuntu, unicodedata is a built-in module; it's not built as a dynamic library. About the RSS memory: I'm not sure how Linux accounts for the Unicode databases before they are accessed. Are they treated like read-only memory that is loaded on demand when accessed?
msg349784 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-08-15 01:22
It's also possible we're missing some logical compression opportunities by artificially partitioning the Unicode databases. Encoded optimally, the combined databases could very well take up less space than their raw sum suggests.
msg349791 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-15 04:09
> About the RSS memory, I'm not sure how Linux accounts the Unicode databases before they are accessed. Is it like read-only memory loaded on demand when accessed?

RSS stands for "resident set size", as in "resident in memory", and it only counts pages of real physical memory. The intention is to count the pages that the process is somehow using.

Where the definition potentially gets fuzzy is if this process and another are sharing some memory.  I don't know much about how that kind of edge case is handled.  But one thing I think it's pretty consistently good at is not counting pages that you've nominally mapped from a file, but haven't actually forced to be loaded physically into memory by actually looking at them.

That is: say you ask for a file (or some range of it) to be mapped into memory for you.  This means it's now there in the address space, and if the process does a load instruction from any of those addresses, the kernel will ensure the load instruction works seamlessly.  But: most of it won't be eagerly read from disk or loaded physically into RAM.  Rather, the kernel's counting on that load instruction causing a page fault; and its page-fault handler will take care of reading from the disk and sticking the data physically into RAM.  So until you actually execute some loads from those addresses, the data in that mapping doesn't contribute to the genuine demand for scarce physical RAM on the machine; and it also isn't counted in the RSS number.


Here's a demo!  This 262392 kiB (269 MB) Git packfile is the biggest file lying around in my CPython directory:

$ du -k .git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack
262392	.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack


Open it for read -- adds 100 kiB, not sure why:

$ python
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, mmap
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	    9968 kB
>>> fd = os.open('.git/objects/pack/pack-0e4acf3b2d8c21849bb11d875bc14b4d62dc7ab1.pack', os.O_RDONLY)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10068 kB


Map it into our address space -- RSS doesn't budge:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> m
<mmap.mmap object at 0x7f185b5379c0>
>>> len(m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10068 kB


Cause the process to actually look at all the data (this takes about 10 seconds, too)...

>>> sum(len(l) for l in m)
268684419
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	  271576 kB

RSS goes way up, by 261508 kiB!  Oddly slightly less (by ~1MB) than the file's size.


But wait, there's more. Drop that mapping, and RSS goes right back down (OK, keeps 8 kiB extra):

>>> del m
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10076 kB

... and then map the exact same file again, and it's *still* down:

>>> m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
>>> os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")
VmRSS:	   10076 kB

This last step is interesting because it's a certainty that the data is still physically in memory -- this is my desktop, with plenty of free RAM.  And it's even in our address space.  But because we haven't actually loaded from those addresses, it's still in memory only at the kernel's caching whim, and so apparently our process doesn't get "charged" or "blamed" for its presence there.


In the case of running an executable with a bunch of data in it, I expect that the bulk of the data (and of the code for that matter) winds up treated very much like the file contents we mmap'd in.  It's mapped but not eagerly physically loaded; so it doesn't contribute to the RSS number, nor to the genuine demand for scarce physical RAM on the machine.


That's a bit long :-), but hopefully informative.  In short, I think for us RSS should work well as a pretty faithful measure of the real memory consumption that we want to be frugal with.
History
Date                 User               Action  Args
2019-08-15 04:09:43  Greg Price         set     messages: + msg349791
2019-08-15 01:22:38  benjamin.peterson  set     messages: + msg349784
2019-08-14 10:22:48  vstinner           set     messages: + msg349670
2019-08-14 06:06:48  Greg Price         set     messages: + msg349650
2019-08-14 02:24:36  benjamin.peterson  set     messages: + msg349629
2019-08-13 20:22:31  Greg Price         set     messages: + msg349616; versions: + Python 3.9, - Python 3.8
2019-08-13 20:20:57  Greg Price         set     messages: + msg349614
2019-08-13 12:25:04  vstinner           set     messages: + msg349547
2019-08-13 12:23:29  vstinner           set     messages: + msg349546
2019-08-13 07:56:26  Greg Price         set     nosy: + Greg Price
2018-04-13 16:57:27  mcepl              set     nosy: + mcepl
2018-02-05 09:13:41  serhiy.storchaka   set     nosy: + serhiy.storchaka; messages: + msg311655; components: + Interpreter Core
2018-02-05 02:19:59  benjamin.peterson  create