classification
Title: Prepare for removing the legacy Unicode C API
Type: Stage: patch review
Components: Interpreter Core, Unicode Versions: Python 3.8
process
Status: open Resolution:
Dependencies: 36387 Superseder:
Assigned To: Nosy List: ezio.melotti, inada.naoki, lemburg, pitrou, ronaldoussoren, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-03-18 14:23 by serhiy.storchaka, last changed 2019-04-10 16:49 by vstinner.

Pull Requests
URL Status Linked Edit
PR 12409 open serhiy.storchaka, 2019-03-18 14:27
Messages (12)
msg338228 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-18 14:23
The legacy Unicode C API was deprecated in 3.3. Its support consumes resources: more memory usage by Unicode objects, additional code for handling Unicode objects created with the legacy C API. Currently every Unicode object has a cache for the wchar_t representation.

The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE. Both are set to 1 by default.

If USE_UNICODE_WCHAR_CACHE is set to 0, CPython will not use the wchar_t cache internally. The new wchar_t based C API will be used instead of the Py_UNICODE based C API. This can add small performance penalty for creating a temporary buffer for the wchar_t representation. On other hand, this will decrease the long-term memory usage. This build is binary compatible with the standard build and third-party extensions can use the legacy Unicode C API.

If HAVE_UNICODE_WCHAR_CACHE is set to 0, the wchar_t cache will be completely removed. The legacy Unicode C API will be not available, and functions that need it (e.g. PyArg_ParseTuple() with the "u" format unit) will always fail. This build is binary incompatible with the standard build if you use the legacy or non-stable Unicode C API.

I hope that these options will help third-party projects to prepare for removing the legacy Unicode C API in future.
msg338284 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 20:05
Thanks for implementing this, Serhiy.
Since these C macros are public, should they be named PY_* ?
msg338285 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 20:46
I think this is a good preparation that makes it clear what code will eventually be removed, and allows testing without it.

No idea how happy Windows users will be about all of this, but I consider it quite an overall improvement for the Unicode implementation. Once this gets removed, that is.

Removing the "unicode_internal" codec entirely (which is changed by this PR) is discussed in issue36297.
msg338286 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-03-18 21:06
I'd change the title of this bpo item to "Prepare for removing the whcar_t caching in the Unicode C API".

Note that the wchar_t caching was put in place to allow for external applications and C code to easily and efficiently interface with Python. By removing it you will slow down such code significantly, esp. on Linux and Windows where wchar_t code is fairly common (one of the reasons we added UCS4 in Python was to make the interaction with Linux wchar_t code more efficient).

This should be clearly mentioned as part of the change and the compile time flags.


BTW: You have a few other changes in the PR which don't have anything to do with the intended removal:

-    envsize = PySequence_Fast_GET_SIZE(keys);
-    if (PySequence_Fast_GET_SIZE(values) != envsize) {
+    envsize = PyList_GET_SIZE(keys);
+    if (PyList_GET_SIZE(values) != envsize) {
msg338289 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 21:33
I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.
msg338290 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-03-18 21:53
On 18.03.2019 22:33, Stefan Behnel wrote:
> 
> I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

I don't think changing sequence iteration to list iteration only
is something that should be hidden in a wchar_t removal PR.

My guess is that these changes have made it into the PR by mistake.
They deserve a separate PR and discussion.
msg338331 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-19 08:58
I'm not sure we need two options.
Does USE_UNICODE_WCHAR_CACHE=0 really helps preparing to the removal?
msg338340 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-19 10:45
I wrote this PR just to see how much code should be changed after removing the wchar_t cache, and what be performance impact. Get it, experiment with it, run tests and benchmarks. I think we could set USE_UNICODE_WCHAR_CACHE to 0 by default. If this will cause significant troubles, it is easy to set it to 1.

I am going to add configure options for switching these options. On Windows you will still need to edit the config file manually.

> I'm not sure we need two options.
> Does USE_UNICODE_WCHAR_CACHE=0 really helps preparing to the removal?

Currently some of the legacy functions are not decorated with Py_DEPRECATED, because this would cause compiler warnings in the code that uses these functions. If USE_UNICODE_WCHAR_CACHE is 0, these functions will no longer used, so we can add compiler warnings for them.

> I don't think changing sequence iteration to list iteration only
> is something that should be hidden in a wchar_t removal PR.

getenvironment() is the function that has been rewritten to the new API without preserving the old variant. Since the code was rewritten so much, I performed some code clean up. PyMapping_Keys() and PyMapping_Values() always return a list now, so that using the PySequence_Fast API is superfluous. They could return a tuple in the past, but this provoked bugs because the user code used PyList API for it.

I'll open a separate issue for this.

> Since these C macros are public, should they be named PY_* ?

CPython configuration macros (like HAVE_ACOSH or USE_COMPUTED_GOTOS) do not have the PY_ prefix.
msg338343 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-19 11:21
FYI, I had created PR 12340 which removes use of deprecated API in ctypes.
msg338344 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2019-03-19 11:47
One thing to keep in mind: HAVE_UNICODE_WCHAR_CACHE == 1 and HAVE_UNICODE_WCHAR_CACHE == 0 have a different ABI due to a different struct layout. This should probably affect the ABI tag for extension modules.
msg338565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-03-21 19:39
> The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE

I don't think this is a good approach.  Most projects and developers don't recompile Python.  It's especially a chore when you have many dependencies with C extensions, because you'll have to recompile them all as well.

I would recommend simply removing that cache.
msg339860 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-04-10 12:57
I think these ABI incompatible options are used many people.
But it is helpful to find extensions which using legacy APIs before Python 3.10 is released.

I had found ujson and MarkupSafe used legacy APIs.  I fixed MarkupSafe.
I don't care ujson because it is wrapper of wchar_t based C library
and there are enough json libraries.

I suppose there are some other packages in PyPI, but I'm not sure.
History
Date User Action Args
2019-04-10 16:49:37vstinnersetnosy: - vstinner
2019-04-10 12:57:16inada.naokisetmessages: + msg339860
2019-03-21 19:39:00pitrousetnosy: + pitrou
messages: + msg338565
2019-03-21 06:40:17serhiy.storchakasetdependencies: + Refactor getenvironment() in _winapi.c
2019-03-19 11:47:07ronaldoussorensetnosy: + ronaldoussoren
messages: + msg338344
2019-03-19 11:21:51inada.naokisetmessages: + msg338343
2019-03-19 10:45:21serhiy.storchakasetmessages: + msg338340
2019-03-19 08:58:10inada.naokisetmessages: + msg338331
2019-03-18 21:53:01lemburgsetmessages: + msg338290
2019-03-18 21:33:35scodersetmessages: + msg338289
2019-03-18 21:06:40lemburgsetnosy: + lemburg
messages: + msg338286
2019-03-18 20:46:26scodersetmessages: + msg338285
2019-03-18 20:05:26scodersetmessages: + msg338284
2019-03-18 20:01:02scodersetnosy: + scoder
2019-03-18 14:27:35serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request12363
2019-03-18 14:23:01serhiy.storchakacreate