classification
Title: Prepare for removing the legacy Unicode C API
Type:
Stage: patch review
Components: C API, Interpreter Core, Unicode
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies: 36387
Superseder:
Assigned To:
Nosy List: corona10, ezio.melotti, lemburg, methane, miss-islington, pitrou, ronaldoussoren, scoder, serhiy.storchaka, shihai1991, vstinner
Priority: normal
Keywords: patch

Created on 2019-03-18 14:23 by serhiy.storchaka, last changed 2020-08-05 01:57 by miss-islington.

Pull Requests
URL Status Linked Edit
PR 12409 open serhiy.storchaka, 2019-03-18 14:27
PR 19653 merged methane, 2020-04-22 13:14
PR 20878 merged methane, 2020-06-15 01:15
PR 20879 open methane, 2020-06-15 01:46
PR 20927 open methane, 2020-06-17 05:12
PR 20932 closed miss-islington, 2020-06-17 11:10
PR 20933 merged methane, 2020-06-17 11:33
PR 20934 merged methane, 2020-06-17 12:31
PR 20941 merged methane, 2020-06-17 15:06
PR 21223 merged serhiy.storchaka, 2020-06-29 20:27
PR 21336 merged serhiy.storchaka, 2020-07-05 15:18
PR 21429 merged serhiy.storchaka, 2020-07-10 07:36
PR 21437 merged serhiy.storchaka, 2020-07-10 15:47
PR 21479 merged methane, 2020-07-15 05:11
PR 21738 merged miss-islington, 2020-08-05 01:49
PR 21739 merged miss-islington, 2020-08-05 01:49
Messages (30)
msg338228 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-18 14:23
The legacy Unicode C API was deprecated in 3.3. Supporting it consumes resources: Unicode objects use more memory, and additional code is needed to handle Unicode objects created with the legacy C API. Currently every Unicode object has a cache for the wchar_t representation.

The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE. Both are set to 1 by default.

If USE_UNICODE_WCHAR_CACHE is set to 0, CPython will not use the wchar_t cache internally. The new wchar_t based C API will be used instead of the Py_UNICODE based C API. This can add a small performance penalty for creating a temporary buffer for the wchar_t representation. On the other hand, this will decrease the long-term memory usage. This build is binary compatible with the standard build, and third-party extensions can still use the legacy Unicode C API.

If HAVE_UNICODE_WCHAR_CACHE is set to 0, the wchar_t cache will be completely removed. The legacy Unicode C API will not be available, and functions that need it (e.g. PyArg_ParseTuple() with the "u" format unit) will always fail. This build is binary incompatible with the standard build if you use the legacy or non-stable Unicode C API.

I hope that these options will help third-party projects to prepare for removing the legacy Unicode C API in future.
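
The trade-off described for USE_UNICODE_WCHAR_CACHE=0 (a temporary wchar_t buffer built per call instead of a cached one kept for the object's lifetime) can be sketched in plain C. This is only an illustration: demo_str and demo_as_wide_copy() are hypothetical names, not CPython types or functions, and the struct is not the real PyUnicodeObject layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for a string object that stores UCS4 code points.
   It is NOT the CPython PyUnicodeObject layout. */
typedef struct {
    size_t length;
    const uint32_t *data;
} demo_str;

/* Build a fresh, NUL-terminated wchar_t copy on every call (no cache).
   The caller owns the result and must free() it.
   Assumption: wchar_t is wide enough for the code points used. With a
   16-bit wchar_t, code points above 0xFFFF would need surrogate pairs,
   which this sketch does not handle. */
wchar_t *
demo_as_wide_copy(const demo_str *s)
{
    wchar_t *buf;
    size_t i;

    assert(s != NULL);
    buf = malloc((s->length + 1) * sizeof(wchar_t));
    if (buf == NULL) {
        return NULL;
    }
    for (i = 0; i < s->length; i++) {
        buf[i] = (wchar_t)s->data[i];
    }
    buf[s->length] = L'\0';
    return buf;
}
```

Each call pays for one allocation and one copy, but nothing stays attached to the object afterwards, which is the memory saving the message refers to.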
msg338284 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 20:05
Thanks for implementing this, Serhiy.
Since these C macros are public, should they be named PY_* ?
msg338285 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 20:46
I think this is a good preparation that makes it clear what code will eventually be removed, and allows testing without it.

No idea how happy Windows users will be about all of this, but I consider it quite an overall improvement for the Unicode implementation. Once this gets removed, that is.

Removing the "unicode_internal" codec entirely (which is changed by this PR) is discussed in issue36297.
msg338286 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-03-18 21:06
I'd change the title of this bpo item to "Prepare for removing the wchar_t caching in the Unicode C API".

Note that the wchar_t caching was put in place to allow for external applications and C code to easily and efficiently interface with Python. By removing it you will slow down such code significantly, esp. on Linux and Windows where wchar_t code is fairly common (one of the reasons we added UCS4 in Python was to make the interaction with Linux wchar_t code more efficient).

This should be clearly mentioned as part of the change and the compile time flags.


BTW: You have a few other changes in the PR which don't have anything to do with the intended removal:

-    envsize = PySequence_Fast_GET_SIZE(keys);
-    if (PySequence_Fast_GET_SIZE(values) != envsize) {
+    envsize = PyList_GET_SIZE(keys);
+    if (PyList_GET_SIZE(values) != envsize) {
msg338289 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-03-18 21:33
I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.
msg338290 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-03-18 21:53
On 18.03.2019 22:33, Stefan Behnel wrote:
> 
> I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

I don't think changing sequence iteration to list iteration only
is something that should be hidden in a wchar_t removal PR.

My guess is that these changes have made it into the PR by mistake.
They deserve a separate PR and discussion.
msg338331 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-03-19 08:58
I'm not sure we need two options.
Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?
msg338340 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-19 10:45
I wrote this PR just to see how much code would have to change after removing the wchar_t cache, and what the performance impact would be. Get it, experiment with it, run tests and benchmarks. I think we could set USE_UNICODE_WCHAR_CACHE to 0 by default. If this causes significant trouble, it is easy to set it back to 1.

I am going to add configure options for switching these options. On Windows you will still need to edit the config file manually.

> I'm not sure we need two options.
> Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?

Currently some of the legacy functions are not decorated with Py_DEPRECATED, because this would cause compiler warnings in the code that uses these functions. If USE_UNICODE_WCHAR_CACHE is 0, these functions will no longer be used, so we can add compiler warnings for them.
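
As a rough sketch of the mechanism behind such compiler warnings: a declaration can carry a compiler-specific deprecation attribute, so that every caller gets a warning at build time. DEMO_DEPRECATED, demo_legacy_size() and demo_size() below are hypothetical names chosen for illustration; this is not the CPython definition of Py_DEPRECATED.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro modelled on the idea behind Py_DEPRECATED.
   It expands to a compiler-specific attribute, or to nothing when the
   compiler is not recognised. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_DEPRECATED __attribute__((deprecated))
#elif defined(_MSC_VER)
#  define DEMO_DEPRECATED __declspec(deprecated)
#else
#  define DEMO_DEPRECATED
#endif

/* Replacement entry point: calling it produces no warning. */
size_t
demo_size(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Legacy entry point: any code that calls it gets a compile-time warning. */
DEMO_DEPRECATED size_t demo_legacy_size(const char *s);

size_t
demo_legacy_size(const char *s)
{
    return demo_size(s);
}
```

This also shows why the attribute is awkward while the interpreter itself still calls the legacy functions: every internal call site would warn too.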

> I don't think changing sequence iteration to list iteration only
> is something that should be hidden in a wchar_t removal PR.

getenvironment() is the function that has been rewritten to the new API without preserving the old variant. Since the code was rewritten so much, I performed some code clean up. PyMapping_Keys() and PyMapping_Values() always return a list now, so that using the PySequence_Fast API is superfluous. They could return a tuple in the past, but this provoked bugs because the user code used PyList API for it.

I'll open a separate issue for this.

> Since these C macros are public, should they be named PY_* ?

CPython configuration macros (like HAVE_ACOSH or USE_COMPUTED_GOTOS) do not have the PY_ prefix.
msg338343 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-03-19 11:21
FYI, I had created PR 12340 which removes use of deprecated API in ctypes.
msg338344 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2019-03-19 11:47
One thing to keep in mind: HAVE_UNICODE_WCHAR_CACHE == 1 and HAVE_UNICODE_WCHAR_CACHE == 0 have a different ABI due to a different struct layout. This should probably affect the ABI tag for extension modules.
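
A minimal sketch of the layout concern, using hypothetical structs rather than the real PyUnicodeObject definitions: dropping one pointer member changes both the object size and the offsets of the members that follow it, so code compiled against one layout reads the wrong bytes from the other.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical layout WITH a cached wchar_t pointer. */
typedef struct {
    size_t length;
    wchar_t *wstr;      /* cached wchar_t representation */
    void *data;
} demo_str_with_cache;

/* Hypothetical layout WITHOUT the cache member. */
typedef struct {
    size_t length;
    void *data;
} demo_str_without_cache;

/* Number of extra bytes the cache member costs per object. */
size_t
demo_cache_overhead(void)
{
    assert(sizeof(demo_str_with_cache) >= sizeof(demo_str_without_cache));
    return sizeof(demo_str_with_cache) - sizeof(demo_str_without_cache);
}

/* Nonzero when 'data' sits at a different offset in the two layouts. */
int
demo_data_offset_differs(void)
{
    return offsetof(demo_str_with_cache, data)
           != offsetof(demo_str_without_cache, data);
}
```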
msg338565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-03-21 19:39
> The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE

I don't think this is a good approach.  Most projects and developers don't recompile Python.  It's especially a chore when you have many dependencies with C extensions, because you'll have to recompile them all as well.

I would recommend simply removing that cache.
msg339860 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-10 12:57
I think these ABI incompatible options are used by many people.
But it is helpful to find extensions that use legacy APIs before Python 3.10 is released.

I found that ujson and MarkupSafe used legacy APIs.  I fixed MarkupSafe.
I don't care about ujson because it is a wrapper of a wchar_t based C library
and there are enough json libraries.

I suppose there are some other packages in PyPI, but I'm not sure.
msg355535 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-28 11:35
I closed bpo-38604 as a duplicate. Copy of my messages.

 msg355475 - (view) 	Author: STINNER Victor (vstinner) * (Python committer) 	Date: 2019-10-27 16:02

Python 3.3 deprecated the C API functions using Py_UNICODE type. Examples in the doc:

* https://docs.python.org/dev/c-api/unicode.html#c.Py_UNICODE
* https://docs.python.org/dev/c-api/unicode.html#deprecated-py-unicode-apis

Currently, removal of these functions is scheduled for Python 4.0, but I would prefer that Python 4.0 not have a longer list of removed features than usual. So I'm trying to remove a few functions in Python 3.9 and to prepare the removal of the others.

Py_UNICODE C API was mostly kept for backward compatibility with Python 2. Since Python 2 support ends at the end of the year, can we start to organize Py_UNICODE C API removal?

There are multiple questions:

* Should we drop the whole API at once? Or can we/should we start by removing a few functions, and then the others?
* Deprecation warnings are emitted at compilation, but I'm not aware of any DeprecationWarning emitted at runtime. IMHO we should emit DeprecationWarning at runtime for at least one release, since most developers ignore compilation warnings.

I propose to:

* (Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.
* Modify C code to emit DeprecationWarning at runtime in Python 3.9
* Experiment with a modified Python without these APIs and test how many projects are broken by this removal: see PEP 608
* Schedule the actual removal of all these APIs from Python 3.10

Honestly, if the removal causes too many issues, I'm fine with slowing it down. It's just a matter of clearly communicating our intent.

Maybe we should also announce the scheduled removal in What's New in Python 3.9 and on the capi-sig mailing list.

msg355478 - (view) 	Author: STINNER Victor (vstinner) * (Python committer) 	Date: 2019-10-27 16:15

> (Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.

I searched "4.0" in the documentation:

* Py_UNICODE type
* array.array: "u" type
* PyArg_ParseTuple, Py_BuildValue: "u", "u#", "Z", "Z#" formats

* PyUnicode_FromUnicode()
* PyUnicode_GetSize(), PyUnicode_GET_SIZE()
* PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA()
* PyUnicode_AsUnicodeAndSize()
* PyUnicode_AsUnicodeCopy()

* PyUnicode_FromObject()
* PyLong_FromUnicode()
* PyUnicode_TransformDecimalToASCII()

* PyUnicode_Encode()
* PyUnicode_EncodeUTF7()
* PyUnicode_EncodeUTF8()
* PyUnicode_EncodeUTF32()
* PyUnicode_EncodeUTF16()
* PyUnicode_EncodeUnicodeEscape()
* PyUnicode_EncodeRawUnicodeEscape()
* PyUnicode_EncodeLatin1()
* PyUnicode_EncodeASCII()
* PyUnicode_EncodeMBCS()
* PyUnicode_EncodeCharmap()
* PyUnicode_TranslateCharmap()

msg355524 - (view) 	Author: STINNER Victor (vstinner) * (Python committer) 	Date: 2019-10-28 11:06

A preliminary step was to modify PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() to remove the internal caching: it has been done in Python 3.8.0 with bpo-30863.
msg368615 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-05-11 06:37
New changeset d5d9a718662e67e2b1ac7874dda9df2d8d71d415 by Inada Naoki in branch 'master':
bpo-36346: array: Don't use deprecated APIs (GH-19653)
https://github.com/python/cpython/commit/d5d9a718662e67e2b1ac7874dda9df2d8d71d415
msg368653 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-11 21:18
> bpo-36346: array: Don't use deprecated APIs (GH-19653)

Thanks INADA-san! Another nail in the Py_UNICODE coffin!
msg371730 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-17 11:09
New changeset 2c4928d37edc5e4aeec3c0b79fa3460b1ec9b60d by Inada Naoki in branch 'master':
bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
https://github.com/python/cpython/commit/2c4928d37edc5e4aeec3c0b79fa3460b1ec9b60d
msg371731 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-06-17 12:01
> bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)

This change broke test_distutils on multiple buildbots. Examples:

* https://buildbot.python.org/all/#builders/6/builds/1311
* https://buildbot.python.org/all/#builders/639/builds/729
msg371734 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-17 12:27
Oh, why can't I use C99?

```
/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h: In function ‘Py_UNICODE_FILL’:
/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (Py_ssize_t i = 0; i < length; i++) {
     ^
/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: note: use option -std=c99 or -std=gnu99 to compile your code
```
msg371735 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-06-17 12:30
> Oh, why can't I use C99?

PEP 7 requires C99 to build Python, but I think that we can try to keep C89 compatibility for the public header files (Python C API).
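
For illustration, the construct rejected in the buildbot log can be written in a C89-compatible way by declaring the loop counter at the top of the block. The sketch below is in the spirit of Py_UNICODE_FILL but is not the CPython code; demo_fill() is a hypothetical name, and it is an ordinary function because `inline` is itself not part of C89.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical fill helper, written so that it also compiles in C89 mode:
   the counter is declared before any statement instead of inside the
   'for' initialiser. */
void
demo_fill(wchar_t *target, wchar_t value, size_t length)
{
    size_t i;   /* C89: declarations must precede statements */

    assert(target != NULL || length == 0);
    for (i = 0; i < length; i++) {
        target[i] = value;
    }
}
```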
msg371745 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-17 14:43
New changeset 8e34e92caa73259620dd242b92d26edd0949b4ba by Inada Naoki in branch 'master':
bpo-36346: Make unicodeobject.h C89 compatible (GH-20934)
https://github.com/python/cpython/commit/8e34e92caa73259620dd242b92d26edd0949b4ba
msg371795 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-18 08:31
New changeset 610a60c601fb4380eee30e15be1cd4dcbdaeec4c by Inada Naoki in branch '3.9':
bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
https://github.com/python/cpython/commit/610a60c601fb4380eee30e15be1cd4dcbdaeec4c
msg372656 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-06-30 06:03
New changeset 349f76c6aace5a4a2b57f6b442a532faf0027d6b by Serhiy Storchaka in branch 'master':
bpo-36346: Prepare for removing the legacy Unicode C API (AC only). (GH-21223)
https://github.com/python/cpython/commit/349f76c6aace5a4a2b57f6b442a532faf0027d6b
msg372658 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-30 06:27
New changeset 038dd0f79dc89566b01ba66a5a018266b2917a19 by Inada Naoki in branch 'master':
bpo-36346: Raise DeprecationWarning when creating legacy Unicode (GH-20933)
https://github.com/python/cpython/commit/038dd0f79dc89566b01ba66a5a018266b2917a19
msg373032 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-05 15:13
There is no need to deprecate _PyUnicode_AsUnicode. It is a private function. Undeprecating it will make the code clearer.
msg373035 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-05 15:53
New changeset b3dd5cd4a36877c473417fd7b3358843dcf8e647 by Serhiy Storchaka in branch 'master':
bpo-36346: Undeprecate private function _PyUnicode_AsUnicode(). (GH-21336)
https://github.com/python/cpython/commit/b3dd5cd4a36877c473417fd7b3358843dcf8e647
msg373450 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-10 08:17
New changeset d878349bac6c154fbfeffe7d4b38e2ddb833f135 by Serhiy Storchaka in branch 'master':
bpo-36346: Do not use legacy Unicode C API in ctypes. (#21429)
https://github.com/python/cpython/commit/d878349bac6c154fbfeffe7d4b38e2ddb833f135
msg373478 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-10 20:26
New changeset 4c8f09d7cef8c7aa07d5b5232b5b64f63819a743 by Serhiy Storchaka in branch 'master':
bpo-36346: Make using the legacy Unicode C API optional (GH-21437)
https://github.com/python/cpython/commit/4c8f09d7cef8c7aa07d5b5232b5b64f63819a743
msg374855 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-08-05 01:49
New changeset 270b4ad4df795783d417ba15080da8f95e598689 by Inada Naoki in branch 'master':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
https://github.com/python/cpython/commit/270b4ad4df795783d417ba15080da8f95e598689
msg374856 - (view) Author: miss-islington (miss-islington) Date: 2020-08-05 01:56
New changeset ea680631b478f091a171dc802d861f5014f58c8f by Miss Islington (bot) in branch '3.9':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
https://github.com/python/cpython/commit/ea680631b478f091a171dc802d861f5014f58c8f
msg374857 - (view) Author: miss-islington (miss-islington) Date: 2020-08-05 01:57
New changeset f0e030cacb940f061e0b09efbffc2fd984b95259 by Miss Islington (bot) in branch '3.8':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
https://github.com/python/cpython/commit/f0e030cacb940f061e0b09efbffc2fd984b95259
History
Date User Action Args
2020-08-05 01:57:13 miss-islington set messages: + msg374857
2020-08-05 01:56:14 miss-islington set messages: + msg374856
2020-08-05 01:49:27 miss-islington set pull_requests: + pull_request20885
2020-08-05 01:49:18 methane set messages: + msg374855
2020-08-05 01:49:17 miss-islington set pull_requests: + pull_request20884
2020-07-15 05:11:20 methane set pull_requests: + pull_request20622
2020-07-10 20:26:13 serhiy.storchaka set messages: + msg373478
2020-07-10 15:47:51 serhiy.storchaka set pull_requests: + pull_request20584
2020-07-10 08:17:25 serhiy.storchaka set messages: + msg373450
2020-07-10 07:36:25 serhiy.storchaka set pull_requests: + pull_request20576
2020-07-05 15:53:55 serhiy.storchaka set messages: + msg373035
2020-07-05 15:18:07 serhiy.storchaka set pull_requests: + pull_request20484
2020-07-05 15:13:29 serhiy.storchaka set messages: + msg373032
2020-06-30 06:27:03 methane set messages: + msg372658
2020-06-30 06:03:22 serhiy.storchaka set messages: + msg372656
2020-06-29 20:27:28 serhiy.storchaka set pull_requests: + pull_request20375
2020-06-18 08:31:23 methane set messages: + msg371795
2020-06-17 15:06:04 methane set pull_requests: + pull_request20122
2020-06-17 14:48:26 shihai1991 set nosy: + shihai1991
2020-06-17 14:43:09 methane set messages: + msg371745
2020-06-17 12:31:17 methane set pull_requests: + pull_request20113
2020-06-17 12:30:50 vstinner set messages: + msg371735
2020-06-17 12:27:49 methane set messages: + msg371734
2020-06-17 12:01:03 vstinner set messages: + msg371731
2020-06-17 11:33:56 methane set pull_requests: + pull_request20112
2020-06-17 11:10:11 miss-islington set nosy: + miss-islington; pull_requests: + pull_request20111
2020-06-17 11:09:48 methane set messages: + msg371730
2020-06-17 05:12:43 methane set pull_requests: + pull_request20106
2020-06-15 01:46:48 methane set pull_requests: + pull_request20066
2020-06-15 01:15:16 methane set pull_requests: + pull_request20065
2020-05-11 21:18:31 vstinner set messages: + msg368653
2020-05-11 06:37:32 methane set messages: + msg368615
2020-04-22 16:08:32 corona10 set nosy: + corona10
2020-04-22 13:14:06 methane set pull_requests: + pull_request18979
2019-12-09 16:12:46 vstinner set components: + C API
2019-10-28 11:35:14 vstinner set nosy: + vstinner; messages: + msg355535
2019-10-28 11:34:46 vstinner link issue38604 superseder
2019-04-10 16:49:37 vstinner set nosy: - vstinner
2019-04-10 12:57:16 methane set messages: + msg339860
2019-03-21 19:39:00 pitrou set nosy: + pitrou; messages: + msg338565
2019-03-21 06:40:17 serhiy.storchaka set dependencies: + Refactor getenvironment() in _winapi.c
2019-03-19 11:47:07 ronaldoussoren set nosy: + ronaldoussoren; messages: + msg338344
2019-03-19 11:21:51 methane set messages: + msg338343
2019-03-19 10:45:21 serhiy.storchaka set messages: + msg338340
2019-03-19 08:58:10 methane set messages: + msg338331
2019-03-18 21:53:01 lemburg set messages: + msg338290
2019-03-18 21:33:35 scoder set messages: + msg338289
2019-03-18 21:06:40 lemburg set nosy: + lemburg; messages: + msg338286
2019-03-18 20:46:26 scoder set messages: + msg338285
2019-03-18 20:05:26 scoder set messages: + msg338284
2019-03-18 20:01:02 scoder set nosy: + scoder
2019-03-18 14:27:35 serhiy.storchaka set keywords: + patch; stage: patch review; pull_requests: + pull_request12363
2019-03-18 14:23:01 serhiy.storchaka create