Prepare for removing the legacy Unicode C API #80527

Closed
serhiy-storchaka opened this issue Mar 18, 2019 · 36 comments
Labels
3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-C-API topic-unicode

Comments

@serhiy-storchaka
Member

serhiy-storchaka commented Mar 18, 2019

BPO 36346
Nosy @malemburg, @ronaldoussoren, @pitrou, @scoder, @vstinner, @ezio-melotti, @methane, @serhiy-storchaka, @willingc, @corona10, @miss-islington, @shihai1991, @iritkatriel
PRs
  • bpo-36346: Prepare for removing the legacy Unicode C API. #12409
  • bpo-36346: array: Don't use deprecated APIs #19653
  • bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs #20878
  • bpo-36346: Document removal schedule of deprecate APIs #20879
  • bpo-36346: Emit DeprecationWarning for PyArg_Parse() with 'u' or 'Z'. #20927
  • [3.9] bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878) #20932
  • bpo-36346: Raise DeprecationWarning when creating legacy Unicode #20933
  • bpo-36346: Make unicodeobject.h C89 compatible #20934
  • [3.9] bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878) #20941
  • bpo-36346: Prepare for removing the legacy Unicode C API (AC only). #21223
  • bpo-36346: Undeprecate private function _PyUnicode_AsUnicode(). #21336
  • bpo-36346: Do not use legacy Unicode C API in ctypes. #21429
  • bpo-36346: Make using the legacy Unicode C API optional #21437
  • bpo-36346: Doc: Update removal schedule of legacy Unicode #21479
  • [3.9] bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479) #21738
  • [3.8] bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479) #21739
  • [3.9] bpo-36346: Document removal schedule of deprecate APIs (GH-20879) #24625
  • [3.8] bpo-36346: Document removal schedule of deprecate APIs (GH-20879) #24626
Dependencies
  • bpo-36387: Refactor getenvironment() in _winapi.c

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-29.03:10:20.373>
    created_at = <Date 2019-03-18.14:23:01.818>
    labels = ['interpreter-core', 'expert-C-API', '3.8', 'expert-unicode']
    title = 'Prepare for removing the legacy Unicode C API'
    updated_at = <Date 2022-01-29.03:10:20.372>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2022-01-29.03:10:20.372>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-29.03:10:20.373>
    closer = 'methane'
    components = ['Interpreter Core', 'Unicode', 'C API']
    creation = <Date 2019-03-18.14:23:01.818>
    creator = 'serhiy.storchaka'
    dependencies = ['36387']
    files = []
    hgrepos = []
    issue_num = 36346
    keywords = ['patch']
    message_count = 36.0
    messages = ['338228', '338284', '338285', '338286', '338289', '338290', '338331', '338340', '338343', '338344', '338565', '339860', '355535', '368615', '368653', '371730', '371731', '371734', '371735', '371745', '371795', '372656', '372658', '373032', '373035', '373450', '373478', '374855', '374856', '374857', '387513', '387545', '387546', '387828', '412025', '412047']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'ronaldoussoren', 'pitrou', 'scoder', 'vstinner', 'ezio.melotti', 'methane', 'serhiy.storchaka', 'willingc', 'corona10', 'miss-islington', 'shihai1991', 'iritkatriel']
    pr_nums = ['12409', '19653', '20878', '20879', '20927', '20932', '20933', '20934', '20941', '21223', '21336', '21429', '21437', '21479', '21738', '21739', '24625', '24626']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue36346'
    versions = ['Python 3.8']


    @serhiy-storchaka
    Member Author

    The legacy Unicode C API was deprecated in 3.3. Its support consumes resources: more memory usage by Unicode objects, additional code for handling Unicode objects created with the legacy C API. Currently every Unicode object has a cache for the wchar_t representation.

    The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE. Both are set to 1 by default.

    If USE_UNICODE_WCHAR_CACHE is set to 0, CPython will not use the wchar_t cache internally. The new wchar_t based C API will be used instead of the Py_UNICODE based C API. This can add a small performance penalty for creating a temporary buffer for the wchar_t representation. On the other hand, this will decrease the long-term memory usage. This build is binary compatible with the standard build and third-party extensions can use the legacy Unicode C API.

    If HAVE_UNICODE_WCHAR_CACHE is set to 0, the wchar_t cache will be completely removed. The legacy Unicode C API will not be available, and functions that need it (e.g. PyArg_ParseTuple() with the "u" format unit) will always fail. This build is binary incompatible with the standard build if you use the legacy or non-stable Unicode C API.
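As a rough illustration of how two such compile-time switches could gate a per-object cache, here is a minimal standalone C sketch. Only the macro names `HAVE_UNICODE_WCHAR_CACHE` and `USE_UNICODE_WCHAR_CACHE` come from the proposal; the `demo_string` type and the `demo_*` functions are hypothetical stand-ins and are not CPython code.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Both switches default to 1, matching the defaults described above. */
#ifndef HAVE_UNICODE_WCHAR_CACHE
#  define HAVE_UNICODE_WCHAR_CACHE 1
#endif
#ifndef USE_UNICODE_WCHAR_CACHE
#  define USE_UNICODE_WCHAR_CACHE 1
#endif
#if USE_UNICODE_WCHAR_CACHE && !HAVE_UNICODE_WCHAR_CACHE
#  error "USE_UNICODE_WCHAR_CACHE requires HAVE_UNICODE_WCHAR_CACHE"
#endif

/* Hypothetical stand-in for a string object. */
typedef struct {
    const wchar_t *text;
#if HAVE_UNICODE_WCHAR_CACHE
    const wchar_t *wstr_cache;  /* one extra pointer per object */
#endif
} demo_string;

/* Stand-in for a legacy accessor that hands out the cached buffer.
   When the cache is compiled out, the call always fails (returns NULL). */
static const wchar_t *
demo_legacy_buffer(demo_string *s)
{
    assert(s != NULL);
#if HAVE_UNICODE_WCHAR_CACHE
    if (s->wstr_cache == NULL) {
        s->wstr_cache = s->text;  /* fill the cache lazily */
    }
    return s->wstr_cache;
#else
    return NULL;
#endif
}

/* Reports whether internal code is allowed to rely on the cache. */
static int
demo_internal_uses_cache(void)
{
    return USE_UNICODE_WCHAR_CACHE;
}
```

Compiling the same file with `-DUSE_UNICODE_WCHAR_CACHE=0` keeps the struct layout unchanged, while `-DHAVE_UNICODE_WCHAR_CACHE=0 -DUSE_UNICODE_WCHAR_CACHE=0` also drops the field, which is the layout change behind the binary incompatibility mentioned above.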

    I hope that these options will help third-party projects prepare for the removal of the legacy Unicode C API in the future.

    @serhiy-storchaka serhiy-storchaka added 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Mar 18, 2019
    @scoder
    Contributor

    scoder commented Mar 18, 2019

    Thanks for implementing this, Serhiy.
    Since these C macros are public, should they be named PY_* ?

    @scoder
    Contributor

    scoder commented Mar 18, 2019

    I think this is a good preparation that makes it clear what code will eventually be removed, and allows testing without it.

    No idea how happy Windows users will be about all of this, but I consider it quite an overall improvement for the Unicode implementation. Once this gets removed, that is.

    Removing the "unicode_internal" codec entirely (which is changed by this PR) is discussed in bpo-36297.

    @malemburg
    Member

    I'd change the title of this bpo item to "Prepare for removing the wchar_t caching in the Unicode C API".

    Note that the wchar_t caching was put in place to allow for external applications and C code to easily and efficiently interface with Python. By removing it you will slow down such code significantly, esp. on Linux and Windows where wchar_t code is fairly common (one of the reasons we added UCS4 in Python was to make the interaction with Linux wchar_t code more efficient).

    This should be clearly mentioned as part of the change and the compile time flags.

    BTW: You have a few other changes in the PR which don't have anything to do with the intended removal:

    -    envsize = PySequence_Fast_GET_SIZE(keys);
    -    if (PySequence_Fast_GET_SIZE(values) != envsize) {
    +    envsize = PyList_GET_SIZE(keys);
    +    if (PyList_GET_SIZE(values) != envsize) {

    @scoder
    Contributor

    scoder commented Mar 18, 2019

    I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

    @malemburg
    Member

    On 18.03.2019 22:33, Stefan Behnel wrote:

    I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

    I don't think changing sequence iteration to list iteration only
    is something that should be hidden in a wchar_t removal PR.

    My guess is that these changes have made it into the PR by mistake.
    They deserve a separate PR and discussion.

    @methane
    Member

    methane commented Mar 19, 2019

    I'm not sure we need two options.
    Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?

    @serhiy-storchaka
    Member Author

    I wrote this PR just to see how much code would have to change after removing the wchar_t cache, and what the performance impact would be. Get it, experiment with it, run tests and benchmarks. I think we could set USE_UNICODE_WCHAR_CACHE to 0 by default. If this causes significant trouble, it is easy to set it back to 1.

    I am going to add configure options for switching these options. On Windows you will still need to edit the config file manually.

    I'm not sure we need two options.
    Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?

    Currently some of the legacy functions are not decorated with Py_DEPRECATED, because this would cause compiler warnings in the code that uses these functions. If USE_UNICODE_WCHAR_CACHE is 0, these functions will no longer be used, so we can add compiler warnings for them.
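To make the point about compiler warnings concrete, here is a small standalone C sketch of a deprecation marker. `DEMO_DEPRECATED` and the two `demo_*` functions are hypothetical names chosen for illustration; the macro is only modeled on the general idea of Py_DEPRECATED and is not copied from CPython.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical marker: expands to a compiler-specific "deprecated"
   attribute, or to nothing on compilers this sketch does not know. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_DEPRECATED(version) __attribute__((__deprecated__))
#elif defined(_MSC_VER)
#  define DEMO_DEPRECATED(version) __declspec(deprecated)
#else
#  define DEMO_DEPRECATED(version)
#endif

/* Supported accessor: call sites compile cleanly. */
size_t demo_length(const wchar_t *s);

/* Legacy accessor: every call site gets a compile-time warning. */
DEMO_DEPRECATED(3.3) size_t demo_legacy_length(const wchar_t *s);

size_t
demo_length(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

size_t
demo_legacy_length(const wchar_t *s)
{
    return demo_length(s);  /* forwards to the supported accessor */
}
```

Any caller of `demo_legacy_length()` in the same code base would now be flagged at compile time, which is why a marker like this is easiest to add once the project's own code has stopped calling the legacy function.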

    I don't think changing sequence iteration to list iteration only
    is something that should be hidden in a wchar_t removal PR.

    getenvironment() is the function that has been rewritten to the new API without preserving the old variant. Since the code was rewritten so much, I performed some code clean up. PyMapping_Keys() and PyMapping_Values() always return a list now, so that using the PySequence_Fast API is superfluous. They could return a tuple in the past, but this provoked bugs because the user code used PyList API for it.

    I'll open a separate issue for this.

    Since these C macros are public, should they be named PY_* ?

    CPython configuration macros (like HAVE_ACOSH or USE_COMPUTED_GOTOS) do not have the PY_ prefix.

    @methane
    Member

    methane commented Mar 19, 2019

    FYI, I had created PR 12340 which removes use of deprecated API in ctypes.

    @ronaldoussoren
    Contributor

    One thing to keep in mind: HAVE_UNICODE_WCHAR_CACHE == 1 and HAVE_UNICODE_WCHAR_CACHE == 0 have a different ABI due to a different struct layout. This should probably affect the ABI tag for extension modules.
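The layout concern can be sketched in plain C. The structs below are hypothetical and much simpler than CPython's real Unicode object structs; they only show how dropping one pointer field shifts the offset of everything that follows it, so code compiled against one layout reads the wrong bytes from an object built with the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical header in a build that keeps the wchar_t cache. */
typedef struct {
    size_t length;
    long hash;
    unsigned int state;
    wchar_t *wstr;  /* cached wchar_t representation */
} demo_header_cached;

/* The same header in a build with the cache removed. */
typedef struct {
    size_t length;
    long hash;
    unsigned int state;
} demo_header_plain;

/* A field placed after the header, as in a derived struct. */
typedef struct { demo_header_cached base; size_t utf8_length; } demo_compact_cached;
typedef struct { demo_header_plain base; size_t utf8_length; } demo_compact_plain;

/* Reads a size_t at a byte offset, the way compiled extension code
   effectively does once a field access has been turned into an offset. */
static size_t
demo_read_field(const void *obj, size_t offset)
{
    size_t value;
    assert(obj != NULL);
    memcpy(&value, (const char *)obj + offset, sizeof value);
    return value;
}
```

Reading `utf8_length` from a `demo_compact_cached` object with `offsetof(demo_compact_plain, utf8_length)` lands on the cache pointer instead of the intended field, which is the kind of mismatch an ABI tag is meant to prevent.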

    @pitrou
    Member

    pitrou commented Mar 21, 2019

    The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE

    I don't think this is a good approach. Most projects and developers don't recompile Python. It's especially a chore when you have many dependencies with C extensions, because you'll have to recompile them all as well.

    I would recommend simply removing that cache.

    @methane
    Member

    methane commented Apr 10, 2019

    I think these ABI incompatible options are used by many people.
    But it is helpful for finding extensions that use legacy APIs before Python 3.10 is released.

    I found that ujson and MarkupSafe used legacy APIs. I fixed MarkupSafe.
    I don't care about ujson because it is a wrapper around a wchar_t based C library
    and there are enough JSON libraries.

    I suppose there are some other packages on PyPI, but I'm not sure.

    @vstinner
    Member

    I closed bpo-38604 as a duplicate. Copy of my messages.

    msg355475 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-27 16:02

    Python 3.3 deprecated the C API functions using Py_UNICODE type. Examples in the doc:

    Currently, removal of these functions is scheduled for Python 4.0, but I would prefer that Python 4.0 not have a longer list of removed features than usual. So I'm trying to remove a few functions in Python 3.9 and to prepare the removal of the others.

    Py_UNICODE C API was mostly kept for backward compatibility with Python 2. Since Python 2 support ends at the end of the year, can we start to organize Py_UNICODE C API removal?

    There are multiple questions:

    • Should we drop the whole API at once? Or can we/should we start by removing a few functions, and then the others?
    • Deprecation warnings are emitted at compilation. But I'm not aware of any DeprecationWarning emitted at runtime. IMHO we should emit DeprecationWarning at runtime during at least one release, since most developers ignore compilation warnings.

    I propose to:

    • (Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.
    • Modify C code to emit DeprecationWarning at runtime in Python 3.9
    • Experiment with a modified Python without these APIs and test how many projects are broken by this removal: see PEP-608
    • Schedule the actual removal of all these APIs from Python 3.10

    Honestly, if the removal causes too many issues, I'm fine with slowing down the removal. It's just a matter of clearly communicating our intent.

    Maybe we should also announce the scheduled removal in What's New in Python 3.9 and on the capi-sig mailing list.

    msg355478 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-27 16:15

    (Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.

    I searched "4.0" in the documentation:

    • Py_UNICODE type

    • array.array: "u" type

    • PyArg_ParseTuple, Py_BuildValue: "u", "u#", "Z", "Z#" formats

    • PyUnicode_FromUnicode()

    • PyUnicode_GetSize(), PyUnicode_GET_SIZE()

    • PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA()

    • PyUnicode_AsUnicodeAndSize()

    • PyUnicode_AsUnicodeCopy()

    • PyUnicode_FromObject()

    • PyLong_FromUnicode()

    • PyUnicode_TransformDecimalToASCII()

    • PyUnicode_Encode()

    • PyUnicode_EncodeUTF7()

    • PyUnicode_EncodeUTF8()

    • PyUnicode_EncodeUTF32()

    • PyUnicode_EncodeUTF16()

    • PyUnicode_EncodeUnicodeEscape()

    • PyUnicode_EncodeRawUnicodeEscape()

    • PyUnicode_EncodeLatin1()

    • PyUnicode_EncodeASCII()

    • PyUnicode_EncodeMBCS()

    • PyUnicode_EncodeCharmap()

    • PyUnicode_TranslateCharmap()

    msg355524 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-28 11:06

    A preliminary step was to modify PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() to remove the internal caching: it has been done in Python 3.8.0 with bpo-30863.
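Without a cache, a wchar_t view of a string has to be produced on demand into a temporary buffer. The standalone sketch below illustrates that pattern under simplifying assumptions: the input is an array of UCS4 code points, and `demo_as_wide_string()` is a hypothetical helper, not the CPython implementation. It returns a freshly allocated buffer that the caller must release, which mirrors how PyUnicode_AsWideCharString() hands back a buffer the caller frees with PyMem_Free().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts n UCS4 code points into a newly allocated, NUL-terminated
   wchar_t buffer. With a 2-byte wchar_t (as on Windows), code points
   above U+FFFF are written as surrogate pairs. Returns NULL on
   allocation failure; the caller frees the result with free(). */
static wchar_t *
demo_as_wide_string(const uint32_t *cp, size_t n, size_t *out_len)
{
    size_t i, need = 0, pos = 0;
    wchar_t *buf;

    assert(cp != NULL || n == 0);

    /* First pass: count the wchar_t units required. */
    for (i = 0; i < n; i++) {
        need += (sizeof(wchar_t) == 2 && cp[i] > 0xFFFF) ? 2 : 1;
    }

    buf = malloc((need + 1) * sizeof(wchar_t));
    if (buf == NULL) {
        return NULL;
    }

    /* Second pass: copy, splitting into surrogates where needed. */
    for (i = 0; i < n; i++) {
        if (sizeof(wchar_t) == 2 && cp[i] > 0xFFFF) {
            uint32_t v = cp[i] - 0x10000;
            buf[pos++] = (wchar_t)(0xD800 + (v >> 10));
            buf[pos++] = (wchar_t)(0xDC00 + (v & 0x3FF));
        }
        else {
            buf[pos++] = (wchar_t)cp[i];
        }
    }
    buf[pos] = L'\0';
    if (out_len != NULL) {
        *out_len = pos;
    }
    return buf;
}
```

The cost of this approach is one allocation and one copy per call, which is the small performance penalty traded against the memory saved by not keeping a cached copy on every object.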

    @methane
    Member

    methane commented May 11, 2020

    New changeset d5d9a71 by Inada Naoki in branch 'master':
    bpo-36346: array: Don't use deprecated APIs (GH-19653)
    d5d9a71

    @vstinner
    Member

    bpo-36346: array: Don't use deprecated APIs (GH-19653)

    Thanks INADA-san! Another nail into Py_UNICODE coffin!

    @methane
    Member

    methane commented Jun 17, 2020

    New changeset 2c4928d by Inada Naoki in branch 'master':
    bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
    2c4928d

    @vstinner
    Member

    bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)

    This change broke test_distutils on multiple buildbots. Examples:

    @methane
    Member

    methane commented Jun 17, 2020

    Oh, why can't I use C99?

    /home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h: In function ‘Py_UNICODE_FILL’:
    /home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
         for (Py_ssize_t i = 0; i < length; i++) {
         ^
    /home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: note: use option -std=c99 or -std=gnu99 to compile your code
    

    @vstinner
    Member

    Oh, why can't I use C99?

    PEP-7 requires C99 to build Python, but I think that we can try to keep C89 compatibility for the public header files (Python C API).
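The buildbot error quoted earlier came from a loop counter declared inside the `for` statement, which C89 compilers reject. A C89-friendly version hoists the declaration to the top of the block. The helper below is a hypothetical standalone sketch of that style (the name `demo_fill` is made up); the actual change is the one in GH-20934 and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Fills `length` elements of `target` with `value`.
   Written so that it also compiles in C89 mode: only block comments,
   and the loop counter is declared at the top of the block. */
static void
demo_fill(wchar_t *target, wchar_t value, size_t length)
{
    size_t i;

    assert(target != NULL || length == 0);
    for (i = 0; i < length; i++) {
        target[i] = value;
    }
}
```

Keeping public headers in this style lets extensions that are still built with `-std=c89` or an older default compiler mode include them without errors.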

    @methane
    Member

    methane commented Jun 17, 2020

    New changeset 8e34e92 by Inada Naoki in branch 'master':
    bpo-36346: Make unicodeobject.h C89 compatible (GH-20934)
    8e34e92

    @methane
    Member

    methane commented Jun 18, 2020

    New changeset 610a60c by Inada Naoki in branch '3.9':
    bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
    610a60c

    @serhiy-storchaka
    Member Author

    New changeset 349f76c by Serhiy Storchaka in branch 'master':
    bpo-36346: Prepare for removing the legacy Unicode C API (AC only). (GH-21223)
    349f76c

    @methane
    Member

    methane commented Jun 30, 2020

    New changeset 038dd0f by Inada Naoki in branch 'master':
    bpo-36346: Raise DeprecationWarning when creating legacy Unicode (GH-20933)
    038dd0f

    @serhiy-storchaka
    Member Author

    There is no need to deprecate _PyUnicode_AsUnicode. It is a private function. Undeprecating it will make the code clearer.

    @serhiy-storchaka
    Member Author

    New changeset b3dd5cd by Serhiy Storchaka in branch 'master':
    bpo-36346: Undeprecate private function _PyUnicode_AsUnicode(). (GH-21336)
    b3dd5cd

    @serhiy-storchaka
    Member Author

    New changeset d878349 by Serhiy Storchaka in branch 'master':
    bpo-36346: Do not use legacy Unicode C API in ctypes. (GH-21429)
    d878349

    @serhiy-storchaka
    Member Author

    New changeset 4c8f09d by Serhiy Storchaka in branch 'master':
    bpo-36346: Make using the legacy Unicode C API optional (GH-21437)
    4c8f09d

    @methane
    Member

    methane commented Aug 5, 2020

    New changeset 270b4ad by Inada Naoki in branch 'master':
    bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
    270b4ad

    @miss-islington
    Contributor

    New changeset ea68063 by Miss Islington (bot) in branch '3.9':
    bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
    ea68063

    @miss-islington
    Contributor

    New changeset f0e030c by Miss Islington (bot) in branch '3.8':
    bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
    f0e030c

    @methane
    Member

    methane commented Feb 22, 2021

    New changeset 91a639a by Inada Naoki in branch 'master':
    bpo-36346: Emit DeprecationWarning for PyArg_Parse() with 'u' or 'Z'. (GH-20927)
    91a639a

    @methane
    Member

    methane commented Feb 22, 2021

    New changeset 2d6f2ee by Inada Naoki in branch 'master':
    bpo-36346: Document removal schedule of deprecate APIs (GH-20879)
    2d6f2ee

    @miss-islington
    Contributor

    New changeset 93853b7 by Miss Islington (bot) in branch '3.9':
    bpo-36346: Document removal schedule of deprecate APIs (GH-20879)
    93853b7

    @willingc
    Contributor

    willingc commented Mar 1, 2021

    New changeset 346afeb by Miss Islington (bot) in branch '3.8':
    bpo-36346: Document removal schedule of deprecate APIs (GH-20879) (GH-24626)
    346afeb

    @iritkatriel
    Member

    Is there anything left to do here?

    @methane
    Member

    methane commented Jan 29, 2022

    No. I'm just waiting for Python 3.11 to reach beta.

    @methane methane closed this as completed Jan 29, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    vstinner added a commit that referenced this issue Apr 22, 2022
    Deprecate functions:
    
    * PyUnicode_AS_DATA()
    * PyUnicode_AS_UNICODE()
    * PyUnicode_GET_DATA_SIZE()
    * PyUnicode_GET_SIZE()
    
    Previously, these functions were macros and so it wasn't possible to
    decorate them with Py_DEPRECATED().
    vstinner added a commit to vstinner/cpython that referenced this issue Aug 24, 2023
    The decorator now requires to be called:
    
        @support.requires_legacy_unicode_capi()
    
    instead of:
    
        @support.requires_legacy_unicode_capi
    
    The implementation now only imports _testcapi when the decorator is
    called, so "import test.support" no longer imports the _testcapi
    extension.
    vstinner added a commit to vstinner/cpython that referenced this issue Aug 24, 2023
    The decorator now requires to be called with parenthesis:
    
        @support.requires_legacy_unicode_capi()
    
    instead of:
    
        @support.requires_legacy_unicode_capi
    
    The implementation now only imports _testcapi when the decorator is
    called, so "import test.support" no longer imports the _testcapi
    extension.
    vstinner added a commit that referenced this issue Aug 24, 2023
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 24, 2023
    …GH-108438)
    
    The decorator now requires to be called with parenthesis:
    
        @support.requires_legacy_unicode_capi()
    
    instead of:
    
        @support.requires_legacy_unicode_capi
    
    The implementation now only imports _testcapi when the decorator is
    called, so "import test.support" no longer imports the _testcapi
    extension.
    (cherry picked from commit 995f4c4)
    
    Co-authored-by: Victor Stinner <vstinner@python.org>
    Yhg1s pushed a commit that referenced this issue Aug 25, 2023
    …8438) (#108446)
    
    gh-80527: Change support.requires_legacy_unicode_capi() (GH-108438)
    
    The decorator now requires to be called with parenthesis:
    
        @support.requires_legacy_unicode_capi()
    
    instead of:
    
        @support.requires_legacy_unicode_capi
    
    The implementation now only imports _testcapi when the decorator is
    called, so "import test.support" no longer imports the _testcapi
    extension.
    (cherry picked from commit 995f4c4)
    
    Co-authored-by: Victor Stinner <vstinner@python.org>