Cleanup the unicodedata module #86323

Closed
vstinner opened this issue Oct 26, 2020 · 8 comments
Labels: 3.10 (only security fixes), stdlib (Python modules in the Lib dir), topic-unicode

Comments

@vstinner
Member

BPO 42157
Nosy @malemburg, @vstinner, @ezio-melotti
PRs
  • bpo-42157: unicodedata avoids references to UCD_Type #22990
  • bpo-42157: Convert unicodedata.UCD to heap type #22991
  • bpo-42157: Rename unicodedata.ucnhash_CAPI #22994
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-10-27.03:40:56.728>
    created_at = <Date 2020-10-26.17:05:20.781>
    labels = ['library', '3.10', 'expert-unicode']
    title = 'Cleanup the unicodedata module'
    updated_at = <Date 2020-10-27.03:40:56.728>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-10-27.03:40:56.728>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-10-27.03:40:56.728>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2020-10-26.17:05:20.781>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 42157
    keywords = ['patch']
    message_count = 8.0
    messages = ['379673', '379674', '379678', '379693', '379699', '379719', '379727', '379728']
    nosy_count = 3.0
    nosy_names = ['lemburg', 'vstinner', 'ezio.melotti']
    pr_nums = ['22990', '22991', '22994']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue42157'
    versions = ['Python 3.10']

    @vstinner
    Member Author

    Mohamed Koubaa and I are trying to convert the unicodedata module to the multi-phase initialization API (PEP-489) and to convert the UCD static type to a heap type in bpo-1635741.

    The unicodedata extension module has some special cases:

    • It has a C API exposed in Python as the "unicodedata.ucnhash_CAPI" PyCapsule object.
    • In C, the unicodedata_functions array is used to define module functions *AND* unicodedata.UCD methods. It is unusual to do that and makes the conversion more tricky.
    • Most C functions have a "self" parameter which is used to choose between the current version of the Unicode database and the version 3.2.0 ("unicodedata.ucd_3_2_0").

    There is also a unicodedata.UCD type which cannot be instantiated in Python. It is only used to create the unicodedata.ucd_3_2_0 instance.
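The two database versions described above can be compared from Python. This is a minimal sketch, assuming CPython 3.10 or later; the sample character (U+1F600) and the helper name `can_instantiate` are choices made for illustration and do not come from the issue.

```python
import unicodedata

# U+1F600 GRINNING FACE was added to Unicode long after version 3.2.
SAMPLE = "\U0001F600"

# Module-level functions use the current Unicode database.
current_version = unicodedata.unidata_version
current_category = unicodedata.category(SAMPLE)

# The ucd_3_2_0 instance exposes the same functions as methods,
# answering from the Unicode 3.2.0 data instead.
legacy_version = unicodedata.ucd_3_2_0.unidata_version
legacy_category = unicodedata.ucd_3_2_0.category(SAMPLE)

# The UCD type exists only to back ucd_3_2_0.
ucd_type = type(unicodedata.ucd_3_2_0)


def can_instantiate(cls):
    """Return True if calling cls() succeeds, False on TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True


print(current_version, current_category)
print(legacy_version, legacy_category)
print(ucd_type.__name__, can_instantiate(ucd_type))
```

The character is unassigned ("Cn") under the 3.2.0 data and assigned under the current data, which is the kind of difference the "self" parameter selects between.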

    In the commit 47e1afd, I moved the private _PyUnicode_Name_CAPI structure to internal C API.

    In the commit ddc0dd0, Mohamed added a ucd_type parameter to the UCD_Check() macro. I asked him to do that.

    In the commit e6b8c52, I added a "global module state" and a "state" parameter to most functions. This change prepares the code base to pass a UCD type instance to functions, so that there can be more than one UCD type once it is converted to a heap type: one type per module instance.

    The technical problem is that unicodedata_functions is used for module functions and UCD methods. Duplicating unicodedata_functions would require duplicating a lot of code and comments.

    Sadly, it does not seem easy to retrieve the "module state" ("state" variable) in functions, since unicodedata_functions is reused for module functions and UCD methods. Using "defining_class" in Argument Clinic would require duplicating all unicodedata_functions functions: one flavor for module functions, one flavor for the UCD type. It would also require duplicating all docstrings, which increases the maintenance burden and introduces a risk of inconsistencies.

    Maybe we could introduce a new UCD instance mapped to the current Unicode Character Database version, and make the module functions bound methods of this instance. But it sounds like overkill to me.
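The alternative floated above (one instance per database version, with module functions being bound methods of the current-version instance) can be pictured in pure Python. This is only a sketch of the idea, not how unicodedata is implemented: the class `VersionedUCD` and the names `current` and `legacy` are hypothetical, and the class simply delegates to the real module.

```python
import unicodedata


class VersionedUCD:
    """Hypothetical stand-in for a UCD instance tied to one database version."""

    def __init__(self, backend):
        # backend is either the unicodedata module or unicodedata.ucd_3_2_0.
        self._backend = backend
        self.unidata_version = backend.unidata_version

    def category(self, ch):
        return self._backend.category(ch)

    def name(self, ch, default=None):
        return self._backend.name(ch, default)


# One instance per database version.
current = VersionedUCD(unicodedata)
legacy = VersionedUCD(unicodedata.ucd_3_2_0)

# "Module functions" are then just bound methods of the current instance,
# so a single implementation serves both roles.
category = current.category
name = current.name

print(category("A"), name("A"))
print(legacy.category("\U0001F600"))
```

With this layout every function receives an instance as "self", so there is no need to tell modules and instances apart, at the cost of an extra object that exists only for symmetry.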

    By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

    For now, we can convert unicodedata to the multi-phase initialization API (PEP-489) and convert the UCD static type to a heap type by avoiding references to the UCD type. Rather than checking if self is an instance of UCD_Type, we can check if it is not a module (PyModule_Check). This is exactly what Mohamed proposed in the first place, but I misunderstood the whole issue and gave him bad advice.
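The "is self a module?" test has a visible counterpart in Python, because built-in functions expose the object they are bound to as `__self__`. The sketch below mirrors the PyModule_Check idea from the Python side; it assumes CPython, and the helper name `uses_current_database` is made up for illustration.

```python
import types
import unicodedata


def uses_current_database(func):
    """Mirror of the C-level check: a module as 'self' means the current UCD."""
    return isinstance(func.__self__, types.ModuleType)


# The module-level function is bound to the unicodedata module itself.
module_func = unicodedata.category
# The method is bound to the ucd_3_2_0 instance.
legacy_method = unicodedata.ucd_3_2_0.category

print(uses_current_database(module_func))
print(uses_current_database(legacy_method))
```

Checking for a module rather than for the UCD type means the shared C functions never need a reference to the type object, which is what makes the heap type conversion easier.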

    @vstinner added the "3.10" (only security fixes) and "stdlib" (Python modules in the Lib dir) labels on Oct 26, 2020
    @malemburg
    Member

    On 26.10.2020 18:05, STINNER Victor wrote:

    > By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

    The version 3.2.0 is needed for IDNA compatibility:

    IDNA 2003: https://tools.ietf.org/html/rfc3490
    requires Unicode 3.2 mappings

    IDNA 2008: https://tools.ietf.org/html/rfc5890 et al.
    requires Unicode 5.2+ mappings

    Python only supports IDNA 2003 AFAIK and the ucd_3_2_0 tag was added
    by Martin von Löwis to support it even after moving forward to more
    recent Unicode versions.

    IDNA 2008 seems to have mechanisms to also work for Unicode versions
    later than 5.2, but I don't know the details. See this TR for details
    on how IDNA compatibility is handled:

    http://www.unicode.org/reports/tr46/

    All that said, it may actually be better to deprecate IDNA 2003 support
    first and direct people to:

    https://pypi.org/project/idna/

    or incorporate this into the stdlib instead of IDNA 2003. The special
    tag can then be dropped.
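To make the IDNA 2003 point concrete, here is a small sketch using the standard library "idna" codec, which implements RFC 3490. The host names and the helper name `to_ascii_idna2003` are invented for illustration.

```python
def to_ascii_idna2003(hostname: str) -> bytes:
    """Encode a host name with the stdlib 'idna' codec (IDNA 2003)."""
    return hostname.encode("idna")


# A non-ASCII label is converted to its "xn--" Punycode form.
punycode_label = to_ascii_idna2003("bücher.example")

# IDNA 2003 maps U+00DF (sharp s) to "ss" during Nameprep.
sharp_s_mapped = to_ascii_idna2003("straße.example")

# Decoding restores the Unicode form of the first name.
round_trip = punycode_label.decode("idna")

print(punycode_label, sharp_s_mapped, round_trip)
```

The sharp s mapping is one of the places where IDNA 2003 and IDNA 2008 differ, so code that needs IDNA 2008 behaviour would have to use something other than this codec, such as the idna package linked above.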

    --
    Marc-Andre Lemburg
    eGenix.com


    @vstinner
    Member Author

    New changeset 920cb64 by Victor Stinner in branch 'master':
    bpo-42157: unicodedata avoids references to UCD_Type (GH-22990)
    920cb64

    @vstinner
    Member Author

    New changeset c8c4200 by Victor Stinner in branch 'master':
    bpo-42157: Convert unicodedata.UCD to heap type (GH-22991)
    c8c4200
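Whether a type is a heap type can be observed from Python through the type flags. This is a CPython-specific sketch: the constant mirrors `Py_TPFLAGS_HEAPTYPE` (bit 9 in CPython's object.h), and the helper name `is_heap_type` is made up for illustration.

```python
import unicodedata

# Same value as Py_TPFLAGS_HEAPTYPE in CPython's Include/object.h.
PY_TPFLAGS_HEAPTYPE = 1 << 9


def is_heap_type(cls: type) -> bool:
    """Return True if cls is allocated on the heap rather than statically."""
    return bool(cls.__flags__ & PY_TPFLAGS_HEAPTYPE)


class PurePythonClass:
    """Classes defined in Python are always heap types."""


ucd_type = type(unicodedata.ucd_3_2_0)

print("int:", is_heap_type(int))
print("PurePythonClass:", is_heap_type(PurePythonClass))
print("unicodedata.UCD:", is_heap_type(ucd_type))
```

On a Python version that includes this changeset (3.10 or later) the last line is expected to report a heap type, while older versions should report a static one.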

    @vstinner
    Member Author

    > By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code, and simplify the code a lot.

    Oh, it is used by the IDNA encoding (encodings.idna module) and the stringprep module (which is used by the encodings.idna module).
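The dependency of stringprep on the 3.2.0 data can be seen through its public table functions. In this sketch, `stringprep.in_table_a1()` tests membership in RFC 3454 table A.1 (code points unassigned in Unicode 3.2); the sample character is an arbitrary choice of something added in a later Unicode version.

```python
import stringprep
import unicodedata

# U+1F600 was assigned well after Unicode 3.2.
SAMPLE = "\U0001F600"

# stringprep answers according to the Unicode 3.2 data.
unassigned_for_stringprep = stringprep.in_table_a1(SAMPLE)

# The current database knows the character.
assigned_today = unicodedata.category(SAMPLE) != "Cn"

# A character that already existed in Unicode 3.2 is not in table A.1.
latin_a_unassigned = stringprep.in_table_a1("A")

print(unassigned_for_stringprep, assigned_today, latin_a_unassigned)
```

If ucd_3_2_0 were removed, answers like the first one would change with every Unicode upgrade, which would no longer match what RFC 3454 specifies.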

    @vstinner
    Member Author

    > The version 3.2.0 is needed for IDNA compatibility (...)

    Oh, I missed your comment. I also discovered it by trying to remove it :-)

    So I think that the last thing to do for this issue is to remove unicodedata.ucnhash_CAPI: PR 22994.

    @vstinner
    Member Author

    New changeset 84f7382 by Victor Stinner in branch 'master':
    bpo-42157: Rename unicodedata.ucnhash_CAPI (GH-22994)
    84f7382
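After this change the capsule is reachable under a different attribute name. The sketch below looks it up defensively; it assumes the new name is `_ucnhash_CAPI` (the name used in Python 3.10 and later, not spelled out in this thread), and the helper name `find_ucnhash_capsule` is made up for illustration.

```python
import unicodedata


def find_ucnhash_capsule():
    """Return (attribute name, capsule) for whichever name this Python uses."""
    # Assumption: "_ucnhash_CAPI" is the post-rename name,
    # "ucnhash_CAPI" the older public one.
    for attr in ("_ucnhash_CAPI", "ucnhash_CAPI"):
        capsule = getattr(unicodedata, attr, None)
        if capsule is not None:
            return attr, capsule
    return None, None


attr_name, capsule = find_ucnhash_capsule()
print(attr_name, type(capsule).__name__)

# Character name lookups such as \N{...} escapes rely on this C API.
euro = b"\\N{EURO SIGN}".decode("unicode_escape")
print(euro)
```

The capsule is an implementation detail meant for C code, so Python code should treat it as opaque and avoid depending on either attribute name.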

    @vstinner
    Member Author

    I kept unicodedata.ucd_3_2_0 and added a comment to explain why it's still relevant in 2020.

    I'm done with tasks listed in this issue, so I close it.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022