Encoding and alias issues #79732

Closed
BLKSerene mannequin opened this issue Dec 21, 2018 · 8 comments
Labels: 3.8 (only security fixes), topic-unicode, type-feature (a feature request or enhancement)

Comments

@BLKSerene
Mannequin

BLKSerene mannequin commented Dec 21, 2018

BPO 35551
Nosy @malemburg, @vstinner, @ezio-melotti, @methane, @csabella, @BLKSerene, @epicfaace
PRs
  • bpo-35551: encodings update #11446
  • bpo-35551: remove mac_centeuro encoding #13856
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/malemburg'
    closed_at = <Date 2019-06-05.22:20:09.360>
    created_at = <Date 2018-12-21.10:08:12.249>
    labels = ['type-feature', '3.8', 'expert-unicode']
    title = 'Encoding and alias issues'
    updated_at = <Date 2019-06-06.05:39:04.327>
    user = 'https://github.com/BLKSerene'

    bugs.python.org fields:

    activity = <Date 2019-06-06.05:39:04.327>
    actor = 'methane'
    assignee = 'lemburg'
    closed = True
    closed_date = <Date 2019-06-05.22:20:09.360>
    closer = 'cheryl.sabella'
    components = ['Unicode']
    creation = <Date 2018-12-21.10:08:12.249>
    creator = 'blkserene'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35551
    keywords = ['patch', 'patch', 'patch', 'patch']
    message_count = 8.0
    messages = ['332285', '333115', '336493', '336496', '336497', '344749', '344773', '344788']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'vstinner', 'ezio.melotti', 'methane', 'cheryl.sabella', 'blkserene', 'epicfaace']
    pr_nums = ['11446', '11446', '11446', '11446', '13856']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue35551'
    versions = ['Python 3.8']

    @BLKSerene
    Mannequin Author

    BLKSerene mannequin commented Dec 21, 2018

    There are some minor issues with the encodings supported by Python.

    1. "tis260" is an alias for "tactis". "tis260" looks like a typo for "tis620", and "tactis" is not an encoding supported by Python (I can't find any information about this encoding on Google either).
    2. "mac_latin2" and "mac_centeuro" refer to the same encoding (the decoding tables are identical), but they are provided as two encodings under different names ("maccentraleurope" is an alias for "mac_latin2", but "mac_centeuro" isn't).
    3. The same problem applies to "latin_1" and "iso8859_1" ("iso_8859_1" is an alias for "latin_1", but "iso8859_1" isn't).
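The three points above can be checked against a local interpreter. The following is only a sketch: `alias_report` is a hypothetical helper, not part of the standard library, and the values it prints depend on the Python version, since the alias table was changed as a result of this issue.

```python
import codecs
from encodings import aliases


def alias_report(name):
    """Return (alias_target, codec_found) for an encoding name.

    alias_target is the entry for the name in encodings.aliases.aliases,
    or None if the name is not listed there as an alias.
    codec_found tells whether codecs.lookup() can actually load a codec.
    """
    # Simplified normalization (an assumption): lowercase, dashes to underscores.
    key = name.lower().replace("-", "_")
    target = aliases.aliases.get(key)
    try:
        codecs.lookup(name)
        found = True
    except LookupError:
        found = False
    return target, found


for name in ("tis260", "tactis", "mac_centeuro", "mac_latin2", "iso8859_1", "latin_1"):
    print(name, alias_report(name))
```

An alias whose target cannot be loaded, as reported for "tactis", shows up as a non-`None` target combined with `codec_found == False`.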

    @BLKSerene added the 3.7 (EOL), topic-unicode, and type-feature labels on Dec 21, 2018
    @epicfaace
    Mannequin

    epicfaace mannequin commented Jan 6, 2019

    "iso8859_1" is already an alias for "latin_1", though. https://github.com/python/cpython/blob/master/Lib/encodings/aliases.py#L432

    @methane
    Member

    methane commented Feb 25, 2019

    Removing the unused alias is OK.
    But I'm not sure about adding a new alias.

    The encodings/ package contains both mac_centeuro.py and mac_latin2.py.
    Why is an alias needed unless mac_centeuro.py is removed?

    @methane added the 3.8 label and removed the 3.7 (EOL) label on Feb 25, 2019
    @BLKSerene
    Mannequin Author

    BLKSerene mannequin commented Feb 25, 2019

    I suppose that mac_centeuro can be removed since it is identical to mac_latin2, and there are already some aliases for mac_latin2. Then, mac_centeuro can be added as an alias for mac_latin2.

    I'm not sure why latin_1 and iso8859_1 are both supported (they are identical). The doc says:

    "CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution."

    Also not sure whether this would matter or not.
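As a rough illustration of the quoted note, the spellings it lists for Latin-1 should all resolve to the same codec. This sketch only compares the `name` reported by `codecs.lookup()`; it does not measure whether the fast path is taken, and the `spellings` list is chosen here for illustration.

```python
import codecs

# Spellings of Latin-1 taken from the quoted note, plus the underscore forms.
spellings = ["latin-1", "latin1", "iso-8859-1", "iso8859-1", "latin_1", "iso8859_1"]

# Map each spelling to the canonical name reported by the codec registry.
canonical = {spelling: codecs.lookup(spelling).name for spelling in spellings}

for spelling, name in canonical.items():
    print(f"{spelling!r} -> {name!r}")
```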

    @methane
    Member

    methane commented Feb 25, 2019

    @lemburg
    I confirmed mac_latin2 and mac_centeuro are identical, even though they are generated from different sources.

    >>> from encodings import mac_latin2, mac_centeuro
    >>> mac_latin2.decoding_table == mac_centeuro.decoding_table
    True

    What do you think about removing mac_centeuro and adding it as an alias for mac_latin2?
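The same comparison can be made through the public codec API rather than the module-level `decoding_table`, which may be handy on versions where `encodings.mac_centeuro` can no longer be imported. This is a sketch for single-byte codecs only; `same_byte_mapping` is a hypothetical helper name.

```python
def same_byte_mapping(first, second):
    """Return True if both codecs decode every byte value 0..255 identically.

    Only meaningful for single-byte (charmap-style) encodings.
    """
    data = bytes(range(256))
    # errors="replace" turns undefined byte values into U+FFFD instead of
    # raising, so codecs with holes in their tables can still be compared.
    return data.decode(first, errors="replace") == data.decode(second, errors="replace")


print(same_byte_mapping("mac_latin2", "maccentraleurope"))
print(same_byte_mapping("latin_1", "cp1252"))
```

The second call compares two encodings that differ in the 0x80 to 0x9F range, so it serves as a negative control.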

    @malemburg
    Member

    1. Background for "tactis":

    4fd73f0

    # The codecs for these encodings are not distributed with the
    # Python core, but are included here for reference, since the
    # locale module relies on having these aliases available.
    

    This codec was available as a separate package at the time. Later the CJK codecs got added to the stdlib, but this codec was not.

    I guess it's fine to remove the alias.

    2. If the mappings are identical, just leaving one and making the other an alias is fine. Same for aliases of those mapping names.

    3. I think we had already resolved this some time ago.
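To illustrate what "making the other an alias" means in practice, an alias is just an entry in the `encodings.aliases.aliases` dictionary that maps a normalized name to a codec module name. The sketch below adds one at runtime under a made-up name, `x_demo_centeuro`, which is purely hypothetical; the actual fix edits `Lib/encodings/aliases.py` rather than mutating the dictionary like this.

```python
import codecs
from encodings import aliases

# Hypothetical alias name, used only for this demonstration.
DEMO_ALIAS = "x_demo_centeuro"

# Register the alias before the first lookup of that name, because failed
# lookups are cached by the codec machinery.
aliases.aliases[DEMO_ALIAS] = "mac_latin2"

alias_info = codecs.lookup(DEMO_ALIAS)
target_info = codecs.lookup("mac_latin2")
print(alias_info.name, target_info.name)
```

Both lookups should report the same codec name, since the alias simply redirects the search to the `mac_latin2` module.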

    @csabella
    Contributor

    csabella commented Jun 5, 2019

    New changeset c4c15ed by Cheryl Sabella (Ashwin Ramaswami) in branch 'master':
    bpo-35551: encodings update (GH-11446)
    c4c15ed

    @csabella closed this as completed on Jun 5, 2019
    @methane
    Member

    methane commented Jun 6, 2019

    New changeset cb65202 by Inada Naoki in branch 'master':
    bpo-35551: remove mac_centeuro encoding (GH-13856)
    cb65202
