classification
Title: Encoding and alias issues
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: lemburg Nosy List: blkserene, cheryl.sabella, epicfaace, ezio.melotti, inada.naoki, lemburg, vstinner
Priority: normal Keywords: patch

Created on 2018-12-21 10:08 by blkserene, last changed 2019-06-06 05:39 by inada.naoki. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11446 merged epicfaace, 2019-01-06 17:25
PR 13856 merged inada.naoki, 2019-06-06 03:35
Messages (8)
msg332285 - (view) Author: BLKSerene (blkserene) Date: 2018-12-21 10:08
There are some minor issues with the encodings supported by Python.
1. "tis260" is an alias for "tactis", but "tis260" looks like a typo for "tis620". Also, "tactis" is not an encoding supported by Python (and I can't find any information about this encoding on Google).
2. "mac_latin2" and "mac_centeuro" refer to the same encoding (the decoding tables are identical), but they are provided as two separate encodings under different names ("maccentraleurope" is an alias for "mac_latin2", but "mac_centeuro" isn't).
3. The same problem applies to "latin_1" and "iso8859_1" ("iso_8859_1" is an alias for "latin_1", but "iso8859_1" isn't).
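
Point 1 above can be checked mechanically. The following is a minimal sketch, not the check used in the linked patch: it assumes that an alias is "dangling" when `codecs.lookup()` cannot resolve its target, and the helper name `find_dangling_aliases` is hypothetical.

```python
import codecs
from encodings.aliases import aliases


def find_dangling_aliases(alias_map=aliases):
    """Return {alias: target} for aliases whose target codec cannot be looked up.

    Hypothetical helper for illustration; it is not part of the stdlib.
    """
    dangling = {}
    for alias, target in alias_map.items():
        try:
            codecs.lookup(target)
        except LookupError:
            # No codec module is available under the target name.
            dangling[alias] = target
    return dangling


# Reproduce the reported case with an explicit map, so the result does not
# depend on which aliases the running Python version still ships.
print(find_dangling_aliases({"tis260": "tactis", "iso_8859_1": "latin_1"}))
# → {'tis260': 'tactis'}
```

Calling the helper without arguments scans the alias table of the running interpreter; the result then depends on the Python version and platform.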
msg333115 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-01-06 17:24
"iso8859_1" is already an alias for "latin_1", though. https://github.com/python/cpython/blob/master/Lib/encodings/aliases.py#L432
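
A quick way to confirm this without reading aliases.py is sketched below. It assumes that two names refer to the same codec when `codecs.lookup()` reports the same `CodecInfo.name` for both; the helper `resolves_to_same_codec` is a hypothetical name.

```python
import codecs
from encodings.aliases import aliases


def resolves_to_same_codec(name_a, name_b):
    """Return True if both names resolve to a codec with the same canonical name."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# The alias table maps the normalized name to the module that implements it.
print(aliases.get("iso8859_1"))
print(resolves_to_same_codec("iso8859_1", "latin_1"))
```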
msg336493 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 00:09
Removing an unused alias is OK.
But I'm not sure about adding a new alias.

The encodings/ package contains both mac_centeuro.py and mac_latin2.py.
Why is an alias needed unless mac_centeuro.py is removed?
msg336496 - (view) Author: BLKSerene (blkserene) Date: 2019-02-25 04:36
I suppose that mac_centeuro can be removed since it is identical to mac_latin2, and there are already some aliases for mac_latin2. Then, mac_centeuro can be added as an alias for mac_latin2.

I'm not sure why latin_1 and iso8859_1 are both supported (they are identical). The doc says:

 "CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution."

I'm also not sure whether this matters.
msg336497 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 05:12
@lemburg
I confirmed that mac_latin2 and mac_centeuro are identical, even though they are generated from different sources.

>>> from encodings import mac_latin2, mac_centeuro
>>> mac_latin2.decoding_table == mac_centeuro.decoding_table
True

What do you think about removing mac_centeuro and adding it as an alias for mac_latin2?
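
The session above needs the mac_centeuro module, which a later changeset on this issue removed. As a version-independent sketch of the same comparison, the hypothetical helper below decodes every byte value through two codec names and compares the results. It assumes both names refer to single-byte codecs.

```python
def same_single_byte_mapping(name_a, name_b):
    """Return True if two single-byte codecs decode all 256 byte values identically.

    Hypothetical helper for illustration. Undefined positions decode to
    U+FFFD because of errors="replace", so they compare equal only if they
    are undefined in both codecs.
    """
    for value in range(256):
        byte = bytes([value])
        if byte.decode(name_a, errors="replace") != byte.decode(name_b, errors="replace"):
            return False
    return True


# "maccentraleurope" is an alias for mac_latin2, so the mappings must match.
print(same_single_byte_mapping("mac_latin2", "maccentraleurope"))  # → True
# latin_1 and cp1252 differ in the 0x80-0x9F range.
print(same_single_byte_mapping("latin_1", "cp1252"))  # → False
```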
msg344749 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2019-06-05 16:55
1. Background for "tactis":

https://github.com/python/cpython/commit/4fd73f0465ba11c22f0986d04cf91b387ed22c47

    # The codecs for these encodings are not distributed with the
    # Python core, but are included here for reference, since the
    # locale module relies on having these aliases available.

This codec was available as a separate package at the time. Later the CJK codecs were added to the stdlib, but this codec was not.

I guess it's fine to remove the alias.

2. If the mappings are identical, just leaving one and making the other an alias is fine. Same for aliases of those mapping names.

3. I think we had already resolved this some time ago.
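
To see which names would be affected when one mapping is kept and the other becomes an alias, the alias table can be inverted. This is a minimal sketch; `aliases_by_codec` is a hypothetical helper, and the output depends on the alias table of the running Python version.

```python
from collections import defaultdict
from encodings.aliases import aliases


def aliases_by_codec(alias_map=aliases):
    """Group alias names by the codec module they point to."""
    grouped = defaultdict(list)
    for alias, target in alias_map.items():
        grouped[target].append(alias)
    # Sort each group so the output is stable across runs.
    return {target: sorted(names) for target, names in grouped.items()}


# List every alias that currently resolves to mac_latin2.
print(aliases_by_codec().get("mac_latin2"))
```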
msg344773 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-06-05 22:18
New changeset c4c15ed7a2c7c2a1983e88b89c244d121eb3e512 by Cheryl Sabella (Ashwin Ramaswami) in branch 'master':
bpo-35551: encodings update (GH-11446)
https://github.com/python/cpython/commit/c4c15ed7a2c7c2a1983e88b89c244d121eb3e512
msg344788 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-06 05:39
New changeset cb65202520e7959196a2df8215692de155bf0cc8 by Inada Naoki in branch 'master':
bpo-35551: remove mac_centeuro encoding (GH-13856)
https://github.com/python/cpython/commit/cb65202520e7959196a2df8215692de155bf0cc8
History
Date User Action Args
2019-06-06 05:39:04 inada.naoki set messages: + msg344788
2019-06-06 03:35:54 inada.naoki set pull_requests: + pull_request13731
2019-06-05 22:20:09 cheryl.sabella set keywords: patch, patch, patch, patch; status: open -> closed; resolution: fixed; stage: patch review -> resolved
2019-06-05 22:18:14 cheryl.sabella set nosy: + cheryl.sabella; messages: + msg344773
2019-06-05 16:55:07 lemburg set keywords: patch, patch, patch, patch; messages: + msg344749
2019-04-05 12:50:53 inada.naoki set keywords: patch, patch, patch, patch; assignee: lemburg
2019-02-25 05:12:54 inada.naoki set keywords: patch, patch, patch, patch; nosy: + lemburg; messages: + msg336497
2019-02-25 04:36:42 blkserene set messages: + msg336496
2019-02-25 00:09:16 inada.naoki set versions: + Python 3.8, - Python 3.7; nosy: + inada.naoki; messages: + msg336493; keywords: patch, patch, patch, patch
2019-01-06 17:26:19 epicfaace set keywords: + patch; stage: patch review; pull_requests: + pull_request10901
2019-01-06 17:26:13 epicfaace set keywords: + patch; stage: (no value); pull_requests: + pull_request10902
2019-01-06 17:26:07 epicfaace set keywords: + patch; stage: (no value); pull_requests: + pull_request10900
2019-01-06 17:25:57 epicfaace set keywords: + patch; stage: (no value); pull_requests: + pull_request10899
2019-01-06 17:24:43 epicfaace set nosy: + epicfaace; messages: + msg333115
2018-12-21 10:08:12 blkserene create