Title: Encoding and alias issues
Type: enhancement Stage: patch review
Components: Unicode Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: lemburg Nosy List: blkserene, epicfaace, ezio.melotti, inada.naoki, lemburg, vstinner
Priority: normal Keywords: patch, patch, patch, patch

Created on 2018-12-21 10:08 by blkserene, last changed 2019-04-05 12:50 by inada.naoki.

Pull Requests
URL Status Linked Edit
PR 11446 open epicfaace, 2019-01-06 17:25
Messages (5)
msg332285 - (view) Author: BLKSerene (blkserene) Date: 2018-12-21 10:08
There are some minor issues with the encodings supported by Python.
1. "tis260" is an alias for "tactis". "tis260" is probably a typo for "tis620", and "tactis" is not an encoding supported by Python (I can't find any information about this encoding on Google either).
2. "mac_latin2" and "mac_centeuro" refer to the same encoding (the decoding tables are identical), but they are provided as two separate encodings under different names ("maccentraleurope" is an alias for "mac_latin2", but "mac_centeuro" isn't).
3. The same problem applies to "latin_1" and "iso8859_1" ("iso_8859_1" is an alias for "latin_1", but "iso8859_1" isn't).
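
Claims like the ones above can be checked against the alias table itself. The following is a minimal sketch, assuming the table is the `aliases` dict in `encodings.aliases`; the helper name `find_dangling_aliases` is made up for illustration. It lists every alias whose target cannot be looked up as a codec on the running interpreter. The result depends on the Python version and platform (for example, Windows-only codecs are reported as unavailable elsewhere).

```python
import codecs
from encodings.aliases import aliases


def find_dangling_aliases():
    """Return {alias: target} for aliases whose target codec cannot be looked up."""
    dangling = {}
    for alias, target in sorted(aliases.items()):
        try:
            codecs.lookup(target)
        except LookupError:
            # The target codec is missing or unavailable on this platform.
            dangling[alias] = target
    return dangling


if __name__ == "__main__":
    for alias, target in find_dangling_aliases().items():
        print(f"{alias!r} -> {target!r} (target not available)")
    # Point 3 can be checked directly against the alias table.
    print(aliases.get("iso_8859_1"), aliases.get("iso8859_1"))
```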
msg333115 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-01-06 17:24
"iso8859_1" is already an alias for "latin_1", though.
msg336493 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 00:09
Removing an unused alias is OK.
But I'm not sure about adding a new alias.

In encodings/ package, there are both of and
Why alias is needed, without removing
msg336496 - (view) Author: BLKSerene (blkserene) Date: 2019-02-25 04:36
I suppose that mac_centeuro can be removed since it is identical to mac_latin2, and there are already some aliases for mac_latin2. Then, mac_centeuro can be added as an alias for mac_latin2.

I'm not sure why latin_1 and iso8859_1 are both supported (they are identical). The doc says:

 "CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution."

I'm also not sure whether this matters here.
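
As a rough illustration of the spellings mentioned in the quoted passage, the sketch below resolves several of them through `codecs.lookup` and collects the canonical codec names. The list `SPELLINGS` and the helper `canonical_names` are made up for this example. It only shows that the spellings end up at the same codec; it does not measure or detect the CPython fast path, which is not visible through the lookup API.

```python
import codecs

# A few spellings of the same encoding, chosen for illustration only.
SPELLINGS = ["latin-1", "latin1", "iso-8859-1", "iso8859-1", "iso_8859_1", "ISO-8859-1"]


def canonical_names(spellings):
    """Map each spelling to the name of the codec that the lookup resolves to."""
    return {spelling: codecs.lookup(spelling).name for spelling in spellings}


if __name__ == "__main__":
    for spelling, name in canonical_names(SPELLINGS).items():
        print(f"{spelling!r} resolves to {name!r}")
```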
msg336497 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 05:12
I confirmed that mac_latin2 and mac_centeuro are identical, even though they are generated from different sources.

>>> from encodings import mac_latin2, mac_centeuro
>>> mac_latin2.decoding_table == mac_centeuro.decoding_table

What do you think about removing mac_centeuro and adding it as an alias for mac_latin2?
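
A variant of the check above that does not import the codec modules directly may be useful on interpreters where one of them is no longer shipped. This is a sketch only: the helper `same_single_byte_mapping` is a made-up name, and it assumes both names refer to single-byte encodings. It decodes all 256 byte values under each name and compares the results, with undefined bytes replaced by U+FFFD. If one name is merely an alias of the other, the comparison is trivially true.

```python
# Every possible byte value, in order.
ALL_BYTES = bytes(range(256))


def same_single_byte_mapping(first, second):
    """Return True if two single-byte encodings decode every byte identically."""
    # errors="replace" turns undefined bytes into U+FFFD instead of raising.
    return ALL_BYTES.decode(first, "replace") == ALL_BYTES.decode(second, "replace")


if __name__ == "__main__":
    print(same_single_byte_mapping("mac_latin2", "mac_centeuro"))
    print(same_single_byte_mapping("latin_1", "iso8859_1"))
```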
Date User Action Args
2019-04-05 12:50:53 inada.naoki set keywords: patch, patch, patch, patch
assignee: lemburg
2019-02-25 05:12:54 inada.naoki set keywords: patch, patch, patch, patch
nosy: + lemburg
messages: + msg336497
2019-02-25 04:36:42 blkserene set messages: + msg336496
2019-02-25 00:09:16 inada.naoki set versions: + Python 3.8, - Python 3.7
nosy: + inada.naoki
messages: + msg336493
keywords: patch, patch, patch, patch
2019-01-06 17:26:19 epicfaace set keywords: + patch
stage: patch review
pull_requests: + pull_request10901
2019-01-06 17:26:13 epicfaace set keywords: + patch
stage: (no value)
pull_requests: + pull_request10902
2019-01-06 17:26:07 epicfaace set keywords: + patch
stage: (no value)
pull_requests: + pull_request10900
2019-01-06 17:25:57 epicfaace set keywords: + patch
stage: (no value)
pull_requests: + pull_request10899
2019-01-06 17:24:43 epicfaace set nosy: + epicfaace
messages: + msg333115
2018-12-21 10:08:12 blkserene create