
classification
Title: Add encoding aliases from the (HTML5) Encoding Standard
Type: enhancement
Stage: test needed
Components: Unicode
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, fbidu, lemburg, loewis, zwol
Priority: normal
Keywords: patch

Created on 2015-10-15 18:13 by zwol, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL Status Linked
PR 10237 open fbidu, 2018-10-30 12:14
Messages (4)
msg253061 - Author: Zack Weinberg (zwol) * Date: 2015-10-15 18:13
The codecs registry (as of 3.4) is unaware of two of the canonical encoding names from <https://encoding.spec.whatwg.org/#names-and-labels>: "windows-874" and "x-mac-cyrillic".  For interoperability's sake, please make these aliases for "cp874" and "mac_cyrillic" respectively.

(For full interop, *every* name and label in that list should be understood by str.encode(), but the canonical names are most useful.  Lack of support for iso-8859-i is already reported as https://bugs.python.org/issue18624 .  I have not tested the full set of non-canonical labels.)
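
To see which of these labels a given interpreter already understands, a small probe along the following lines may help. This is only a sketch: the helper name `resolve` is hypothetical, and the output depends on the Python version and on whether the aliases have been added.

```python
import codecs


def resolve(label):
    """Return the name of the codec that `label` resolves to, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # The codecs registry does not know this label.
        return None


# The two WHATWG names from this report, followed by the Python codec names
# they are expected to map to.
for label in ("windows-874", "x-mac-cyrillic", "cp874", "mac_cyrillic"):
    print(label, "->", resolve(label))
```

A result of `None` for a label means that `str.encode()` and `bytes.decode()` would raise `LookupError` for it.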
msg328990 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2018-10-31 12:25
Adding those aliases sounds good to me.  I think it would be good to add some tests first (possibly as a separate issue/PR), even though I'm not sure what the best way to test the aliases would be.

Testing whether the list is complete and correct should be done against the HTML5/Unicode specs, but automating that would require downloading and parsing the specs, which is probably not worth doing.

We can also check that all the aliases are accepted by str.encode/decode and that all corresponding aliases give the same result. However, if str.encode/decode use the aliases dict, the test is nothing more than a sanity check and won't detect e.g. typos in the alias names or wrongly assigned aliases.
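
The sanity check described above could look roughly like this. It is a minimal sketch: `same_result` is a hypothetical helper, and the pair used below is an alias that already exists, standing in for the ones proposed in this issue.

```python
def same_result(alias, target, sample):
    """Check that `alias` and `target` encode and decode `sample` identically."""
    encoded_alias = sample.encode(alias)
    encoded_target = sample.encode(target)
    if encoded_alias != encoded_target:
        return False
    # Decoding through either name should give back the same text.
    return encoded_alias.decode(alias) == encoded_target.decode(target)


# "latin1" is an existing alias for "iso-8859-1".
print(same_result("latin1", "iso-8859-1", "café"))
```

As noted, this only confirms that both names reach the same codec. It cannot tell whether the alias table itself contains a typo or a wrong assignment.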
msg329093 - Author: Felipe Rodrigues (fbidu) * Date: 2018-11-02 01:17
Ezio, I have opened a simple PR that adds just the two aliases cited in the issue's initial message. I would like to implement tests but, as I wrote in the PR's message, I'm not really sure how to proceed. bpo-18624 is closely related to this issue, and it references a test_codecs.py file that I could not find.

If you could give me a few pointers on how to proceed, I'll be glad to improve my PR, add tests, and even add all the other missing aliases.
msg329115 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-11-02 08:39
Please note that we can only add aliases if the encodings are indeed the same. Given that WhatWG has made changes to several standard encodings, this is especially important, since our codecs are mostly based on what the Unicode consortium defines as these encodings.
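
One way to get a feel for whether two single-byte encodings are "indeed the same" is to compare how each decodes every byte value. The sketch below compares two Python codecs with each other. It does not consult the WHATWG index tables, which would have to be obtained separately, and `differing_bytes` is a hypothetical helper name.

```python
def differing_bytes(codec_a, codec_b):
    """Return the byte values (0-255) that the two codecs decode differently.

    Only meaningful for single-byte encodings.
    """
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" maps undefined bytes to U+FFFD instead of raising.
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs.append(value)
    return diffs


# latin-1 and cp1252 are often confused but are not the same encoding.
print([hex(b) for b in differing_bytes("latin-1", "cp1252")])
```

An empty list would suggest that the two codecs agree on every byte, which is the precondition for treating one name as an alias of the other.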

Tests for aliases can be minimal: just verify that the codecs subsystem recognizes them and that the correct codec is used. There's no need to download any WhatWG specs for this.
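
A minimal test in that spirit might look like the sketch below. The table `EXPECTED_ALIASES` and the class name are hypothetical, and the entries shown are aliases that already resolve. Pairs such as ("windows-874", "cp874") would be added to the table once the aliases from this issue exist.

```python
import codecs
import unittest

# Hypothetical table of (alias, expected target) pairs, kept separate from
# encodings.aliases so that a typo in the alias dict makes the test fail.
EXPECTED_ALIASES = [
    ("latin1", "iso-8859-1"),
    ("utf8", "utf-8"),
    ("us-ascii", "ascii"),
]


class AliasTest(unittest.TestCase):
    def test_alias_resolves_to_expected_codec(self):
        for alias, target in EXPECTED_ALIASES:
            with self.subTest(alias=alias):
                # Both names must be known and must resolve to the same codec.
                self.assertEqual(codecs.lookup(alias).name,
                                 codecs.lookup(target).name)
```

Saved in a test module, this can be run with `python -m unittest`. A missing alias shows up as a `LookupError` in the corresponding subtest.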
History
Date User Action Args
2022-04-11 14:58:22 admin set github: 69602
2018-11-02 08:39:30 lemburg set messages: + msg329115
2018-11-02 01:17:55 fbidu set messages: + msg329093
2018-10-31 20:09:50 vstinner set nosy: - vstinner
2018-10-31 12:25:16 ezio.melotti set versions: + Python 3.8, - Python 3.6; nosy: + fbidu; messages: + msg328990; stage: patch review -> test needed
2018-10-30 12:14:39 fbidu set keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request9549
2015-10-15 18:37:56 serhiy.storchaka set nosy: + ezio.melotti, lemburg, loewis, vstinner; stage: needs patch; components: + Unicode; versions: + Python 3.6
2015-10-15 18:13:07 zwol create