Title: codecs module doesn't support iso-8859-6-i, iso-8859-6-e, iso-8859-8-i or iso-8859-8-e
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.11, Python 3.10
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, msapiro
Priority: normal Keywords:

Created on 2021-11-29 01:38 by msapiro, last changed 2021-11-29 18:08 by msapiro.

Messages (3)
msg407240 - Author: Mark Sapiro (msapiro) * (Python triager) Date: 2021-11-29 01:38
iso-8859-6-i, iso-8859-6-e, iso-8859-8-i and iso-8859-8-e are all IANA-recognized character sets. None of them is recognized by codecs.lookup().
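
A quick way to check which of these names a given interpreter accepts is sketched below. The helper name `is_known` and the `NAMES` list are illustrative only and not part of the codecs module; the result may differ between Python versions.

```python
import codecs

# The four bidi-annotated charset names discussed in this issue.
NAMES = ["iso-8859-6-i", "iso-8859-6-e", "iso-8859-8-i", "iso-8859-8-e"]


def is_known(name):
    """Return True if codecs.lookup() can resolve the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    for name in NAMES:
        print(name, "known" if is_known(name) else "unknown")
```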
msg407262 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-11-29 11:30
Even though these are IANA-recognized encodings, we need to apply the same logic as we do for all new encodings, which essentially boils down to: are these encodings in widespread use today?

Reading through RFC 1556, it seems that the added -i or -e are just indications for applications on how to interpret BIDI information: either implicitly, by looking at the order of characters in the stream, or explicitly, via control characters embedded in the stream. They are not new encodings with new mappings.

If that's a correct interpretation, we could add those as aliases for the non-annotated encodings.
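
Until such aliases exist in the standard library, an application could register them itself. The following is a minimal sketch, assuming the interpretation above holds (the -i/-e variants share the byte mappings of the base encodings). The names `_BIDI_ALIASES` and `bidi_alias_search` are hypothetical; only `codecs.register()` and `codecs.lookup()` are real APIs. Bidirectional reordering is not handled here.

```python
import codecs

# Hypothetical alias table: annotated name -> existing Python codec name.
_BIDI_ALIASES = {
    "iso_8859_6_i": "iso8859_6",
    "iso_8859_6_e": "iso8859_6",
    "iso_8859_8_i": "iso8859_8",
    "iso_8859_8_e": "iso8859_8",
}


def bidi_alias_search(name):
    """Codec search function mapping the -i/-e names to the base codecs."""
    # Normalize defensively; older Python versions pass hyphens through.
    key = name.lower().replace("-", "_").replace(" ", "_")
    target = _BIDI_ALIASES.get(key)
    if target is None:
        return None  # let other search functions handle the name
    return codecs.lookup(target)


codecs.register(bidi_alias_search)

if __name__ == "__main__":
    print(b"\xf9\xec\xe5\xed".decode("iso-8859-8-i"))
```

Search functions registered this way are consulted only after the built-in ones fail, so the sketch should stay harmless if a later Python release adds the aliases natively.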

After more than 20 years with Unicode support in Python and the world moving towards UTF-8, I have become fairly reluctant towards adding more encoding support to Python.

If people are still using unsupported encodings, it's probably better to point them to other dedicated tools for converting text to UTF-8, e.g. iconv, than extending the pretty extensive support we already have in Python.
msg407305 - Author: Mark Sapiro (msapiro) * (Python triager) Date: 2021-11-29 18:08
The list received a post whose From: header contained a Hebrew display name, RFC 2047-encoded with the iso-8859-8-i charset. Processing raised "LookupError: unknown encoding: iso-8859-8-i" and the message was shunted. The message body also declared its charset as iso-8859-8-i, although it contained only ASCII. Unfortunately, I don't have the original message, so I can't say what MUA created it or how common this usage is.

I do think that just adding these as aliases for the non-annotated encodings is an appropriate response.
Date                 User           Action  Args
2021-11-29 18:08:04  msapiro        set     messages: + msg407305
2021-11-29 11:30:01  lemburg        set     nosy: + lemburg; messages: + msg407262
2021-11-29 11:03:51  erlendaasland  set     versions: + Python 3.11, - Python 3.6, Python 3.7, Python 3.8, Python 3.9
2021-11-29 01:38:04  msapiro        create