Issue 45921: codecs module doesn't support iso-8859-6-i, iso-8859-6-e, iso-8859-8-i or iso-8859-8-e

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90079

classification

Title:	codecs module doesn't support iso-8859-6-i, iso-8859-6-e, iso-8859-8-i or iso-8859-8-e
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	lemburg, msapiro
Priority:	normal	Keywords:

Created on 2021-11-29 01:38 by msapiro, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg407240	Author: Mark Sapiro (msapiro)	Date: 2021-11-29 01:38
iso-8859-6-i, iso-8859-6-e, iso-8859-8-i and iso-8859-8-e are all IANA-recognized character sets per https://www.iana.org/assignments/character-sets/character-sets.xhtml. None of them is recognized by codecs.lookup().
msg407262	Author: Marc-Andre Lemburg (lemburg)	Date: 2021-11-29 11:30
Even though these are IANA-recognized encodings, we need to apply the same logic as we do for all new encodings, which essentially boils down to: are these encodings in widespread use today?
Reading through RFC 1556, it seems that the added -i or -e suffixes are just indications for applications on how to interpret BIDI information: either implicitly, by looking at the order of characters in the stream, or explicitly, via control characters embedded in the stream. They are not new encodings with new mappings. If that's a correct interpretation, we could add those as aliases for the non-annotated encodings.
After more than 20 years with Unicode support in Python and the world moving towards UTF-8, I have become fairly reluctant to add more encoding support to Python. If people are still using unsupported encodings, it's probably better to point them to dedicated tools for converting text to UTF-8, e.g. iconv, than to extend the already extensive support we have in Python.
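Until such aliases exist in the standard library, an application could approximate the alias idea itself. The following is only a sketch under the interpretation given above (that the -i/-e names share their byte mappings with the plain encodings); the table `_BIDI_ALIASES` and the function `bidi_alias_search` are hypothetical names chosen for illustration and are not part of Python.

```python
import codecs

# Hypothetical alias table: keys are in the normalized form that
# codecs.lookup() hands to search functions (lowercase, with hyphens
# turned into underscores on Python 3.9+).
_BIDI_ALIASES = {
    "iso_8859_6_i": "iso8859_6",
    "iso_8859_6_e": "iso8859_6",
    "iso_8859_8_i": "iso8859_8",
    "iso_8859_8_e": "iso8859_8",
}


def bidi_alias_search(name):
    """Codec search function (hypothetical) for the RFC 1556 names."""
    # Normalize here as well, since older Python versions pass the
    # name through with hyphens intact.
    key = name.lower().replace("-", "_")
    target = _BIDI_ALIASES.get(key)
    if target is None:
        return None  # let other registered search functions try
    return codecs.lookup(target)


# Register once, early in application start-up.
codecs.register(bidi_alias_search)

# Decodes the same bytes as plain iso-8859-8 would.
print(b"\xf9\xec\xe5\xed".decode("iso-8859-8-i"))
```

This treats the suffix purely as a naming difference; any handling of implicit versus explicit directionality is left to the application that displays the text.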
msg407305	Author: Mark Sapiro (msapiro)	Date: 2021-11-29 18:08
The Mailman-users@python.org list received a post whose From: header contained a Hebrew display name, RFC 2047 encoded with the iso-8859-8-i charset. Processing it raised "LookupError: unknown encoding: iso-8859-8-i", and the message was shunted. The message body also declared its charset as iso-8859-8-i, although it contained only ASCII. Unfortunately, I don't have the original message, so I can't say which MUA created it or how common this usage is. I do think that just adding these as aliases for the non-annotated encodings is an appropriate response.
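The failure mode described here can be illustrated with the standard email package. This is a hedged sketch, not Mailman's actual code: the header value is made up, and `decode_display_name` and `_BIDI_FALLBACKS` are hypothetical names. It decodes an RFC 2047 encoded word and, only if the declared charset is unknown, retries with the non-annotated encoding.

```python
from email.header import decode_header

# Hypothetical fallback table for the RFC 1556 charset names.
_BIDI_FALLBACKS = {
    "iso-8859-6-i": "iso-8859-6",
    "iso-8859-6-e": "iso-8859-6",
    "iso-8859-8-i": "iso-8859-8",
    "iso-8859-8-e": "iso-8859-8",
}


def decode_display_name(raw_header):
    """Decode an RFC 2047 header value, tolerating the -i/-e charsets."""
    parts = []
    for payload, charset in decode_header(raw_header):
        if isinstance(payload, str):
            # Headers without encoded words come back as plain str.
            parts.append(payload)
            continue
        # Unencoded fragments have no charset; assume ASCII for them.
        charset = (charset or "ascii").lower()
        try:
            parts.append(payload.decode(charset))
        except LookupError:
            # Unknown codec name: retry only for the known variants.
            fallback = _BIDI_FALLBACKS.get(charset)
            if fallback is None:
                raise
            parts.append(payload.decode(fallback))
    return "".join(parts)


# Made-up example header carrying a Hebrew display name.
print(decode_display_name("=?iso-8859-8-i?B?+ezl7Q==?="))
```

A per-call fallback like this keeps the workaround local to mail processing and leaves the global codec registry untouched.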

History
Date	User	Action	Args
2022-04-11 14:59:52	admin	set	github: 90079
2021-11-29 18:08:04	msapiro	set	messages: + msg407305
2021-11-29 11:30:01	lemburg	set	nosy: + lemburg messages: + msg407262
2021-11-29 11:03:51	erlendaasland	set	versions: + Python 3.11, - Python 3.6, Python 3.7, Python 3.8, Python 3.9
2021-11-29 01:38:04	msapiro	create