Issue 18624: Add alias for iso-8859-8-i which is the same as iso-8859-8

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62824

classification

Title:	Add alias for iso-8859-8-i which is the same as iso-8859-8
Type:	enhancement	Stage:	patch review
Components:	email, Unicode	Versions:	Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, bensws, das, dpg, ezio.melotti, kamie, lemburg, mvolz, r.david.murray
Priority:	normal	Keywords:	easy, needs review, patch

Created on 2013-08-01 23:04 by r.david.murray, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
adding_aliases.patch	kamie, 2014-03-16 22:19	adding aliases to the iso-8859-8.	review
8859-8_aliases_and_test.patch	bensws, 2014-06-23 00:49	added two aliases to 8859-8, commented out a missing tactis codec, added a test	review

Pull Requests
URL	Status	Linked
PR 10237	open	fbidu, 2018-10-30 12:14
PR 32279	open	dpg, 2022-04-03 02:54

Messages (11)
msg194134 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-01 23:04
Emails and web pages may specify a character set of iso-8859-8-i, which has exactly the same code points as iso-8859-8. The -i has to do with how bi-directional text is handled, but doesn't affect the encoding: http://lists.w3.org/Archives/Public/www-validator/2001Apr/0008.html
msg194165 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-08-02 08:37
Here's a usable reference: http://www.w3.org/TR/html4/struct/dirlang.html#bidi88598 +1 on adding the alias. Also see http://lists.gnu.org/archive/html/lynx-dev/2012-02/msg00041.html for how Lynx does this. The URL also mentions "iso-8859-8-e", which should probably also be aliased to "iso-8859-8". Both names only apply to visual display characteristics of the text; the encoding is the same.
msg194177 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-02 14:37
I got the impression from what I read that -e included additional control sequences, but perhaps I misunderstood and that only meant that the data stream was expected to use additional control sequences but the control codes themselves are part of the base codec? I'm specifically thinking of this statement from the linked reference: "Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used." The "cannot be expressed" seems to imply there are differences in the codec.
msg194267 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-08-03 15:33
On 02.08.2013 16:37, R. David Murray wrote:
> I got the impression from what I read that -e included additional control sequences, but perhaps I misunderstood and that only meant that the data stream was expected to use additional control sequences but the control codes themselves are part of the base codec?
>
> I'm specifically thinking of this statement from the linked reference:
>
> "Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used."
>
> The "cannot be expressed" seems to imply there are differences in the codec.

No, not really. After some more research, I found that the -i and -e suffixes are defined in RFC 1556: http://tools.ietf.org/html/rfc1556

At the codec level, these encodings are all the same. The suffixes define whether or not to interpret some of their control characters with respect to bidi text when visualizing the text.
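A minimal sketch of what "the same at the codec level" means in practice, assuming the existing iso-8859-8 codec is used for all three labels. The helper `decode_hebrew` and the `BIDI_LABELS` set are hypothetical names introduced only for illustration; they are not part of the codecs module.

```python
# Hypothetical helper: strips the RFC 1556 display suffix before decoding.
# The suffix only describes how bidi text should be displayed, so the
# byte-to-code-point mapping is that of plain iso-8859-8.
BIDI_LABELS = {"iso-8859-8-i", "iso-8859-8-e"}


def decode_hebrew(data: bytes, charset: str) -> str:
    label = charset.strip().lower()
    if label in BIDI_LABELS:
        label = "iso-8859-8"
    return data.decode(label)


sample = b"\xf9\xec\xe5\xed"  # shin, lamed, vav, final mem in ISO 8859-8
# Both labels yield the same code points.
assert decode_hebrew(sample, "ISO-8859-8-I") == decode_hebrew(sample, "iso-8859-8")
```

An alias in the codec registry would make a wrapper like this unnecessary, which is what this issue asks for.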
msg194362 - (view)	Author: Dan Søndergaard (das)	Date: 2013-08-04 12:51
Is it satisfactory to just add the -i and -e variants to ALIASES in charset.py? Or don't they qualify as "Aliases for other commonly-used names for character sets"?
msg194386 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-04 15:50
This issue is actually about adding the aliases to the codecs module. I'm not entirely sure at this point what the canonical character set name should be for email output (which is what the ALIASES table controls).
msg213509 - (view)	Author: Kamilla (kamie) *	Date: 2014-03-14 01:42
I'm not sure how the aliases should be represented. I found some examples: http://web.mit.edu/Mozilla/src/mozilla/intl/uconv/src/charsetalias.properties

So I wrote the aliases like this:
    'iso-8859-8-i' : 'iso8859_8_I',
    'iso-8859-8-e' : 'iso8859_8_E',

But I'm not sure whether I should write them as shown above or whether they should look like this:
    'iso-8859-8-i' : 'iso8859_8',
    'iso-8859-8-e' : 'iso8859_8',

And how about the tests? I couldn't locate the tests for this module. Are they inside the enconded_modules folder?
msg213765 - (view)	Author: Kamilla (kamie) *	Date: 2014-03-16 22:19
Adding aliases to the set of iso-8859-8.
msg213772 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-03-16 23:17
From Python's point of view they are both aliases of iso-8859-8, as discussed in this issue. Python does not have iso-8859-8-e and -i codecs, which your change to the alias table implies it does (the target of an entry in the aliases table is the Python codec name, and there is only iso8859_8.py, not iso8859_8_E.py or _I.py).
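The shape of the entries described above can be sketched as follows. This is an illustration under assumptions, not the contents of either attached patch: the underscore key spelling follows the existing entries in `encodings.aliases`, the table is copied rather than modified, and `resolve` is a hypothetical, simplified stand-in for the name normalization done during codec lookup.

```python
import encodings.aliases

# Work on a copy so the interpreter's real alias table is left untouched.
proposed = dict(encodings.aliases.aliases)
proposed.update({
    # Both labels point at the one existing codec module, iso8859_8.py.
    "iso_8859_8_i": "iso8859_8",
    "iso_8859_8_e": "iso8859_8",
})


def resolve(label, table):
    """Map a charset label to a codec module name (simplified)."""
    key = label.lower().replace("-", "_")
    return table.get(key, key)


print(resolve("ISO-8859-8-I", proposed))
print(resolve("iso-8859-8-e", proposed))
```

Pointing both keys at `iso8859_8` keeps the table consistent with the rule that every target names a codec module that actually exists.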
msg213773 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-03-16 23:21
The tests are in test_encodings.py. It is interesting that the tests pass with your patch applied; that indicates that there is a missing test, since we should be testing that all of the values in the aliases table are the names of existing codecs, and apparently we aren't.
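The missing check described above could look roughly like this. It is a sketch, not the test from the attached patch: `missing_alias_targets` is a hypothetical helper, and it assumes that "existing codec" means a module named after the alias target can be imported from the `encodings` package.

```python
import importlib


def missing_alias_targets(aliases):
    """Return (alias, target) pairs whose target codec module cannot be imported."""
    missing = []
    for alias, target in sorted(aliases.items()):
        try:
            importlib.import_module("encodings." + target)
        except ImportError:
            missing.append((alias, target))
    return missing


# Illustrative table: one valid target and one deliberately bogus one.
sample_table = {
    "iso_8859_8_e": "iso8859_8",
    "bogus": "no_such_codec_module",
}
print(missing_alias_targets(sample_table))
```

Running the helper over the real `encodings.aliases.aliases` table may also report platform-specific codecs whose modules cannot be imported on the current system, so a real test would probably need to account for those.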
msg221330 - (view)	Author: Ben Galin (bensws) *	Date: 2014-06-23 00:49
Added a patch with these two 8859-8 aliases and a corresponding test in test_codecs.py (couldn't find test_encodings.py mentioned in an earlier message). The test also found a missing 'tactis' codec (issue 1251921), so I've commented it out in the aliases.py file. Please take a look.

History
Date	User	Action	Args
2022-04-11 14:57:48	admin	set	github: 62824
2022-04-03 02:54:46	dpg	set	nosy: + dpg pull_requests: + pull_request30340
2018-10-30 12:14:39	fbidu	set	pull_requests: + pull_request9550
2014-08-03 19:19:09	jesstess	set	keywords: + needs review stage: needs patch -> patch review
2014-06-23 00:49:16	bensws	set	files: + 8859-8_aliases_and_test.patch nosy: + bensws messages: + msg221330
2014-03-16 23:21:59	r.david.murray	set	messages: + msg213773
2014-03-16 23:17:18	r.david.murray	set	messages: + msg213772
2014-03-16 22:19:16	kamie	set	files: + adding_aliases.patch keywords: + patch messages: + msg213765
2014-03-14 01:42:51	kamie	set	nosy: + kamie messages: + msg213509
2014-03-09 23:04:38	mvolz	set	nosy: + mvolz
2013-08-04 15:50:03	r.david.murray	set	messages: + msg194386
2013-08-04 12:51:47	das	set	nosy: + das messages: + msg194362
2013-08-03 15:33:58	lemburg	set	messages: + msg194267
2013-08-02 14:37:42	r.david.murray	set	messages: + msg194177
2013-08-02 08:37:18	lemburg	set	nosy: + lemburg messages: + msg194165
2013-08-01 23:04:29	r.david.murray	create