Issue 1249749: Encodings and aliases do not match runtime


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42237

classification

Title:	Encodings and aliases do not match runtime
Type:	enhancement	Stage:
Components:	Documentation	Versions:	Python 3.2

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, lemburg, liturgist, loewis
Priority:	low	Keywords:

Created on 2005-08-01 18:23 by liturgist, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
encodingaliases.py	liturgist, 2005-08-10 21:29	encodingaliases.py

Messages (11)
msg25927 - (view)	Author: liturgist (liturgist)	Date: 2005-08-01 18:23
2.4.1 documentation has a list of standard encodings in 4.9.2. However, this list does not seem to match what is returned by the runtime. Below is code to dump out the encodings and aliases. Please tell me if anything is incorrect.

In some cases, there are many more valid aliases than listed in the documentation. See 'cp037' as an example.

I see that the identifiers are intended to be case insensitive. I would prefer to see the documentation provide the identifiers as they will appear in encodings.aliases.aliases. The only alias containing any upper case letters appears to be 'hp_roman8'.

$ cat encodingaliases.py
#!/usr/bin/env python
import sys
import encodings

def main():
    enchash = {}

    for enc in encodings.aliases.aliases.values():
        enchash[enc] = []
    for encalias in encodings.aliases.aliases.keys():
        enchash[encodings.aliases.aliases[encalias]].append(encalias)

    elist = enchash.keys()
    elist.sort()
    for enc in elist:
        print enc, enchash[enc]

if __name__ == '__main__':
    main()
    sys.exit(0)

13:12 pwatson [ ruth.knightsbridge.com:/home/pwatson/src/python ] 366
$ ./encodingaliases.py
ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367', 'iso646_us', 'us', 'cp367', '646', 'us_ascii', 'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991', 'ansi_x3.4_1968']
base64_codec ['base_64', 'base64']
big5 ['csbig5', 'big5_tw']
big5hkscs ['hkscs', 'big5_hkscs']
bz2_codec ['bz2']
cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl', '037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
cp1026 ['csibm1026', 'ibm1026', '1026']
cp1140 ['1140', 'ibm1140']
cp1250 ['1250', 'windows_1250']
cp1251 ['1251', 'windows_1251']
cp1252 ['windows_1252', '1252']
cp1253 ['1253', 'windows_1253']
cp1254 ['1254', 'windows_1254']
cp1255 ['1255', 'windows_1255']
cp1256 ['1256', 'windows_1256']
cp1257 ['1257', 'windows_1257']
cp1258 ['1258', 'windows_1258']
cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
cp437 ['ibm437', '437', 'cspc8codepage437']
cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch', 'ebcdic_cp_be']
cp775 ['cspc775baltic', '775', 'ibm775']
cp850 ['ibm850', 'cspc850multilingual', '850']
cp852 ['ibm852', '852', 'cspcp852']
cp855 ['csibm855', 'ibm855', '855']
cp857 ['csibm857', 'ibm857', '857']
cp860 ['csibm860', 'ibm860', '860']
cp861 ['csibm861', 'cp_is', 'ibm861', '861']
cp862 ['cspc862latinhebrew', 'ibm862', '862']
cp863 ['csibm863', 'ibm863', '863']
cp864 ['csibm864', 'ibm864', '864']
cp865 ['csibm865', 'ibm865', '865']
cp866 ['csibm866', 'ibm866', '866']
cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
cp949 ['uhc', 'ms949', '949']
cp950 ['ms950', '950']
euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
euc_jisx0213 ['eucjisx0213']
euc_jp ['eucjp', 'ujis', 'u_jis']
euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001', 'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
gb18030 ['gb18030_2000']
gb2312 ['chinese', 'euc_cn', 'csiso58gb231280', 'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980', 'gb2312_80']
gbk ['cp936', 'ms936', '936']
hex_codec ['hex']
hp_roman8 ['csHPRoman8', 'r8', 'roman8']
hz ['hzgb', 'hz_gb_2312', 'hz_gb']
iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992', 'iso_ir_157', 'iso_8859_10', 'latin6']
iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
iso8859_13 ['iso_8859_13']
iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8', 'iso_8859_14_1998', 'iso_8859_14', 'latin8']
iso8859_15 ['iso_8859_15']
iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226', 'latin10', 'iso_8859_16']
iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101', 'iso_8859_2', 'iso_8859_2_1987', 'latin2']
iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109', 'csisolatin3', 'iso_8859_3', 'latin3']
iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110', 'iso_8859_4', 'iso_8859_4_1988', 'latin4']
iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic', 'csisolatincyrillic', 'iso_ir_144']
iso8859_6 ['iso_8859_6_1987', 'iso_ir_127', 'csisolatinarabic', 'asmo_708', 'iso_8859_6', 'ecma_114', 'arabic']
iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7', 'iso_ir_126', 'elot_928', 'iso_8859_7_1987', 'csisolatingreek', 'greek']
iso8859_8 ['iso_8859_8_1988', 'iso_ir_138', 'iso_8859_8', 'csisolatinhebrew', 'hebrew']
iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9', 'csisolatin5', 'latin5', 'iso_ir_148']
johab ['cp1361', 'ms1361']
koi8_r ['cskoi8r']
latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1', 'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1', 'latin1', 'iso_8859_1_1987', '8859']
mac_cyrillic ['maccyrillic']
mac_greek ['macgreek']
mac_iceland ['maciceland']
mac_latin2 ['maccentraleurope', 'maclatin2']
mac_roman ['macroman']
mac_turkish ['macturkish']
mbcs ['dbcs']
ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
quopri_codec ['quopri', 'quoted_printable', 'quotedprintable']
rot_13 ['rot13']
shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
tactis ['tis260']
tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0', 'iso_ir_166', 'tis_620_0']
utf_16 ['utf16', 'u16']
utf_16_be ['utf_16be', 'unicodebigunmarked']
utf_16_le ['utf_16le', 'unicodelittleunmarked']
utf_7 ['u7', 'utf7']
utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
uu_codec ['uu']
zlib_codec ['zlib', 'zip']
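
For readers on Python 3, where the script above no longer runs as written (print is a function and dict.keys() returns a view with no sort method), a minimal sketch of the same dump might look like the following. The helper name group_aliases and its optional alias_map parameter are illustrative assumptions and are not part of the attached encodingaliases.py.

```python
import encodings.aliases
from collections import defaultdict


def group_aliases(alias_map=None):
    """Return {codec_name: sorted aliases}, with codec names in sorted order.

    alias_map is a hypothetical parameter added for testing; by default the
    runtime table encodings.aliases.aliases (alias -> codec name) is used.
    """
    if alias_map is None:
        alias_map = encodings.aliases.aliases
    grouped = defaultdict(list)
    for alias, codec in alias_map.items():
        grouped[codec].append(alias)
    # Sort both the codec names and each alias list for stable output.
    return {codec: sorted(names) for codec, names in sorted(grouped.items())}


if __name__ == "__main__":
    for codec, names in group_aliases().items():
        print(codec, names)
```

The exact set of codecs and aliases printed depends on the Python version, so the output will differ from the 2.4.1 listing above.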
msg25928 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-08-04 14:47
Doc patches are welcome - perhaps you could enhance your script to have the doc table generated from the available codecs and aliases?! Thanks.
msg25929 - (view)	Author: liturgist (liturgist)	Date: 2005-08-05 17:53
I would very much like to produce the doc table from code. However, I have a few questions.

It seems that encodings.aliases.aliases is a list of all encodings and not necessarily those supported on all machines. For example, mbcs on UNIX, or embedded systems that might exclude some large character sets to save space. Is this correct? If so, will it remain that way?

To find out if an encoding is supported on the current machine, the code should handle the exception raised when codecs.lookup() fails. Right?

To generate the table, I need to produce the "Languages" field. This information does not seem to be available from the Python runtime. I would much rather see this information, including a localized version of the string, come from the Python runtime, rather than hardcode it into the script. Is that a possibility? Would it be a better approach?

The non-language oriented encodings such as base_64 and rot_13 do not seem to have anything that distinguishes them from human languages. How can these be separated out without hardcoding?

Likewise, the non-language encodings have an "Operand type" field which would need to be generated. My feeling is, again, that this should come from the Python runtime and not be hardcoded into the doc generation script.

Any suggestions?
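
The availability check asked about here could be sketched as follows. This assumes Python 3 and that a codec which cannot be loaded on the current machine surfaces as LookupError from codecs.lookup(). The helper names is_available and unavailable_codecs are hypothetical.

```python
import codecs
import encodings.aliases


def is_available(name):
    """Return True if codecs.lookup() can resolve name on this interpreter."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Raised for unknown names and for codecs that cannot be loaded here.
        return False
    return True


def unavailable_codecs():
    """List codec names from the alias table that fail to load here."""
    targets = set(encodings.aliases.aliases.values())
    return sorted(name for name in targets if not is_available(name))


if __name__ == "__main__":
    print(unavailable_codecs())
```

The result is platform dependent, which is the point of the question: the alias table names codecs that a given build may not provide.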
msg25930 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2005-08-06 12:41
I would not like to see the documentation contain a complete list of all aliases. The documentation points out that these are "a few common aliases", i.e. I selected aliases that people are likely to encounter, and are encouraged to use.

I don't think it is useful to produce the table from the code. If you want to know everything in aliases, just look at aliases directly.
msg25931 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-08-06 12:49
Martin, I don't see any problem with putting the complete list of aliases into the documentation.

liturgist, don't worry about hard-coding things into the script. The extra information Martin gave in the table is not likely to become part of the standard lib, because there's not a lot you can do with it programmatically.
msg25932 - (view)	Author: liturgist (liturgist)	Date: 2005-08-10 21:29
The script attached generates two HTML tables in files specified on the command line.

usage: encodingaliases.py <language-oriented-codecs-html-file> <non-language-oriented-codecs-html-file>

A static list of codecs in this script is used because the language description is not available in the python runtime. Codecs found in the encodings.aliases.aliases list are added to the list, but will be described as "unknown" encodings.

The "bijectiveType" was, like the language descriptions, taken from the current (2.4.1) documentation.

It would be much better for the descriptions and "bijective" type settings to come from the runtime. The problem is one of maintenance. Without these available for introspection in the runtime, a new encoding with no alias will never be identified. When it does appear with an alias, it can only be described as "unknown."
msg25933 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2005-08-10 21:59
I do see a problem with generating these tables automatically. It suggests to the reader that the aliases are all equally relevant. However, I bet few people have ever heard of or used, say, 'cspc850multilingual'.

As for the actual patch: Please don't generate HTML. Instead, TeX should be generated, as this is the primary source. Also please add a patch to the current TeX file, updating it appropriately.
msg25934 - (view)	Author: liturgist (liturgist)	Date: 2005-08-11 02:54
For example: there appears to be a codec for iso8859-1, but it has no alias in the encodings.aliases.aliases list and it is not in the current documentation. What is the relationship of iso8859_1 to latin_1? Should iso8859_1 be considered a base codec? When should iso8859_1 be used rather than latin_1?
msg25935 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2005-08-11 05:56
I think the presence of iso8859_1.py is a bug, resulting from automatic generation of these files. The file should be deleted; iso8859-1 should be encoded through the alias to latin-1. Thanks for pointing that out.
msg25936 - (view)	Author: liturgist (liturgist)	Date: 2005-08-11 14:31
If it does not present a problem, making latin_1 an alias for iso8859_1 as the base codec would present the ISO standards as a complete, orthogonal set. The alias would mean that no existing code is broken. Right? Would this approach present any problem? Should this become a separate bug entry?
msg25937 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2005-08-11 22:10
It does present a problem: the latin-1 codec is faster than the iso8859-1 codec, as it is a special case in C (employing the fact that Latin-1 and Unicode share the first 256 code points). So I think the iso8859-1 codec should be dropped. But, as you guess, this is an issue independent of the documentation issue at hand, and should be reported (and resolved) separately.
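
The property mentioned above, that Latin-1 and Unicode share the first 256 code points, can be checked directly. The following is a minimal sketch assuming Python 3; the function names are hypothetical, and the second check assumes that current interpreters resolve the name iso8859_1 to the same mapping as latin_1.

```python
def latin1_matches_code_points():
    """Check that byte value N decodes to code point N, and round-trips."""
    all_bytes = bytes(range(256))
    text = all_bytes.decode("latin_1")
    same_values = [ord(ch) for ch in text] == list(range(256))
    round_trip = text.encode("latin_1") == all_bytes
    return same_values and round_trip


def names_agree():
    """Check that 'iso8859_1' and 'latin_1' decode all 256 bytes identically."""
    all_bytes = bytes(range(256))
    return all_bytes.decode("iso8859_1") == all_bytes.decode("latin_1")


if __name__ == "__main__":
    print(latin1_matches_code_points(), names_agree())
```

Because the mapping is the identity on values 0 through 255, a codec can convert without a lookup table, which is the kind of shortcut the special case in C relies on.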

History
Date	User	Action	Args
2022-04-11 14:56:12	admin	set	github: 42237
2020-09-19 19:04:08	georg.brandl	set	nosy: - georg.brandl
2010-08-21 18:46:24	BreamoreBoy	set	assignee: georg.brandl -> docs@python nosy: + docs@python versions: + Python 3.2, - Python 2.7
2009-04-05 17:27:54	georg.brandl	set	assignee: georg.brandl nosy: + georg.brandl
2009-02-16 00:46:03	ajaksu2	set	priority: normal -> low type: enhancement versions: + Python 2.7, - Python 2.4
2005-08-01 18:23:30	liturgist	create