Title: Encodings and aliases do not match runtime
Type: enhancement
Stage:
Components: Documentation
Versions: Python 3.2
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: docs@python, lemburg, liturgist, loewis
Priority: low
Keywords:

Created on 2005-08-01 18:23 by liturgist, last changed 2020-09-19 19:04 by georg.brandl.

File name  Uploaded  Description
           liturgist, 2005-08-10 21:29
Messages (11)
msg25927 - (view) Author: liturgist (liturgist) Date: 2005-08-01 18:23
The 2.4.1 documentation has a list of standard encodings in
section 4.9.2.  However, this list does not seem to match
what is returned by the runtime.  Below is code to dump out
the encodings and aliases.  Please tell me if anything
is incorrect.

In some cases, there are many more valid aliases than
listed in the documentation.  See 'cp037' as an example.

I see that the identifiers are intended to be case
insensitive.  I would prefer to see the documentation
provide the identifiers as they will appear in
encodings.aliases.aliases.  The only alias containing
any upper-case letters appears to be 'csHPRoman8' (for
hp_roman8).

$ cat
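#!/usr/bin/env python
import sys
import encodings.aliases

def main():
    # Map each codec name to the list of aliases that resolve to it.
    enchash = {}

    for enc in encodings.aliases.aliases.values():
        enchash[enc] = []
    for encalias in encodings.aliases.aliases.keys():
        enchash[encodings.aliases.aliases[encalias]].append(encalias)

    # Print the codecs in sorted order, each followed by its aliases.
    for enc in sorted(enchash.keys()):
        print("%s %s" % (enc, enchash[enc]))

if __name__ == '__main__':
    main()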
13:12 pwatson [ ] 366
$ ./
ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367',
'iso646_us', 'us', 'cp367', '646', 'us_ascii',
'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991',
base64_codec ['base_64', 'base64']
big5 ['csbig5', 'big5_tw']
big5hkscs ['hkscs', 'big5_hkscs']
bz2_codec ['bz2']
cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl',
'037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
cp1026 ['csibm1026', 'ibm1026', '1026']
cp1140 ['1140', 'ibm1140']
cp1250 ['1250', 'windows_1250']
cp1251 ['1251', 'windows_1251']
cp1252 ['windows_1252', '1252']
cp1253 ['1253', 'windows_1253']
cp1254 ['1254', 'windows_1254']
cp1255 ['1255', 'windows_1255']
cp1256 ['1256', 'windows_1256']
cp1257 ['1257', 'windows_1257']
cp1258 ['1258', 'windows_1258']
cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
cp437 ['ibm437', '437', 'cspc8codepage437']
cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch',
cp775 ['cspc775baltic', '775', 'ibm775']
cp850 ['ibm850', 'cspc850multilingual', '850']
cp852 ['ibm852', '852', 'cspcp852']
cp855 ['csibm855', 'ibm855', '855']
cp857 ['csibm857', 'ibm857', '857']
cp860 ['csibm860', 'ibm860', '860']
cp861 ['csibm861', 'cp_is', 'ibm861', '861']
cp862 ['cspc862latinhebrew', 'ibm862', '862']
cp863 ['csibm863', 'ibm863', '863']
cp864 ['csibm864', 'ibm864', '864']
cp865 ['csibm865', 'ibm865', '865']
cp866 ['csibm866', 'ibm866', '866']
cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
cp949 ['uhc', 'ms949', '949']
cp950 ['ms950', '950']
euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
euc_jisx0213 ['eucjisx0213']
euc_jp ['eucjp', 'ujis', 'u_jis']
euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001',
'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
gb18030 ['gb18030_2000']
gb2312 ['chinese', 'euc_cn', 'csiso58gb231280',
'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980',
gbk ['cp936', 'ms936', '936']
hex_codec ['hex']
hp_roman8 ['csHPRoman8', 'r8', 'roman8']
hz ['hzgb', 'hz_gb_2312', 'hz_gb']
iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992',
'iso_ir_157', 'iso_8859_10', 'latin6']
iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
iso8859_13 ['iso_8859_13']
iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8',
'iso_8859_14_1998', 'iso_8859_14', 'latin8']
iso8859_15 ['iso_8859_15']
iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226',
'latin10', 'iso_8859_16']
iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101',
'iso_8859_2', 'iso_8859_2_1987', 'latin2']
iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109',
'csisolatin3', 'iso_8859_3', 'latin3']
iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110',
'iso_8859_4', 'iso_8859_4_1988', 'latin4']
iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic',
'csisolatincyrillic', 'iso_ir_144']
iso8859_6 ['iso_8859_6_1987', 'iso_ir_127',
'csisolatinarabic', 'asmo_708', 'iso_8859_6',
'ecma_114', 'arabic']
iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7',
'iso_ir_126', 'elot_928', 'iso_8859_7_1987',
'csisolatingreek', 'greek']
iso8859_8 ['iso_8859_8_1988', 'iso_ir_138',
'iso_8859_8', 'csisolatinhebrew', 'hebrew']
iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9',
'csisolatin5', 'latin5', 'iso_ir_148']
johab ['cp1361', 'ms1361']
koi8_r ['cskoi8r']
latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1',
'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1',
'latin1', 'iso_8859_1_1987', '8859']
mac_cyrillic ['maccyrillic']
mac_greek ['macgreek']
mac_iceland ['maciceland']
mac_latin2 ['maccentraleurope', 'maclatin2']
mac_roman ['macroman']
mac_turkish ['macturkish']
mbcs ['dbcs']
ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
quopri_codec ['quopri', 'quoted_printable',
rot_13 ['rot13']
shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
tactis ['tis260']
tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0',
'iso_ir_166', 'tis_620_0']
utf_16 ['utf16', 'u16']
utf_16_be ['utf_16be', 'unicodebigunmarked']
utf_16_le ['utf_16le', 'unicodelittleunmarked']
utf_7 ['u7', 'utf7']
utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
uu_codec ['uu']
zlib_codec ['zlib', 'zip']
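
The observation above about upper-case letters in alias names can be checked against whatever runtime is at hand. The following is a minimal sketch, assuming Python 3; the helper name aliases_with_uppercase is invented for illustration, and the result depends on the alias table shipped with the interpreter.

```python
import encodings.aliases


def aliases_with_uppercase():
    """Return the alias keys that contain at least one upper-case letter."""
    return sorted(
        alias
        for alias in encodings.aliases.aliases
        if alias != alias.lower()
    )


if __name__ == "__main__":
    # Show each mixed-case alias together with the codec it resolves to.
    for alias in aliases_with_uppercase():
        print(alias, "->", encodings.aliases.aliases[alias])
```

Because codec lookup normalizes names before consulting the table, a mixed-case key is mostly a cosmetic inconsistency rather than a functional one.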
msg25928 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-08-04 14:47
Doc patches are welcome.  Perhaps you could enhance your
script to generate the doc table from the available
codecs and aliases?

msg25929 - (view) Author: liturgist (liturgist) Date: 2005-08-05 17:53
I would very much like to produce the doc table from code. 
However, I have a few questions.

It seems that encodings.aliases.aliases is a list of all
encodings, not necessarily those supported on every
machine (e.g. mbcs on UNIX, or embedded systems that might
exclude some large character sets to save space).  Is this
correct?  If so, will it remain that way?

To find out if an encoding is supported on the current
machine, the code should handle the exception generated when
codecs.lookup() fails.  Right?
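
The check described in the preceding paragraph could look roughly like this. It is a sketch assuming Python 3; the helper names is_supported and unsupported_codecs are hypothetical, and the only library behaviour relied on is that codecs.lookup() raises LookupError for a codec it cannot find.

```python
import codecs
import encodings.aliases


def is_supported(name):
    """Return True if codecs.lookup() can find a codec called `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def unsupported_codecs():
    """List alias targets that this interpreter cannot actually load."""
    targets = set(encodings.aliases.aliases.values())
    return sorted(t for t in targets if not is_supported(t))


if __name__ == "__main__":
    print(unsupported_codecs())
```

The output varies by platform and build, which is the point of the question: the alias table names more codecs than any one machine necessarily provides.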

To generate the table, I need to produce the "Languages"
field.  This information does not seem to be available from
the Python runtime.  I would much rather see this
information, including a localized version of the string,
come from the Python runtime, rather than hardcode it into
the script.  Is that a possibility?   Would it be a better

The non-language oriented encodings such as base_64 and
rot_13 do not seem to have anything that distinguishes them
from human languages.  How can these be separated out
without hardcoding?

Likewise, the non-language encodings have an "Operand type"
field which would need to be generated.  My feeling is,
again, that this should come from the Python runtime and not
be hardcoded into the doc generation script.  Any suggestions?
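
One way to separate the two groups without a hand-written list, on later interpreters, is to let str.encode() do the classification. This is a sketch assuming Python 3.4 or newer, where str.encode() raises LookupError for codecs that are not text encodings; it was not available when this question was asked, and the function name classify is made up for the example.

```python
import codecs


def classify(name):
    """Return 'unsupported', 'text', or 'non-text' for a codec name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return "unsupported"
    try:
        "a".encode(name)
    except LookupError:
        # The codec exists, but str.encode() refuses it because it is
        # not a text encoding (base64_codec, rot_13, and similar).
        return "non-text"
    except UnicodeError:
        # A text encoding that happens not to be able to encode "a".
        return "text"
    return "text"


if __name__ == "__main__":
    for name in ("utf_8", "cp037", "base64_codec", "rot_13"):
        print(name, classify(name))
```

This answers only the text versus non-text question. It does not recover the "Languages" or "Operand type" columns, which would still have to be supplied by hand.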
msg25930 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-08-06 12:41
I would not like to see the documentation contain a complete
list of all aliases. The documentation points out that these
are "a few common aliases", i.e. I selected aliases that
people are likely to encounter, and are encouraged to use.

I don't think it is useful to produce the table from the
code. If you want to know everything in aliases, just look
at aliases directly.
msg25931 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-08-06 12:49
Martin, I don't see any problem with putting the complete
list of aliases into the documentation.

liturgist, don't worry about hard-coding things into the
script. The extra information Martin gave in the table is
not likely to become part of the standard lib, because
there's not a lot you can do with it programmatically.
msg25932 - (view) Author: liturgist (liturgist) Date: 2005-08-10 21:29
The script attached generates two HTML tables in files
specified on the command line.


A static list of codecs in this script is used because the
language description is not available in the Python runtime.
Codecs found in the encodings.aliases.aliases list are
added to the list, but will be described as "unknown" encodings.

The "bijectiveType" was, like the language descriptions,
taken from the current (2.4.1) documentation.

It would be much better for the descriptions and "bijective"
type settings to come from the runtime.  The problem is one
of maintenance.  Without these available for introspection
in the runtime, a new encoding with no alias will never be
identified.  When it does appear with an alias, it can only
be described as "unknown."
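
The approach described in this message, a static description list merged with the runtime alias table, can be sketched as follows. This is not the attached script: the names KNOWN_LANGUAGES and build_rows are hypothetical, the three descriptions are a small illustrative sample, and the output is plain text rather than HTML.

```python
import encodings.aliases

# Hypothetical hand-maintained descriptions; a real list would be longer.
KNOWN_LANGUAGES = {
    "ascii": "English",
    "latin_1": "West Europe",
    "utf_8": "all languages",
}


def build_rows(descriptions=KNOWN_LANGUAGES):
    """Return sorted (codec, aliases, languages) rows for a doc table."""
    grouped = {}
    for alias, codec in encodings.aliases.aliases.items():
        grouped.setdefault(codec, []).append(alias)
    # A codec described by hand but lacking any alias still gets a row.
    for codec in descriptions:
        grouped.setdefault(codec, [])
    return [
        (codec,
         ", ".join(sorted(grouped[codec])),
         descriptions.get(codec, "unknown"))
        for codec in sorted(grouped)
    ]


if __name__ == "__main__":
    for codec, aliases, languages in build_rows():
        print("%-16s %-40s %s" % (codec, aliases, languages))
```

The sketch has the maintenance problem the message points out: a codec that is in neither the static list nor the alias table never appears, and one found only through an alias can only be labelled "unknown".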
msg25933 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-08-10 21:59
I do see a problem with generating these tables
automatically. It suggests to the reader that the aliases are
all equally relevant. However, I bet few people have ever
heard of or used, say, 'cspc850multilingual'.

As for the actual patch: Please don't generate HTML.
Instead, TeX should be generated, as this is the primary
source. Also please add a patch to the current TeX file,
updating it appropriately.
msg25934 - (view) Author: liturgist (liturgist) Date: 2005-08-11 02:54
For example: there appears to be a codec for iso8859-1, but
it has no alias in the encodings.aliases.aliases list and it
is not in the current documentation.

What is the relationship of iso8859_1 to latin_1?  Should
iso8859_1 be considered a base codec?  When should iso8859_1
be used rather than latin_1?
msg25935 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-08-11 05:56
I think the presence of is a bug, resulting
from automatic generation of these files. The file should be
deleted; iso8859-1 should be encoded through the alias to
latin-1. Thanks for pointing that out.
msg25936 - (view) Author: liturgist (liturgist) Date: 2005-08-11 14:31
If it does not present a problem, making latin_1 an alias
for iso8859_1 as the base codec would present the ISO
standards as a complete, orthogonal set.  The alias would
mean that no existing code is broken.  Right?

Would this approach present any problem?   Should this
become a separate bug entry?
msg25937 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-08-11 22:10
It does present a problem: the latin-1 codec is faster than
the iso8859-1 codec, as it is a special case in C (employing
the fact that Latin-1 and Unicode share the first 256 code
points). So I think the iso8859-1 codec should be dropped. But, as
you guess, this is an issue independent of the documentation
issue at hand, and should be reported (and resolved) separately.
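
The property mentioned above, that Latin-1 and Unicode share the first 256 code points, can be demonstrated in a few lines. This is a sketch assuming Python 3; the function names are invented for the example, and it compares results only and says nothing about the relative speed of the two code paths.

```python
def latin1_matches_code_points():
    """Check that every byte decodes to the code point of the same value."""
    data = bytes(range(256))
    text = data.decode("latin_1")
    return [ord(ch) for ch in text] == list(data)


def same_result(name_a="latin_1", name_b="iso8859_1"):
    """Check that two codec names decode all 256 byte values identically."""
    data = bytes(range(256))
    return data.decode(name_a) == data.decode(name_b)


if __name__ == "__main__":
    print(latin1_matches_code_points())
    print(same_result())
```

Since both names give identical results, the choice between them is about implementation and naming, which fits the suggestion to handle it as a separate report.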
Date                 User          Action  Args
2020-09-19 19:04:08  georg.brandl  set     nosy: - georg.brandl
2010-08-21 18:46:24  BreamoreBoy   set     assignee: georg.brandl -> docs@python
                                           nosy: + docs@python
                                           versions: + Python 3.2, - Python 2.7
2009-04-05 17:27:54  georg.brandl  set     assignee: georg.brandl
                                           nosy: + georg.brandl
2009-02-16 00:46:03  ajaksu2       set     priority: normal -> low
                                           type: enhancement
                                           versions: + Python 2.7, - Python 2.4
2005-08-01 18:23:30  liturgist     create