
classification
Title: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
Type: Stage:
Components: Versions: Python 2.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: jmfauth, loewis, rpetrov, vstinner
Priority: normal Keywords:

Created on 2008-09-29 09:57 by jmfauth, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg74017 - (view) Author: jmf (jmfauth) Date: 2008-09-29 09:57
XP SP2 fr_CH cp1252

I have always found that there are some inconsistencies in the Python <=2.5
series regarding character encodings, especially the iso-8859-1, cp1252,
and iso-8859-15 encodings.

I do not know whether this should be considered a bug or a feature.
Python is quite friendly with these encodings. It may not be a problem
in daily work, but it is more acute when one wishes to teach
character encodings.

char "œ": "code point" 156 in cp1252
char "€": "code point" 128 in cp1252

Python 2.5.2

>>> unicode('œ', 'cp1252')
u'\u0153'
>>> unicode('€', 'cp1252')
u'\u20ac'
>>> unicode('œ', 'iso-8859-15')
u'\x9c'
>>> unicode('€', 'iso-8859-15')
u'\x80'
>>> unicode('€', 'iso-8859-1') #***
u'\x80'
>>> unicode('œ', 'iso-8859-1') #***
u'\x9c'
>>> #*** should raise an error since œ and €
>>> # do not exist in the iso-8859-1 table.
>>> 

It looks like iso-8859-1 behaves as iso-8859-15 (typo somewhere?)

Python 3.0 rc1 does the job correctly and notices the difference

>>> bytes('œ', 'cp1252')
b'\x9c'
>>> bytes('€', 'cp1252')
b'\x80'
>>> bytes('œ', 'iso-8859-15')
b'\xbd'
>>> bytes('€', 'iso-8859-15')
b'\xa4'
>>> bytes('œ', 'iso-8859-1')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    bytes('œ', 'iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0153' in
position 0: ordinal not in range(256)
>>> bytes('€', 'iso-8859-1')
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    bytes('€', 'iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in
position 0: ordinal not in range(256)
>>> # these errors are expected
>>> 

Python 2.6**

The latest version is not installed. If I recall correctly, the 2.6b*
releases show the same issue as 2.5.2.
msg74018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-09-29 10:14
If you write "€" in the Python interpreter (Python2), you will get a 
*bytes* string encoded in your terminal charset. Example on Linux 
(utf-8):

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> '€'
'\xe2\x82\xac'

Use the "u" prefix to get a unicode string:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> u'€'
u'\u20ac'

If you use unicode, encoding to ISO-8859-1/-15 works correctly. 
(Truncated) example with python trunk:

Python 2.6rc2+ (trunk:66680M, Sep 29 2008, 12:03:32)
>>> u'€'.encode('ISO-8859-1')
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac'
>>> u'€'.encode('ISO-8859-15')
'\xa4'

In a script (Python code written in a file), use a #coding header to
specify your file's charset. Or use the "\xXX", "\uXXXX" and "\UXXXXXXXX"
notations for non-ASCII characters.

Is there a Python Unicode FAQ somewhere? :-)
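
For readers on Python 3, the point above can be sketched roughly as follows. This is only an illustration, not part of the original message: the variable names are made up, and `str.encode` stands in for what the terminal does when it hands the typed character to Python 2 as bytes.

```python
# Python 3 sketch of what Python 2 did implicitly with a byte-string literal.
euro = "\u20ac"  # EURO SIGN

# The bytes Python receives depend on the terminal charset (assumed here).
typed_in_utf8_terminal = euro.encode("utf-8")
typed_in_cp1252_terminal = euro.encode("cp1252")

# Decoding the cp1252 byte with another codec succeeds,
# but it yields a different character.
as_cp1252 = typed_in_cp1252_terminal.decode("cp1252")
as_latin1 = typed_in_cp1252_terminal.decode("iso-8859-1")

print(typed_in_utf8_terminal)    # b'\xe2\x82\xac'
print(typed_in_cp1252_terminal)  # b'\x80'
print(ascii(as_cp1252))          # '\u20ac'
print(ascii(as_latin1))          # '\x80'
```

The last line is the case from the report: the byte 0x80 is valid in iso-8859-1, so no error is raised when decoding it.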
msg74049 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-29 21:30
> >>> unicode('€', 'iso-8859-15')
> u'\x80'
> >>> unicode('€', 'iso-8859-1') #***
> u'\x80'
>
> It looks like iso-8859-1 behaves as iso-8859-15 (typo somewhere?)

That's correct, and intentional. iso-8859-1 and iso-8859-15 are
*indeed* the same for the respective code points (i.e. \x80 and
\x9c).
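
A quick way to check this claim, written for Python 3 as an illustrative sketch (the `results` dictionary is just a name chosen here), is to decode the two bytes from the report with each of the three codecs:

```python
# Decode the two bytes from the report with each of the three codecs.
results = {}
for raw in (b"\x80", b"\x9c"):
    results[raw] = {
        codec: raw.decode(codec)
        for codec in ("iso-8859-1", "iso-8859-15", "cp1252")
    }

for raw, decoded in results.items():
    # ascii() shows escapes, so invisible control characters stay readable.
    print(raw, {codec: ascii(ch) for codec, ch in decoded.items()})
```

Both ISO codecs return the same control character for each byte; only cp1252 maps them to '€' and 'œ'.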
msg74095 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008-09-30 18:35
I don't know of an ISO codeset that defines characters in the code range
0x80-0x9f. This range is reserved for control symbols.

The code of the euro is 0xa4 in iso-8859-15. The changes also include
symbols like 1/2 and 3/4, and I forgot the other differences.
msg74097 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-30 20:16
> I don't know of an ISO codeset that defines characters in the code range 0x80-0x9f.

That's not true. In ISO-8859-1 (at least, in the IANA charset), these
characters are indeed assigned - for control functions. So the
ISO-8859-1 byte \x80 corresponds to the Unicode character U+0080,
likewise for \x9c and U+009c. ISO-8859-1 and ISO-8859-15 do have
differences, but *not* for the range 0x80..0x9f - they are identical
in that range (namely, referring to control characters).
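
The statement above can be checked against Python's own codec tables. The following Python 3 sketch (helper and variable names are illustrative, not from the thread) lists every byte value on which the two codecs disagree and confirms that 0x80..0x9f decode identically to control characters:

```python
import unicodedata


def decode_byte(value, codec):
    """Decode a single byte value (0..255) with the given codec."""
    return bytes([value]).decode(codec)


# Byte values that the two codecs map to different characters.
differing = [
    value for value in range(256)
    if decode_byte(value, "iso-8859-1") != decode_byte(value, "iso-8859-15")
]

# The range discussed in this message: 0x80..0x9f inclusive.
c1_range = range(0x80, 0xA0)
c1_identical = all(
    decode_byte(v, "iso-8859-1") == decode_byte(v, "iso-8859-15") == chr(v)
    for v in c1_range
)
# "Cc" is the Unicode general category for control characters.
c1_all_controls = all(unicodedata.category(chr(v)) == "Cc" for v in c1_range)

print([hex(v) for v in differing])
print(c1_identical, c1_all_controls)
```

With the codecs shipped in CPython, the differing positions are 0xa4, 0xa6, 0xa8, 0xb4, 0xb8, 0xbc, 0xbd and 0xbe, all outside the 0x80..0x9f range.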
msg74141 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008-10-01 19:38
Thanks Martin for the correction: yes, not reserved but assigned.

Jean-Michel, your test case is incorrect. Your terminal runs in CP1252,
where byte \x80 is shown as the euro sign. But if you run the terminal (if
that is possible on the reported operating system) in ISO-8859-{1|15}, this
byte is a control character without a visual representation.

To avoid "visual" ambiguity you should use the hex representation of characters:
1) >>> unicode('\x80', 'cp1252')
u'\u20ac'
2) >>> unicode('\xa4','iso-8859-15')
u'\u20ac'

For the second case, in your terminal you should enter:
>>> unicode('¤','iso-8859-15')
u'\u20ac'
'¤'(cp1252) = \xa4 = 164 = '€'(iso-8859-15)

I guess you understand what is incorrect in your report.
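
The same hex-escape approach carries over to Python 3, where it might look like the sketch below. The `describe` helper is hypothetical and exists only for this illustration; it reports the code point and Unicode name so the result does not depend on how the terminal renders the byte.

```python
import unicodedata


def describe(raw, codec):
    """Return (code point, Unicode name) for a single byte decoded with codec."""
    ch = raw.decode(codec)
    # Control characters have no Unicode name, so supply a fallback.
    return "U+%04X" % ord(ch), unicodedata.name(ch, "<no name>")


print(describe(b"\x80", "cp1252"))       # ('U+20AC', 'EURO SIGN')
print(describe(b"\xa4", "iso-8859-15"))  # ('U+20AC', 'EURO SIGN')
print(describe(b"\xa4", "cp1252"))       # ('U+00A4', 'CURRENCY SIGN')
print(describe(b"\x80", "iso-8859-15"))  # ('U+0080', '<no name>')
```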

History
Date                 User          Action  Args
2022-04-11 14:56:39  admin         set     github: 48245
2008-10-01 19:38:45  rpetrov       set     messages: + msg74141
2008-09-30 20:16:30  loewis        set     messages: + msg74097
                                           title: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.* -> iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
2008-09-30 18:35:56  rpetrov       set     nosy: + rpetrov
                                           messages: + msg74095
2008-09-29 21:30:26  loewis        set     nosy: + loewis
                                           messages: + msg74049
                                           title: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.* -> iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
2008-09-29 12:51:16  georg.brandl  set     status: open -> closed
                                           resolution: not a bug
2008-09-29 10:14:30  vstinner      set     nosy: + vstinner
                                           messages: + msg74018
2008-09-29 09:57:02  jmfauth       create