Issue 3995: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48245

classification

Title:	iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
Type:		Stage:
Components:		Versions:	Python 2.5

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	jmfauth, loewis, rpetrov, vstinner
Priority:	normal	Keywords:

Created on 2008-09-29 09:57 by jmfauth, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg74017 - (view)	Author: jmf (jmfauth)	Date: 2008-09-29 09:57
XP SP2 fr_CH cp1252

I have always found that there are some inconsistencies in the Python <=2.5 series regarding the character encodings, especially the iso-8859-1, cp1252 and iso-8859-15 encodings. I do not know whether this should be considered a bug or a feature. Python is quite friendly with these encodings. It may not be a problem in daily work, but it is more acute when one wishes to teach character encodings.

char "œ": "code point" 156 in cp1252
char "€": "code point" 128 in cp1252

Python 2.5.2

>>> unicode('œ', 'cp1252')
u'\u0153'
>>> unicode('€', 'cp1252')
u'\u20ac'
>>> unicode('œ', 'iso-8859-15')
u'\x9c'
>>> unicode('€', 'iso-8859-15')
u'\x80'
>>> unicode('€', 'iso-8859-1') #*
u'\x80'
>>> unicode('œ', 'iso-8859-1') #*
u'\x9c'
>>> #* should raise an error since œ and €
>>> #do not exist in an iso-8859-1 table.
>>>

It looks like iso-8859-1 behaves as iso-8859-15 (typo somewhere?)

Python 3.0 rc1 does the job correctly and notices the difference

>>> bytes('œ', 'cp1252')
b'\x9c'
>>> bytes('€', 'cp1252')
b'\x80'
>>> bytes('œ', 'iso-8859-15')
b'\xbd'
>>> bytes('€', 'iso-8859-15')
b'\xa4'
>>> bytes('œ', 'iso-8859-1')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    bytes('œ', 'iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0153' in position 0: ordinal not in range(256)
>>> bytes('€', 'iso-8859-1')
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    bytes('€', 'iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
>>> # these errors are expected
>>>

Python 2.6

The latest version is not installed. If I recall correctly, the 2.6b* releases present the same issue as 2.5.2.
msg74018 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-09-29 10:14
If you write "€" in the Python interpreter (Python 2), you will get a byte string encoded in your terminal charset. Example on Linux (utf-8):

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> '€'
'\xe2\x82\xac'

Use the "u" prefix to get a unicode string:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> u'€'
u'\u20ac'

If you use unicode, encoding to ISO-8859-1/-15 works correctly. (Truncated) example with Python trunk:

Python 2.6rc2+ (trunk:66680M, Sep 29 2008, 12:03:32)
>>> u'€'.encode('ISO-8859-1')
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac'
>>> u'€'.encode('ISO-8859-15')
'\xa4'

In a script (Python code written in a file), use the #coding header to specify your file charset. Or use the "\xXX", "\uXXXX" and "\UXXXX" notations for non-ASCII characters.

Is there a Python Unicode FAQ somewhere? :-)
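A minimal Python 3 sketch of the point made in this message, assuming the terminal encoding can be simulated by encoding a text string explicitly. The helper name `bytes_from_terminal` is hypothetical and only illustrates what a Python 2 byte-string literal would contain under each terminal charset.

```python
# Sketch only: simulates what a Python 2 '€' literal would hold, i.e. the raw
# bytes produced by the terminal charset. The helper name is hypothetical.
EURO = "\u20ac"


def bytes_from_terminal(text, terminal_encoding):
    """Return the bytes a terminal using `terminal_encoding` would send."""
    return text.encode(terminal_encoding)


utf8_bytes = bytes_from_terminal(EURO, "utf-8")      # b'\xe2\x82\xac'
cp1252_bytes = bytes_from_terminal(EURO, "cp1252")   # b'\x80'

# Decoding the cp1252 byte as ISO-8859-1 does not fail: every byte is valid
# in that codec, so the result is the C1 control character U+0080.
misdecoded = cp1252_bytes.decode("iso-8859-1")

# Encoding the real euro sign to ISO-8859-1 is what fails.
try:
    EURO.encode("iso-8859-1")
    latin1_can_encode_euro = True
except UnicodeEncodeError:
    latin1_can_encode_euro = False

print(utf8_bytes, cp1252_bytes, repr(misdecoded), latin1_can_encode_euro)
```

The asymmetry is the crux of the report: decoding arbitrary bytes as ISO-8859-1 always succeeds, while encoding a character outside U+0000..U+00FF raises an error.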
msg74049 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-29 21:30
> >>> unicode('€', 'iso-8859-15')
> u'\x80'
> >>> unicode('€', 'iso-8859-1') #***
> u'\x80'
>
> It looks like iso-8859-1 behaves as iso-8859-15 (typo somewhere?)

That's correct, and intentional. iso-8859-1 and iso-8859-15 are indeed the same for the respective code points (i.e. \x80 and \x9c).
msg74095 - (view)	Author: Roumen Petrov (rpetrov) *	Date: 2008-09-30 18:35
I don't know iso codeset that define characters in code range 0x80 0x9f. This range is reserved for control symbols. The code of the euro sign is 0xa4 in iso-8859-15. The changes also include symbols like 1/2 and 3/4, plus other differences that I have forgotten.
msg74097 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-30 20:16
> I don't know iso codeset that define characters in code range 0x80 0x9f.

That's not true. In ISO-8859-1 (at least, in the IANA charset), these characters are indeed assigned - for control functions. So the ISO-8859-1 byte \x80 corresponds to the Unicode character U+0080, likewise for \x9c and U+009c.

ISO-8859-1 and ISO-8859-15 do have differences, but not for the range 0x80..0x9f - they are identical in that range (namely, referring to control characters).
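The claim above can be checked against Python's own codec tables. This is a small Python 3 sketch, assuming the standard `iso-8859-1` and `iso-8859-15` codecs shipped with CPython; the helper name `decode_table` is made up for illustration.

```python
# Sketch: compare how the two codecs decode each of the 256 byte values.
def decode_table(codec):
    """Map every byte value 0..255 to the character `codec` decodes it to."""
    return {b: bytes([b]).decode(codec) for b in range(256)}


latin1 = decode_table("iso-8859-1")
latin9 = decode_table("iso-8859-15")

# In 0x80..0x9f both codecs map byte b to the control character U+00bb.. i.e. chr(b).
c1_identical = all(latin1[b] == latin9[b] == chr(b) for b in range(0x80, 0xA0))

# Byte values where the two codecs disagree.
differing = [b for b in range(256) if latin1[b] != latin9[b]]

print("identical in 0x80..0x9f:", c1_identical)
for b in differing:
    print(f"0x{b:02x}: iso-8859-1 U+{ord(latin1[b]):04X}, "
          f"iso-8859-15 U+{ord(latin9[b]):04X}")
```

With these codecs the only differing bytes are 0xa4, 0xa6, 0xa8, 0xb4, 0xb8, 0xbc, 0xbd and 0xbe, all outside the control range.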
msg74141 - (view)	Author: Roumen Petrov (rpetrov) *	Date: 2008-10-01 19:38
Thanks Martin for the correction: yes, not reserved - assigned.

Jean-Michel, your test case is incorrect. Your terminal runs in CP1252, where byte \x80 is shown as the euro sign. But if you run the terminal (if that is possible on the reported operating system) in ISO-8859-{1|15}, this byte is a control character without visual representation.

To avoid "visual" ambiguity you should use the hex representation of characters:

1)
>>> unicode('\x80', 'cp1252')
u'\u20ac'

2)
>>> unicode('\xa4','iso-8859-15')
u'\u20ac'

For the second case, in your terminal you should enter:

>>> unicode('¤','iso-8859-15')
u'\u20ac'

'¤'(cp1252) = \xa4 = 164 = '€'(iso-8859-15)

I guess you now understand what is incorrect in your report.
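The same hex-based check can be written for Python 3, where byte strings and text are separate types and the terminal charset no longer affects the result. A minimal sketch, with the list of cases chosen here purely for illustration:

```python
# Sketch: decode explicit byte values so the terminal charset plays no role.
cases = [
    (b"\x80", "cp1252"),       # euro sign in cp1252
    (b"\xa4", "iso-8859-15"),  # euro sign in iso-8859-15
    (b"\xa4", "cp1252"),       # currency sign in cp1252
    (b"\x80", "iso-8859-15"),  # C1 control character in iso-8859-15
]

results = {(raw, codec): raw.decode(codec) for raw, codec in cases}

for (raw, codec), char in results.items():
    print(f"{raw!r} decoded as {codec}: U+{ord(char):04X}")
```

Both euro cases yield U+20AC, while the byte 0xa4 read as cp1252 yields U+00A4 and the byte 0x80 read as iso-8859-15 yields U+0080, which matches the equivalence stated in the message above.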

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48245
2008-10-01 19:38:45	rpetrov	set	messages: + msg74141
2008-09-30 20:16:30	loewis	set	messages: + msg74097 title: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.* -> iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
2008-09-30 18:35:56	rpetrov	set	nosy: + rpetrov messages: + msg74095
2008-09-29 21:30:26	loewis	set	nosy: + loewis messages: + msg74049 title: iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.* -> iso-xxx/cp1252 inconsistencies in Python 2.* not in 3.*
2008-09-29 12:51:16	georg.brandl	set	status: open -> closed resolution: not a bug
2008-09-29 10:14:30	vstinner	set	nosy: + vstinner messages: + msg74018
2008-09-29 09:57:02	jmfauth	create