Message 187542 - Python tracker


Author	terry.reedy
Recipients	Tomoki.Imai, ezio.melotti, pradyunsg, r.david.murray, roger.serwy, terry.reedy
Date	2013-04-22.01:26:50
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1366594012.32.0.0356144768575.issue17348@psf.upfronthosting.co.za>
In-reply-to

Content

When discussing problematical behavior, one should specify the OS and the exact Python version, including the bugfix number. If at all possible, one should use the latest bugfix release, which has all released bugfixes. 2.7.3 came out 10+ months before the original report. I do not presume without evidence that it has the same behavior as 2.7.2. The recently released 2.7.4 has another year of bugfixes, so it might also behave differently.
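One way to collect that information for a report is sketched below. The helper name environment_summary is hypothetical, chosen only for illustration; sys.version_info and platform.platform() are standard library calls available in both Python 2.7 and Python 3.

```python
import platform
import sys


def environment_summary():
    """Return the exact Python version (with bugfix number) and the OS."""
    # sys.version_info[:3] is (major, minor, micro); micro is the bugfix number.
    version = "%d.%d.%d" % sys.version_info[:3]
    return "Python %s on %s" % (version, platform.platform())


print(environment_summary())
```

Pasting the printed line into a report removes any doubt about which bugfix release was tested.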

Looking again at the original report, I see that the false issue of lost encoding obscured for me a real problem: ord(u'€') is 8364, not 128. Does 2.7.4 make the same error for that input? What does it do with u"こんにちは"?
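A minimal sketch of how a value of 128 could arise. It assumes, purely for illustration, that the euro sign is encoded with cp1252 and that the resulting byte is then decoded with latin-1; the report does not confirm which codecs were actually involved.

```python
euro = u"\u20ac"  # the euro sign, written as an escape so the source stays ASCII

# The correct code point.
correct = ord(euro)

# Assumption: the character is encoded with the Windows code page cp1252,
# where the euro sign is the single byte 0x80.
cp1252_bytes = euro.encode("cp1252")

# Assumption: that byte is then decoded with a different codec (latin-1),
# which maps byte 0x80 straight to code point U+0080.
mangled = cp1252_bytes.decode("latin-1")

print(correct, ord(mangled))
```

Under these assumptions the first number is 8364 and the second is 128, which matches the symptom in the original report.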

(Note, on the Windows console, both keying and viewing unicode chars is problematical, apparently more so than with the *nix consoles. If I could not paste u"こんにちは", I would most likely just key u'\u3053\u3093\u306b\u3061\u306f'.)
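The escaped spelling does not have to be looked up by hand. The sketch below produces it from a string; the function name to_unicode_escapes is hypothetical, and the four-digit format only covers characters in the Basic Multilingual Plane (code points up to U+FFFF).

```python
def to_unicode_escapes(text):
    # Spell every character as a backslash-u escape with four hex digits.
    # Assumption: all characters are in the BMP (code point <= 0xFFFF).
    return "".join("\\u%04x" % ord(ch) for ch in text)


greeting = u"こんにちは"
escaped_literal = 'u"%s"' % to_unicode_escapes(greeting)
print(escaped_literal)
```

The printed literal contains only ASCII characters, so it can be keyed on any console.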

I believe the underlying problem is that a Python 2 program is a stream of bytes while a Python 3 program is a stream of unicode codepoints. So in Python 2, a unicode literal has to be encoded to bytes before being decoded back to unicode codepoints in a unicode string object.

David, I presume this is why you say we cannot just toss out the encoding to bytes. I presume that you are also suggesting that the encoding and subsequent decoding are done with different codecs because of locale issues. Might IOBinding.encoding be miscalculated?
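To illustrate what mismatched codecs do to a non-ascii string, here is a small sketch. The choice of utf-8 and latin-1 is an assumption made for the example; it says nothing about which codecs IDLE actually used.

```python
# The same five characters, written with escapes so the source stays ASCII.
text = u"\u3053\u3093\u306b\u3061\u306f"

# Encoding and decoding with the same codec is a lossless round trip.
same_codec = text.encode("utf-8").decode("utf-8")

# Decoding with a different codec (latin-1 here, as an assumption) turns
# each of the 15 UTF-8 bytes into its own, unrelated character.
mixed_codecs = text.encode("utf-8").decode("latin-1")

print(same_codec == text, mixed_codecs == text, len(text), len(mixed_codecs))
```

The mismatched round trip raises no exception, which is why this kind of corruption tends to surface only when the string is displayed or passed to ord().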

For ascii codepoints, encoding and decoding are typically null operations, whichever codecs are used. This means that \u#### escapes, as opposed to non-ascii codepoints, should not get mangled before being interpreted during the creation of the unicode object. Using such escapes is one solution to the problem.
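The claim about escapes can be checked with a short sketch. The list of codecs is an arbitrary selection of ASCII-compatible ones, and the unicode_escape codec stands in here for the interpreter's own handling of escapes in a unicode literal.

```python
# The escaped spelling as it would appear in source: ASCII characters only.
escaped = "\\u3053\\u3093\\u306b\\u3061\\u306f"

# Assumption: these ASCII-compatible codecs are representative.
codecs_to_try = ("ascii", "utf-8", "cp1252", "latin-1")

# Every codec yields the same bytes for ASCII-only text...
byte_forms = set(escaped.encode(name) for name in codecs_to_try)

# ...so even a mismatched encode/decode pair leaves the text unchanged.
survives_mismatch = all(
    escaped.encode(enc).decode(dec) == escaped
    for enc in codecs_to_try
    for dec in codecs_to_try
)

# Interpreting the escapes afterwards recovers the intended characters.
interpreted = escaped.encode("ascii").decode("unicode_escape")

print(len(byte_forms), survives_mismatch, len(interpreted))
```

Because the escaped text is identical under every codec tried, a wrong guess about the encoding cannot alter it before the escapes are interpreted.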

Another is to use Python 3. That *is* the generic answer to many Python 2.x unicode problems. In 3.3.1:
>>> u"こんにちは"
'こんにちは'
problem solved ;-).

In other words, fixing 2.7-only unicode bugs has fairly low priority in general. However, if there is an easy fix here that Roger thinks is safe, it can be applied.

History
Date	User	Action	Args
2013-04-22 01:26:52	terry.reedy	set	recipients: + terry.reedy, ezio.melotti, roger.serwy, r.david.murray, pradyunsg, Tomoki.Imai
2013-04-22 01:26:52	terry.reedy	set	messageid: <1366594012.32.0.0356144768575.issue17348@psf.upfronthosting.co.za>
2013-04-22 01:26:52	terry.reedy	link	issue17348 messages
2013-04-22 01:26:50	terry.reedy	create