Message 142054 - Python tracker



Author	ezio.melotti
Recipients	Arfrever, ezio.melotti, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-14.07:15:07
Message-id	<1313306108.95.0.490686038302.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content

> If speed is more important than correctness, I can make any algorithm
> infinitely fast.  Given the choice between correct and quick, I will 
> take correct every single time.

It's a trade-off.  Using non-BMP chars is fairly unusual (many real-world applications hardly use non-ASCII chars).  Slowing everything down just to allow non-BMP chars on narrow builds is not a good idea IMHO.  Wide builds can be used if one really wants len() and other methods to work properly with non-BMP chars.
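As a rough illustration of the narrow/wide difference, the sketch below counts UTF-16 code units, which is what len() reported on a narrow build, next to the code point count.  It assumes Python 3.3 or later, where len() always counts code points; the helper name utf16_code_units is made up for this example.

```python
def utf16_code_units(s: str) -> int:
    """Count UTF-16 code units, i.e. what len() returned on a narrow build."""
    # utf-16-le uses 2 bytes per code unit and writes no byte order mark.
    return len(s.encode("utf-16-le")) // 2


bmp = "\u00e9"         # U+00E9, inside the BMP: one code unit
astral = "\U0001F600"  # U+1F600, outside the BMP: a surrogate pair

print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))
```

The two counts agree for the BMP character and differ for the non-BMP one, which is the discrepancy under discussion.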

> Plus your strings are immutable! You know how long they are and they 
> never change.  Correctness comes at a negligible cost. 

Sure, we can cache the len, but we still have to compute it at least once.  Also it's not just len(), but many other operations like slicing that are affected.

> Unicode says you can't put surrogates or noncharacters in a 
> UTF-anything stream.  It's a bug to do so and pretend it's a 
> UTF-whatever.

RFC 2279, which described the UTF-8 codec at the time, didn't say so.  Since our codec followed RFC 2279, it was producing valid UTF-8.  RFC 3629 then changed a number of things in a backward-incompatible way.  Therefore we couldn't just change the behavior of the UTF-8 codec or rename it to something else in Python 2.  We had to wait until Python 3 to fix it.
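A small sketch of the Python 3 behavior, assuming a current Python 3 interpreter: the default utf-8 codec refuses a lone surrogate, while the surrogatepass error handler lets it through using the ordinary three-byte pattern.  The variable names are only for illustration.

```python
lone_surrogate = "\ud800"

# The strict utf-8 codec rejects surrogates.
try:
    lone_surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An explicit opt-in error handler encodes the surrogate anyway.
lax_bytes = lone_surrogate.encode("utf-8", "surrogatepass")

print(strict_ok, lax_bytes)
```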

> Perl has an encoding form, which it does not call "UTF-8", that you
> can use the UTF-8 algorithm on for any code point, including 
> non-characters and surrogates and even non-Unicode code points far
> above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit 
> machines.  It's the internal format we use in memory.  But we don't
> call it real UTF-8, either.

This sounds like RFC 2279 UTF-8.  It allowed sequences of up to 6 bytes (following the same encoding scheme) and had no restrictions on surrogates (at the time I think only BMP chars existed, so there were no surrogates and the Unicode consortium had not yet decided that the limit was 0x10FFFF).
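To make the old scheme concrete, here is a minimal sketch of an encoder that follows the RFC 2279 bit layout, with sequences of up to 6 bytes covering values up to 0x7FFFFFFF and no check for surrogates.  The function name encode_rfc2279 is hypothetical; this is not the codec Python ships, and it does not cover Perl's larger internal format.

```python
def encode_rfc2279(cp: int) -> bytes:
    """Encode one value with the RFC 2279 layout (1 to 6 bytes, no surrogate check)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value out of range for RFC 2279 UTF-8")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, sequence length, lead-byte marker)
    table = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in table:
        if cp < limit:
            break
    # Continuation bytes carry 6 bits each, lowest bits last.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    return bytes([lead | cp] + tail[::-1])


print(encode_rfc2279(0x20AC))      # same bytes as modern UTF-8 for U+20AC
print(encode_rfc2279(0x7FFFFFFF))  # a 6-byte sequence, invalid under RFC 3629
```

For values that RFC 3629 still allows, the output matches str.encode("utf-8"); the two only diverge for surrogates and for values above 0x10FFFF.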

> It sounds like this is the kind of thing that would be useful to you.

I believe this is what the surrogateescape error handler does (up to 0x10FFFF).
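For reference, a short sketch of what surrogateescape does in Python 3 as specified by PEP 383: each byte that cannot be decoded is mapped to a lone surrogate in the range U+DC80..U+DCFF, and the original byte is restored on encoding.  (Encoding arbitrary lone surrogates as UTF-8 is handled by a different handler, surrogatepass.)  The sample bytes are made up for the example.

```python
raw = b"abc\xff"  # 0xFF can never appear in valid UTF-8

# The undecodable byte becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original byte.
round_trip = text.encode("utf-8", "surrogateescape")

print(ascii(text), round_trip)
```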

> Why?  Create something called utf8-extended or utf8-lax or 
> utf8-nonstrict or something.  But you really can't call it UTF-8 and 
> do that. 

That's what we did in Python 3, but on Python 2 it is too late to fix it, especially in a point release.  (Just to clarify, I don't think any of these things will be fixed in 2.7.  There won't be any 2.8, and major changes, especially backwards-incompatible ones, are unlikely to happen in a point release such as 2.7.3, so it's better to focus on Python 3.  Minor bug fixes can still be done even in 2.7 though.)

> Perl defaults to the relaxed one, which gives warnings not exceptions,
> if you do things like setting PERLUNICODE to S or SD and such for the
> default I/O encoding.  If you actually use "UTF-8" as the encoding on 
> the stream, though, you get the version that gives exceptions 
> instead.

In Python we don't usually use warnings for this kind of thing (also we don't have things like "use strict").

> I don't imagine most of the Python devel team knows Perl very well,
> and maybe not even Java or ICU.  So I get the idea that there isn't 
> as much awareness of Unicode in your team as there tends to be in
> those others.

I would say there are at least 5-10 Unicode "experts" in our team.  It might be true though that we don't always closely follow what other languages and the Unicode consortium do, but if people report problems we are willing to fix them (so thanks for reporting them!).

> From my point of view, learning from other people's mistakes is a way
> to get ahead without incurring all the learning-bumps oneself, so if
> there's a way to do that for you, that could be to your benefit, and 
> I'm very happy to share some of our blunders so you can avoid them
> yourselves.

While I really appreciate that you are sharing your experience with us, the solution found and applied in Perl might not always be the best one for Python (but it's still good to learn from others' mistakes).
For example, I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things.
Also regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).

History
Date	User	Action	Args
2011-08-14 07:15:09	ezio.melotti	set	recipients: + ezio.melotti, terry.reedy, pitrou, mrabarnett, Arfrever, r.david.murray, tchrist
2011-08-14 07:15:08	ezio.melotti	set	messageid: <1313306108.95.0.490686038302.issue12729@psf.upfronthosting.co.za>
2011-08-14 07:15:08	ezio.melotti	link	issue12729 messages
2011-08-14 07:15:07	ezio.melotti	create