Message 71083 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	gvanrossum
Recipients	gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-13.14:50:27
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<ca471dc20808130750s66654b18ra03b809cd3339117@mail.gmail.com>
In-reply-to	<1218637525.02.0.262767911166.issue3300@psf.upfronthosting.co.za>

Content
On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote: >> I have no strong opinion on the very remaining points you listed, >> except that IMHO encode_rfc2231 with charset=None should not try to >> use UTF8 by default. But someone with more mail protocol skills >> should comment :) > > OK I've come to the realization that DEMANDING ascii (and erroring on > non-ASCII chars) is better for the short term anyway, because we can > always decide later to relax the restrictions, but it's a lot worse to > add restrictions later. So I agree now, should be ASCII. And no, I don't > have mail protocol skills. OK. > The same goes for unquote accepting bytes. We can decide to make it > accept bytes later, but can't remove that feature later, so it's best > (IMHO) to let it NOT accept bytes (which is the current behaviour). OK. >> The bytes > 127 would be translated as themselves; this follows >> logically from how stuff is parsed -- %% and %FF are translated, >> everything else is not. But I don't really care, I doubt there's a >> need. > > Ah but what about unquote (to string)? If it accepted bytes then it > would be a bytes->str operation, and then you need a policy on DEcoding > those bytes. It makes things too complex I think. OK. >> I believe patch 9 still has errors defaulting to strict for quote(). >> Weren't you going to change that? > > I raised it as a concern, but I thought you overruled on that, so I left > it as errors='strict'. What do you want it to be? 'replace'? Now that > this issue has been fully discussed, I'm happy with whatever you decide. I'm OK with replace for unquote(), your point that bogus data is better than an exception is well taken, especially since there are calls that the app can't control (like in cgi.py). For quote() I think strict is better -- it can't fail anyway with UTF8, and if an app passes an explicit conversion it'd be pretty stupid to pass a string that can't be converted with that encoding (since it's presumably the app that generates both the string and the encoding) so it's better to complain there, just like if they made the encode() call themselves with only an encoding specified. This means we have a useful analogy: quote(s, e) == quote(s.encode(e)). >> From looking at it briefly I >> worry that the implementation is pretty slow -- a method call for each >> character and a map() call sounds pretty bad. > > Yes, it does sound pretty bad. However, that's the current way of doing > things in both 2.x and 3.x; I didn't change it (though it looks like I > changed a LOT, I really did try to change as little as possible!) Actually, while the Quoter class (the immediat subject of my scorn) was there before your patch in 3.0, it isn't there in 2.x; somebody must have added it in 3.0 as part of the conversion to Unicode or perhaps as part of the restructuring of urllib.py. The 2.x code maps the __getitem__ of a dict over the string, which is much faster. I think we can do much better than mapping a method call. > Assuming it wasn't made _slower_ than before, can we ignore existing > performance issues and treat them as a separate matter (and can be dealt > with after 3.0)? Now that you've spent so much time with this patch, can't you think of a faster way of doing this? I wonder if mapping a defaultdict wouldn't work. > I'm not putting up a new patch now. The only fix I'd make is to add > Antoine's "or 'ascii'" to email/utils.py, as suggested on the review > tracker. I'll make this change along with any other recommendations > after your review. > > (That is Lib/email/utils.py line 222 becomes: > s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii') > ) > > btw this Rietveld is amazing. I'm assuming I don't have permission to > upload patches there (can't find any button to do so) which is why I > keep posting them here and letting you upload to Rietveld ... Thanks! You can't upload patches to the issue that I created, but a better way would be to create a new issue and assign it to me for review. That will work as long as you have a gmail account or a Google Account. I highly recommend using the upload.py script, which you can download from codereview.appspot.com/static/upload.py. (There's also a link to it on the Create Issue page, at the bottom.) I am hoping that in general we will be able to use Rietveld to review patches instead of the bug tracker.

On Wed, Aug 13, 2008 at 7:25 AM, Matt Giuca <report@bugs.python.org> wrote:
>> I have no strong opinion on the very remaining points you listed,
>> except that IMHO encode_rfc2231 with charset=None should not try to
>> use UTF8 by default. But someone with more mail protocol skills
>> should comment :)
>
> OK I've come to the realization that DEMANDING ascii (and erroring on
> non-ASCII chars) is better for the short term anyway, because we can
> always decide later to relax the restrictions, but it's a lot worse to
> add restrictions later. So I agree now, should be ASCII. And no, I don't
> have mail protocol skills.

OK.

> The same goes for unquote accepting bytes. We can decide to make it
> accept bytes later, but can't remove that feature later, so it's best
> (IMHO) to let it NOT accept bytes (which is the current behaviour).

OK.

>> The bytes > 127 would be translated as themselves; this follows
>> logically from how stuff is parsed -- %% and %FF are translated,
>> everything else is not. But I don't really care, I doubt there's a
>> need.
>
> Ah but what about unquote (to string)? If it accepted bytes then it
> would be a bytes->str operation, and then you need a policy on DEcoding
> those bytes. It makes things too complex I think.

OK.

>> I believe patch 9 still has errors defaulting to strict for quote().
>> Weren't you going to change that?
>
> I raised it as a concern, but I thought you overruled on that, so I left
> it as errors='strict'. What do you want it to be? 'replace'? Now that
> this issue has been fully discussed, I'm happy with whatever you decide.

I'm OK with replace for unquote(), your point that bogus data is
better than an exception is well taken, especially since there are
calls that the app can't control (like in cgi.py).

For quote() I think strict is better -- it can't fail anyway with
UTF8, and if an app passes an explicit conversion it'd be pretty
stupid to pass a string that can't be converted with that encoding
(since it's presumably the app that generates both the string and the
encoding) so it's better to complain there, just like if they made the
encode() call themselves with only an encoding specified. This means
we have a useful analogy: quote(s, e) == quote(s.encode(e)).

>> From looking at it briefly I
>> worry that the implementation is pretty slow -- a method call for each
>> character and a map() call sounds pretty bad.
>
> Yes, it does sound pretty bad. However, that's the current way of doing
> things in both 2.x and 3.x; I didn't change it (though it looks like I
> changed a LOT, I really did try to change as little as possible!)

Actually, while the Quoter class (the immediat subject of my scorn)
was there before your patch in 3.0, it isn't there in 2.x; somebody
must have added it in 3.0 as part of the conversion to Unicode or
perhaps as part of the restructuring of urllib.py. The 2.x code maps
the __getitem__ of a dict over the string, which is much faster. I
think we can do much better than mapping a method call.

> Assuming it wasn't made _slower_ than before, can we ignore existing
> performance issues and treat them as a separate matter (and can be dealt
> with after 3.0)?

Now that you've spent so  much time with this patch, can't you think
of a faster way of doing this? I wonder if mapping a defaultdict
wouldn't work.

> I'm not putting up a new patch now. The only fix I'd make is to add
> Antoine's "or 'ascii'" to email/utils.py, as suggested on the review
> tracker. I'll make this change along with any other recommendations
> after your review.
>
> (That is Lib/email/utils.py line 222 becomes:
> s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
> )
>
> btw this Rietveld is amazing. I'm assuming I don't have permission to
> upload patches there (can't find any button to do so) which is why I
> keep posting them here and letting you upload to Rietveld ...

Thanks! You can't upload patches to the issue that *I* created, but a
better way would be to create a new issue and assign it to me for
review. That will work as long as you have a gmail account or a Google
Account. I highly recommend using the upload.py script, which you can
download from codereview.appspot.com/static/upload.py. (There's also a
link to it on the Create Issue page, at the bottom.)

I am hoping that in general we will be able to use Rietveld to review
patches instead of the bug tracker.

History
Date	User	Action	Args
2008-08-13 14:50:31	gvanrossum	set	recipients: + gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-13 14:50:30	gvanrossum	link	issue3300 messages
2008-08-13 14:50:28	gvanrossum	create