Message 180439 - Python tracker



Author	terry.reedy
Recipients	arjennienhuis, benjamin.peterson, christian.heimes, eric.smith, exarkun, ezio.melotti, glyph, gvanrossum, loewis, martin.panter, pitrou, serhiy.storchaka, terry.reedy, uau, vstinner
Date	2013-01-22.23:34:32
Message-id	<1358897672.74.0.323995453244.issue3982@psf.upfronthosting.co.za>

Content
>it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.

I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:

from timeit import timeit

for size in (0, 100, 1000, 10000, 100000):
    a = 'a' * size  # ASCII-only str of the given length
    print(timeit("a.encode()", "from __main__ import a"))
>>> 
0.19305401378265558
0.22193721412302575
0.2783227054755883
0.677596406192696
7.124387897799184

Because timeit defaults to number=1000000 runs, each total in seconds is also the time in microseconds per encoding. Of note: the copying of bytes does not double the total time until there are a few thousand chars. Would protocols be using .format for much more than this?
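
To make the "does not double until a few thousand chars" observation concrete, here is a small sketch that divides each total reported above by the empty-string total. The variable names are mine, and the figures are simply copied from the output in this message rather than re-measured.

```python
# Totals copied from the timeit output above: seconds per 1000000 calls,
# which is numerically the same as microseconds per call.
sizes = (0, 100, 1000, 10000, 100000)
totals = (0.19305401378265558, 0.22193721412302575, 0.2783227054755883,
          0.677596406192696, 7.124387897799184)

baseline = totals[0]  # encoding the empty string: essentially call overhead
ratios = {size: total / baseline for size, total in zip(sizes, totals)}

for size, ratio in ratios.items():
    print(f"{size:>6} chars: {ratio:5.2f}x the empty-string time")

# First measured size whose total is at least double the baseline.
first_doubled = next(size for size, ratio in ratios.items() if ratio >= 2)
print("first measured size at or past 2x:", first_doubled)
```

The 1000-char case is still under 2x and the 10000-char case is past it, so the doubling point lies somewhere between those two measurements.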

[If speed is really an issue, we could make binary file/socket write methods aware of the unicode implementation. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]
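
For reference, a minimal sketch of the "format as str, encode once at the end" pattern from the quoted suggestion. The helper name build_request_line, the HTTP-style request line, and the choice of ASCII are illustrative assumptions, not anything proposed in this issue; PY3 stands in for the 'py3' flag mentioned above.

```python
import sys

PY3 = sys.version_info[0] >= 3  # stands in for the 'py3' flag mentioned above


def build_request_line(method, path, version="HTTP/1.1"):
    """Hypothetical helper: do all formatting on str, convert to bytes last."""
    out = "{} {} {}\r\n".format(method, path, version)
    if PY3:
        # Single conversion step after formatting; ASCII is an assumption
        # about what the protocol allows on the wire.
        out = out.encode("ascii")
    return out


line = build_request_line("GET", "/index.html")
print(line)
```

A line like this is a few dozen characters, which is well inside the range where the timings above show the encode step costing little more than the call overhead.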
