classification
Title: support .format for bytes
Type: enhancement
Stage:
Components: Interpreter Core
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, arjennienhuis, barry, benjamin.peterson, brett.cannon, christian.heimes, durin42, ecir.hana, eric.smith, exarkun, ezio.melotti, flox, glyph, gregory.p.smith, haypo, loewis, nlevitt@gmail.com, pitrou, serhiy.storchaka, stendec, terry.reedy, tshepang, uau, vadmium
Priority: normal
Keywords:

Created on 2008-09-27 15:50 by benjamin.peterson, last changed 2014-01-04 06:45 by tshepang.

Files
File name Uploaded Description
byte_format.py terry.reedy, 2013-10-09 00:13 Imitate str.format with bytes function
Messages (88)
msg73931 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-27 15:50
I'm just working on porting some networking code from 2.x to 3.x and it
heavily uses string formatting. Since bytes don't support any kind of
formatting, it's becoming tedious and inelegant to do it with "+". Can
.format be supported in bytes?

[I understand format is implemented with stringlib so shouldn't it be
fairly easy to implement?]
msg73935 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2008-09-27 17:33
Yes, it would be easy to add. Maybe bring this up on python-dev (or
python-3000) to get consensus?

Are we in feature freeze for 3.0?
msg73936 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-27 17:35
On Sat, Sep 27, 2008 at 12:33 PM, Eric Smith <report@bugs.python.org> wrote:
>
> Eric Smith <eric@trueblade.com> added the comment:
>
> Yes, it would be easy to add. Maybe bring this up on python-dev (or
> python-3000) to get consensus?

Yes, that will have to be done.
>
> Are we in feature freeze for 3.0?

Unfortunately, yes.
msg73937 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-27 17:35
I'm skeptical. What networking code specifically are you using, and what
specifically does it use string formatting for?
msg73938 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-27 17:39
On Sat, Sep 27, 2008 at 12:35 PM, Martin v. Löwis
<report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I'm skeptical. What networking code specifically are you using, and what
> specifically does it use string formatting for?

I'm working on the tests for ftplib. [1] The dummy server uses string
formatting to build responses.

[1] http://svn.python.org/view/python/trunk/Lib/test/test_ftplib.py?view=markup
msg73939 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-27 18:42
> I'm working on the tests for ftplib. [1] The dummy server uses string
> formatting to build responses.

I see. I propose to add a method push_string, defined as

  def push_string(self, s):
      self.push(s.encode("ascii"))

In FTP, the responses are, by definition, ASCII-encoded strings.
The proper way to generate them is to make a string, then encode it.
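
A minimal sketch of the push_string idea, assuming a stand-in handler class (DummyHandler and its sent list are hypothetical, not the actual test_ftplib code): the reply is built as text and encoded once, at the point where it is handed to push().

```python
class DummyHandler:
    """Hypothetical stand-in for the dummy FTP server's handler."""

    def __init__(self):
        self.sent = []  # records the bytes that would go on the wire

    def push(self, data):
        # A real push() would write bytes to a socket; here we only record them.
        self.sent.append(data)

    def push_string(self, s):
        # Build the reply as text, then encode once at the boundary.
        self.push(s.encode("ascii"))


handler = DummyHandler()
handler.push_string("{code} {message}\r\n".format(code=220, message="ready"))
```

This only works while every reply is pure ASCII text; encode("ascii") raises UnicodeEncodeError otherwise.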
msg74019 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-29 10:22
I don't think that b'...'.format() is a good idea. Programmers will 
continue to mix characters and bytes since .format() targets 
characters.
msg74021 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2008-09-29 10:50
> I don't think that b'...'.format() is a good idea. Programmers
> will continue to mix characters and bytes since .format() targets
> characters.

b''.format() would return bytes, not a string. This is also how it works
in 2.6.

I'm also not sold on implementing it, although it would be easy and I
can see a few uses for it. I think Martin's suggestion of encoding back
to ascii might be the best thing to do (that is, don't implement
b''.format()).
msg74022 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-29 10:56
> I think Martin's suggestion of encoding back to ascii might be 
> the best thing to do

As I understand, you would like to use bytes as characters, like 
b'{code} {message}'.format(code=100, message='OK'). So why not use 
explicit conversion to ASCII? ftp='{code} {message}'.format(code=100, 
message='OK').encode('ASCII').

If you need to work on bytes, it means that you will use the full 
range 0..255 whereas ASCII rejects bytes in 128..255.
msg74050 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-29 21:33
>> I think Martin's suggestion of encoding back to ascii might be 
>> the best thing to do
> 
> As I understand, you would like to use bytes as characters, like 
> b'{code} {message}'.format(code=100, message='OK'). So why not use 
> explicit conversion to ASCII? ftp='{code} {message}'.format(code=100, 
> message='OK').encode('ASCII').

That's indeed exactly what I had proposed - only that you shouldn't
repeat the .encode('ascii') all over the place, but instead wrap that
into a function (which I proposed to call push_string, along with the
existing .push function).
msg84121 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2009-03-24 23:28
loewis> That's indeed exactly what I had proposed 
loewis> - only that you shouldn't repeat the .encode('ascii') 
loewis>  all over the place, (...)

If you can only use bytes 0..127, it cannot be used for binary protocols 
and so I don't think that it's really useful. If your protocol is 
ASCII text, use explicit conversion to ASCII.

I'm also not a fan of functions having different result types 
(format->bytes or str, it depends...).
msg84123 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-03-24 23:37
> I'm also not a fan of functions having different result types
> (format->bytes or str, it depends...).

In 3.x, str.format() and bytes.format() would be two different methods
on two different objects. I don't think there's any expectation that
they have the same return type. There's no such expectation for
str.strip() and bytes.strip() either.

Similarly, in 2.6, str.format() has a different return type than
unicode.format().

Now the builtin format() function is another issue. In 2.6 the return
type does depend on the types of the arguments. In 3.x, I'd suggest
leaving it as unicode and you won't be allowed to pass in bytes.
msg90421 - (view) Author: Arjen Nienhuis (arjennienhuis) Date: 2009-07-11 13:54
There are many binary formats that use ASCII numbers.

'HTTP chunking' uses ASCII mixed with binary (octets).

With 2.6 you could write:

def chunk(block):
    return b'{0:x}\r\n{1}\r\n'.format(len(block), block)

With 3.0 you'd have to write this:

def chunk(block):
    return format(len(block), 'x').encode('ascii') + b'\r\n' + block + b'\r\n'

You cannot convert to ascii at the end of the pipeline as there are
bytes > 127 in the data blocks.
msg90423 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-07-11 15:52
> def chunk(block):
>     return format(len(block), 'x').encode('ascii') + b'\r\n' + block + b'\r\n'
> 
> You cannot convert to ascii at the end of the pipeline as there are
> bytes > 127 in the data blocks.

I wouldn't write it in such a complicated way. Instead, use

def chunk(block):
   return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'

This doesn't need any format call, and describes adequately how the
protocol works: send an ASCII-encoded hex length, send CRLF, send
the block, then send another CRLF. Of course, I would probably write
that into the socket right away, rather than copying it into a different
bytes object first.
msg90425 - (view) Author: Arjen Nienhuis (arjennienhuis) Date: 2009-07-11 16:28
> def chunk(block):
>   return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'

hex(10) returns '0xa' instead of 'a'.

> This doesn't need any format call, and describes adequately how the
> protocol works: send an ASCII-encoded hex length, send CRLF, send
> the block, then send another CRLF. Of course, I would probably write
> that into the socket right away, rather than copying it into a different
> bytes object first.

The point is that you need to convert to ascii for each int that you send.
You cannot just wrap the socket with an encoding. This makes porting
difficult.
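
The hex() pitfall and the chunk framing discussed above can be sketched as follows. This is an illustrative Python 3 version, assuming the chunk() name from the earlier messages; it is not code from any particular HTTP library.

```python
def chunk(block):
    # format(n, "x") yields bare hex digits; hex(n) would prepend "0x".
    size = format(len(block), "x").encode("ascii")
    # Only the length is encoded; the block itself stays untouched bytes.
    return size + b"\r\n" + block + b"\r\n"


data = b"\xff" * 10          # binary payload with bytes > 127
encoded = chunk(data)
```

Note that the ASCII conversion happens per integer, before concatenation, because the payload cannot pass through an ASCII encoder.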
msg90428 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-07-11 16:47
> hex(10) returns '0xa' instead of 'a'.

Ah, right. So I would still use

   '{0:x}'.format(100).encode("ascii")

rather than the builtin format function. Actually, I would
probably use

  ('%x' % len(block)).encode("ascii")

> The point is that you need to convert to ascii for each int that you send.
> You cannot just wrap the socket with an encoding. This makes porting
> difficult.

This I don't understand. What porting becomes more difficult?
From 2.x to 3.x? Why do you have any .format calls in your code that you
want to port - .format was only added in 2.6, so if you want to support
2.x, you surely are not using .format, are you?
msg127210 - (view) Author: Uoti Urpala (uau) Date: 2011-01-27 18:54
This kind of formatting is needed quite often when working on network protocols or file formats, and I think the replies here fail to address important issues. In general you can't encode after formatting, as that doesn't work with binary data, and often it's not appropriate for the low-level routines doing the formatting to know what charset the data is in even if it is text (so it should be fed in already encoded as bytes). The replies from Martin v. Löwis seem to argue that you could use methods other than formatting; that would work almost as well as an argument to remove formatting support from text strings, and IMO cases where formatting is the best option are common.

Here's an example (based on real use but simplified):

template = b"""
stuff here
header1: {}
header2: {}
more stuff
"""

def lowlevel_send(s, b1, b2):  # s socket, b1 and b2 bytes
    s.send(template.format(b1, b2))

To clarify the requirements a bit, the issue is not so much about having a .format method on byte string objects (that's just the most natural-looking way of solving it); the core requirement is to have a formatting operator that can take byte strings as *arguments* and produce byte string *output* where the arguments can be placed unchanged.
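
To illustrate why "format as text, then encode" does not cover this case, here is a small Python 3 sketch (the payload value is made up for the example). Formatting a bytes object into a str embeds its repr, and decoding arbitrary binary data as ASCII fails.

```python
payload = b"\xff\xfe"  # binary data, not valid ASCII text

# Formatting bytes into a str inserts the repr, not the raw octets.
text = "header: {}".format(payload)
wire = text.encode("ascii")

# Decoding the payload first is not an option either.
try:
    payload.decode("ascii")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

Here wire contains the characters of the repr (b, quote, backslash escapes) rather than the two original bytes.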
msg130215 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-03-07 00:47
For future reference, struct.pack, not mentioned here, is a binary bytes formatting function. It can mix ascii bytes with binary octets. It works the same in Python 2 and 3.

Str.format does two things: it converts objects to strings according to the contents of field specifiers, and it interpolates the resulting strings into a template string according to the locations of the field specifiers. If the desired bytes represent encoded text, then encoding computed text is the obvious Py3 solution.

For some mixed ascii-binary uses, struct.pack is not as elegant as a bytes.format might be. But I think such a method should use struct format codes within field specifiers to convert objects into binary bytes rather than text.
msg130253 - (view) Author: Arjen Nienhuis (arjennienhuis) Date: 2011-03-07 12:34
struct.pack does not work with variable length data. Something like:

b'{0:x}\r\n{1}\r\n'.format(len(block), block)

or

b'%x\r\n%s\r\n' % (len(block), block)

is not possible with struct.pack
msg130284 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-03-07 19:09
You are right, I misinterpreted the meaning of 's' without a count (and opened #11436 to clarify). However, for the fairly common case where a variable-length binary block is preceded by a 4 byte *binary* count, one can do something which is not too bad:

>>> block = b'lsfjdlksaj'
>>> n=len(block)
>>> struct.pack('I%ds'%n, n, block)
b'\n\x00\x00\x00lsfjdlksaj'

If leading blanks are acceptable for your example with count as ascii hex digits, one can do something that I admit is worse:

>>> struct.pack('10s%ds2s'%n, ('%8x\r\n'%n).encode(), block, b'\r\n')
b'       a\r\nlsfjdlksaj\r\n'

Of course, for either of these in isolation, I would probably only use .pack for the binary conversion and otherwise use '+' or b''.join(...).
msg163369 - (view) Author: Uoti Urpala (uau) Date: 2012-06-21 21:21
I've hit this limitation a couple more times, and none of the proposed workarounds are adequate. Working with protocols and file formats that use human-readable markup is significantly clumsier than it was with Python 2 (using either the % operator, which also lost its support for byte strings in Python 3, or .format()).

This bug report was closed by its original creator, after early posts where IMO nobody made as good a case for the feature as they could have. Is it possible to reopen this bug or is it necessary to file a new one?

Is there any clear argument AGAINST having .format() for bytes, other than work needed to implement it? Some posts mention "mixing characters and bytes", but I see no reason why this would be much of a real practical concern if it's a method on bytes objects producing bytes output.
msg163379 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-06-21 23:41
If you want to discuss this issue further, I think you should post to the python-ideas list with concrete examples.
msg171791 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2012-10-02 12:05
Since Benjamin originally requested this feature, and then decided that he could accomplish his desired goal (ftplib porting, as far as I can tell) without it, I think that the "rejected" status is actually incorrect.  I think that Benjamin just wanted to indicate that he no longer needed the feature.  This doesn't mean that no one else will need the feature, and as it turns out the comments seem to reveal that other people do need the feature (also, I need the feature).

So, adjusting the ticket metadata to reflect that this is a valid feature request just waiting for someone to implement it, not a rejected idea that is not welcome in Python.
msg171795 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-10-02 12:40
The proposal sounds like a good idea to me.

Benjamin, what needs to be done to implement the feature?
msg171796 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 13:08
Formatting is a very complicated part of Python (especially after Victor's optimizations). I think no one wants to maintain this code for a long time. The price of maintaining exceeds the potential very limited benefits from the use.
msg171799 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2012-10-02 13:16
I was just logging in to make this point, but Serhiy beat me to it. When I wrote several years ago that this was "easy", it was before the (awesome) PEP 393 work. I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.

So I think Martin's advice to just encode to ascii is the best course of action.
msg171800 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2012-10-02 13:18
> The price of maintaining exceeds the potential very limited benefits from the use.

The "very limited benefits" of being able to write I/O code without roughly 3 times code bloat?  Perhaps for people who don't write code that does non-trivial I/O, but for the rest of us the benefits are pretty significant.

> I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.

The implementation may be difficult, therefore no one should attempt it?
msg171801 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2012-10-02 13:22
> The implementation may be difficult, therefore no one should attempt it?

The development cost and maintenance cost is surely part of the evaluation when deciding whether to implement a feature, no?
msg171803 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2012-10-02 13:38
> The development cost and maintenance cost is surely part of the evaluation when deciding whether to implement a feature, no?

Sure, but in an open source project where almost all contributions are done by volunteers (ie, donated), what is the development cost?
msg171804 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 13:55
> I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.

This is not all. The usage model will be completely different too.

* The default formatting should not use str(), but buffer protocol.
* There is no place for floating point.
* There is no place for locale.
* There is no place for 'r' conversion (possible only for 'a').
* It should include the features of struct.pack(), int.to_bytes() and ctypes.
* Padding should be not only by space, but also by zeros (and possibly by other values).
* Alignment (padding to position divisible by some number).
* In addition to padding and truncating should be the ability to raise an exception in case of discrepancy between the needed and actual lengths.
* Attribute access and indexing are unlikely to be needed.
* Builtin format() should not work with this.

As a result, this should be a completely separate formatting mini-language that shares nothing with string formatting. It is not worth introducing bytes.format(); that would just be confusing. Perhaps you should add features to the struct module or add a new module. PyPI looks like a good place for such experiments. If people use it, it could be included in the stdlib.
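
Two of the items in the list above (padding with arbitrary bytes, and raising on a length mismatch) can already be approximated with existing bytes methods. A rough sketch, where fixed_field is a hypothetical helper name and not an existing API:

```python
def fixed_field(data, width, pad=b"\x00"):
    """Hypothetical helper: pad data to exactly `width` bytes, or fail loudly."""
    if len(data) > width:
        # Refuse to truncate silently when the data does not fit.
        raise ValueError("data is %d bytes, field holds %d" % (len(data), width))
    # bytes.ljust accepts a single fill byte, so padding is not limited to spaces.
    return data.ljust(width, pad)


name = fixed_field(b"abc", 8)          # zero-padded to 8 bytes
spaced = fixed_field(b"abc", 5, b" ")  # space-padded to 5 bytes
```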
msg171806 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-10-02 14:52
As Serhiy suggests, it would be best to collect the use cases for a format-like method for bytes and design something which can meet them. It's definitely a PEP.
msg171815 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-10-02 16:47
In 3.3+, somestring.encode('ascii') is a small constant-time operation. So for pure ascii *text* bytes, that seems the appropriate 3.x approach.

I agree that something else should be used for binary formatting. Perhaps struct.pack could be extended to work with variable-length data the way I thought it already did. Otherwise, it already *is* the binary formatting method.
msg171816 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-10-02 16:48
It's not constant time.
msg171821 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-10-02 18:31
Sorry, I was thinking of something else. Encoding ascii-only text is merely much faster (3x?) than in 3.2- because it directly copies without using the codec.
msg171824 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 18:49
> Sorry, I was thinking of something else. Encoding ascii-only text is merely
> much faster (3x?) than in 3.2- because it directly copies without using
> the codec.

In 3.3 encoding to ascii or latin1 is as fast as memcpy. 12-15x on my computer.
msg180414 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-22 18:05
Twisted still would like to see this.
msg180415 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-01-22 18:24
Implementing this certainly hasn't gotten any easier as 3.x str.format has evolved. The kind of format codes and modifiers wanted for formatting byte strings might be different than those for text strings. I think it probably needs a PEP.
msg180416 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-22 18:27
Would it be easier if the only format codes/types supported were
bytes, int and float?
msg180419 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-01-22 18:47
IMHO a useful API has to provide a more low level functionality like "format number as 32 bit unsigned integer in network endian". A bytes.format() function should support all format chars from http://docs.python.org/3/library/struct.html#format-characters plus all endian and alignment modifiers.
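
For reference, the "32 bit unsigned integer in network endian" case is what the existing struct module expresses with the "!I" format. A short sketch with an arbitrary example value:

```python
import struct

length = 258

# "!" selects network (big-endian) byte order, "I" an unsigned 32-bit integer.
packed = struct.pack("!I", length)

# int.to_bytes produces the same four bytes without a format string.
same = length.to_bytes(4, "big")
```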
msg180420 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-01-22 18:48
The problem is not so much the types allowed as the code for dealing with the format string. The parsing code for format specifiers is pretty unicode specific now. If that were to be made generic again, it's worth considering exactly what features belong in a bytes format method.
msg180423 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-22 19:11
Honestly, what Twisted is mostly after is a way to write code that
works both with Python 2 and Python 3. They need the types I mentioned
only (bytes, int, float) and not too many advanced features of
.format() -- but if it's not called .format() or if the syntax is not
a subset of the syntax of Python 2 format syntax, it's not very useful
for them. (They would have to rewrite every protocol implementation in
their tree to use something different, apparently, since .format() has
proven to be the most efficient way to construct larger byte strings
out of smaller pieces, in Python 2.)
msg180426 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-22 19:16
Given the issues which have been brought here, I agree that it's PEP material.
msg180427 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-22 19:17
Serhiy did a nice summary in msg171804, and I think this is PEP material too.  What he wrote could be used as a starting point; the next step would be collecting use cases (the Twisted guys seem to have some).  Once we have defined what we want we can figure out how to implement it (e.g. how much code can be shared with str.format, if it should be bytes.format or something in the struct module).
msg180430 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-22 19:32
Well, msg171804 makes it a much bigger project than the feature that Twisted actually needs.  Quoting:

* The default formatting should not use str(), but buffer protocol.
Fine.

* There is no place for floating point.
Actually they do need it -- and it's trivial to define, since fp only returns ASCII characters.

* There is no place for locale.
Agreed.

* There is no place for 'r' conversion (possible only for 'a').
Agreed.

* It should include the features of struct.pack(), int.to_bytes() and ctypes.
Not needed.

* Padding should be not only by space, but also by zeros (and possibly by other values).
Not needed.

* Alignment (padding to position divisible by some number).
Not needed.

* In addition to padding and truncating should be the ability to raise an exception in case of discrepancy between the needed and actual lengths.
Not needed.

* Attribute access and indexing are unlikely to be needed.
I don't know, but these features certainly would be well-defined.

* Builtin format() should not work with this.
Fine.

Probably bytes.format() should not try to call v.__format__(); if an extension mechanism is needed it would be called something else, but given the limited set of types needed I think this can be skipped.

The most important requirement from Twisted is actually that it is called .format(), and that the overall format strings look like they did for 8-bit string formatting in Python 2.  In particular b'a{}b{}c'.format(x, y), where x and y are bytes, should be equivalent to b'a' + x + b'b' + y + b'c'.
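
The reduced feature set described above can be sketched as a plain function. This is only an illustration under stated assumptions: format_bytes is a hypothetical name, it handles only bare {} fields (no numbering, names, or format specs), and it accepts only bytes, int and float arguments.

```python
def format_bytes(template, *args):
    """Hypothetical minimal formatter: bare {} fields; bytes, int, float only."""
    parts = template.split(b"{}")
    if len(parts) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    pieces = [parts[0]]
    for arg, tail in zip(args, parts[1:]):
        if isinstance(arg, (bytes, bytearray)):
            pieces.append(bytes(arg))                # inserted unchanged
        elif isinstance(arg, (int, float)) and not isinstance(arg, bool):
            pieces.append(str(arg).encode("ascii"))  # numbers render as ASCII
        else:
            # No implicit str->bytes conversion and no __format__ call.
            raise TypeError("unsupported type: " + type(arg).__name__)
        pieces.append(tail)
    return b"".join(pieces)


x, y = b"\xff", b"\x00"
result = format_bytes(b"a{}b{}c", x, y)
```

With bytes arguments the result matches plain concatenation, which is the equivalence stated above.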
msg180431 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-22 19:39
Right, but we're not writing builtin type methods specifically for Twisted. I agree with the idea that the feature set should be very limited, actually perhaps more limited than what you just said. For example, I think any kind of implicit str->bytes conversion is a no-no (including the "r" and "a" format codes).

Still, IMO even a simple feature set warrants a PEP, because we want to devise something that's generally useful, not just something which makes porting easier for Twisted.

I also kind of expect Twisted to have worked around the issue before 3.4 is out, anyway.
msg180432 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-22 19:59
On Jan 22, 2013, at 11:39 AM, Antoine Pitrou <report@bugs.python.org> wrote:

> Antoine Pitrou added the comment:
> 
> I agree with the idea that the feature set should be very limited, actually perhaps more limited than what you just said. For example, I think any kind of implicit str->bytes conversion is a no-no (including the "r" and "a" format codes).

Twisted doesn't particularly need str->bytes conversion in this step, implicit or otherwise, so I have no problem with leaving that out.

> Still, IMO even a simple feature set warrants a PEP, because we want to devise something that's generally useful, not just something which makes porting easier for Twisted.

Would it really be so bad to add features that would make porting Twisted easier?  Even if you want porting Twisted to be as hard as possible, there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes.  Twisted itself is a good proxy for this class of application; I really don't think that this is overly specific.

> I also kind of expect Twisted to have worked around the issue before 3.4 is out, anyway.

The problem is impossible to work around in the general case.  While we can come up with clever workarounds for things internal to buffering implementations or our own protocols, Twisted exposes an API that allows third parties to write protocol implementations, which quite a few people do.  Every one of those implementations (and every one of Twisted's internal implementations, none of which are ported yet, just the core) faces a series of frustrating implementation choices where the "old" style of b'x' % y or b'x'.format(y) resulted in readable, efficient value interpolation into protocol messages, but the "new" style of b''.join([b'x1', y_to_bytes(y), b'x2']) requires custom functions, inefficient copying, redundant bytes<->text transcoding, and harder-to-read protocol framing literals.  This interacts even more poorly with oddities like bytes(int) returning zeroes now, so there's not even a reasonable 2<->3 compatible way of, say, setting an HTTP content-length header; b'Content-length: {}\r\n'.format(length) is now b''.join([b'Content-length: ', (bytes if bytes is str else str)(length).encode('ascii'), b'\r\n']).

This has negative readability, performance, and convenience implications for the code running on both 2.x and 3.x and it would be really nice to see fixed.  Honestly, it would still be a porting burden to have to use .format(); if you were going to do something _specifically_ to help Twisted, the thing to do would be to make both .format and .__mod__ work; most of our protocol code currently uses % to do its formatting.  However, upgrading to a "modern" API is not an insurmountable burden for Twisted, and I can understand the desire to trade off that work for the simplicity of having less code to maintain in Python core (and less to write for this feature), as long as the "modern" API is actually functional enough to make very common operations close to equivalently convenient.
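
The Content-length case mentioned above, written for Python 3 only, looks roughly like this (content_length_header is a hypothetical name used for illustration). It also shows the bytes(int) behaviour that makes the obvious spelling wrong.

```python
def content_length_header(length):
    # bytes(length) would produce `length` zero bytes on Python 3,
    # so the number has to go through str and an explicit ASCII encode.
    return b"".join([b"Content-length: ", str(length).encode("ascii"), b"\r\n"])


header = content_length_header(42)
zeros = bytes(3)  # three NUL bytes, not b"3"
```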
msg180433 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-22 20:13
>  there are plenty of other Python applications that don't use Twisted
> which nevertheless need to emit formatted sequences of bytes.

The fact that "there are plenty of other Python applications that don't
use Twisted which nevertheless need to emit formatted sequences of
bytes" is *precisely* a good reason for this to be discussed more
visibly. Even if it isn't a PEP, it will still benefit from being a
python-dev or python-ideas discussion. We are talking about a method on
a prominent built-in type, not some additional function or method in an
obscure module.

> > I also kind of expect Twisted to have worked around the issue before
> 3.4 is out, anyway.
> 
> The problem is impossible to work around in the general case.

I'm not sure what the "general case" is. What I know from Twisted is
there are many specific cases where, indeed, binary protocol strings are
formed by string formatting, e.g. in the FTP implementation (and for
good reason since those protocols are either ASCII or an ASCII
superset). As a workaround, it would probably be reasonable to make
these protocols use str objects at the heart, and only convert to bytes
after the formatting is done.

> This has negative readability, performance, and convenience
> implications for the code running on both 2.x and 3.x and it would be
> really nice to see fixed.

Code running on both 2.x and 3.x will *by construction* have some
performance pessimizations inside it. It is inherent to that strategy.
Not saying this is necessarily a problem, but you should be aware of it.

>   Honestly, it would still be a porting burden to have to
> use .format(); if you were going to do something _specifically_ to
> help Twisted, the thing to do would be to make both .format
> and .__mod__ work; most of our protocol code currently uses % to do
> its formatting.

I know that :-)
msg180436 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-01-22 21:46
2013/1/22 Guido van Rossum <report@bugs.python.org>:
> Twisted still would like to see this.

Sorry, but this argument doesn't convince me. A better argument is
that bytes+bytes+...+bytes is inefficient: it creates a lot of
temporary objects instead of computing the final size directly, or
using realloc.

str%args and str.format() use realloc() and overallocate their
internal buffer to avoid too many calls to realloc().
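
Short of a formatting method, the two usual ways to avoid a chain of temporaries from repeated + are b"".join and an in-place bytearray. A small sketch (function names are illustrative; the allocation behaviour described in the comments is that of CPython as far as I know):

```python
def build_with_join(pieces):
    # join can size the result once and copy each piece a single time.
    return b"".join(pieces)


def build_with_bytearray(pieces):
    # A bytearray grows in place, so appends do not create a new
    # intermediate object for every "+".
    buf = bytearray()
    for piece in pieces:
        buf += piece
    return bytes(buf)


pieces = [b"HEAD ", b"\xff\x00", b"\r\n"]
joined = build_with_join(pieces)
appended = build_with_bytearray(pieces)
```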
msg180437 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-22 22:51
On Jan 22, 2013, at 1:46 PM, STINNER Victor <report@bugs.python.org> wrote:

> 2013/1/22 Guido van Rossum <report@bugs.python.org>:
>> Twisted still would like to see this.
> 
> Sorry, but this argument doesn't convince me. A better argument is
> that bytes+bytes+...+bytes is inefficient: it creates a lot of
> temporary objects instead of computing the final size directly, or
> using realloc.

Uh, yes.  That's one of the reasons (given above) that Twisted would still like to see this.  It seemed to me that Guido was stating a fact there, not making an argument.  The Twisted project *would* like to see this, I can assure you, regardless of whether you're convinced or not :).

> str%args and str.format() use realloc() and overallocate their
> internal buffer to avoid too many calls to realloc().

More importantly, it's fairly easy to add many optimizations of this type to an API in the style of .format(), even if it's not present in the first round; optimizing bytes + bytes + bytes requires slightly scary interactions with refcounting and potentially GC, like the += optimization.  The API just has more information to go on, and that's a good thing.
msg180439 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-01-22 23:34
>it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.

I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:

from timeit import timeit

for size in (0, 100, 1000, 10000, 100000):  # avoid shadowing the builtin len
    a = 'a' * size
    print(timeit("a.encode()", "from __main__ import a"))
>>> 
0.19305401378265558
0.22193721412302575
0.2783227054755883
0.677596406192696
7.124387897799184

timeit runs each statement 1000000 times by default, so these totals in seconds are also microseconds per encoding. Of note: 
the copying of bytes does not double the total time until there are a few thousand chars. Would protocols be using .format for much more than this?

[If speed is really an issue, we could make binary file/socket write methods unicode implementation aware. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]
msg180441 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 00:59
> Antoine Pitrou added the comment:
> The fact that "there are plenty of other Python applications that don't
> use Twisted which nevertheless need to emit formatted sequences of
> bytes" is *precisely* a good reason for this to be discussed more
> visibly.

I don't think anyone is opposing discussing it.  I don't personally think such a discussion would be useful, lots of points of view are represented on this ticket, but please feel free to raise it in whatever forum that you feel would be helpful.  (Even if I did object to that I don't see how I could stop you :)).

> I'm not sure what the "general case" is.

The "general case" that I'm referring to is the case of an application writing some protocol logic in terms of constructing some bytes objects and passing them to Twisted.  In other words, Twisted relied upon Python to provide a convenient way to assemble your bytes into protocol messages, and that was removed in 3.x.  We never provided one ourselves and I don't think it would be a particularly good idea to build that kind of basic string-manipulation functionality into Twisted rather than Python.

> What I know from Twisted is there are many specific cases where, indeed,
> binary protocol strings are formed by string formatting, e.g. in the FTP
> implementation (and for good reason since those protocols are either ASCII
> or an ASCII superset).

These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor are they an "ASCII superset"); they are ASCII commands interspersed with binary data.  It makes sense to treat them as bytes, not text.  In many cases - such as when expressing a length, or a checksum - you _must_ treat them as bytes, or you will emit incorrect data on the wire.  By the time you're dealing with text - if you ever are - you're already somewhere in the body of the protocol, decorated with appropriate metadata.

But my point about the "general case" is that when implementing a *new* protocol with ASCII commands, or maintaining an existing one, bytes-object formatting is a convenient, expressive and performant way to express the interpolation of values in the protocol stream.

> As a workaround, it would probably be reasonable to make
> these protocols use str objects at the heart, and only convert to bytes
> after the formatting is done.

Protocols like SMTP (c.f. "8-bit MIME") and HTTP put binary data in-line; do you suggest that gzipped content be encoded as latin1 so it can squeeze into python 3's str type?  I thought the whole point of the porting pain here was to get a clean separation between bytes and text.  This is exactly why I do not particularly want bytes.format() to allow the presence of strs as formatted values, although that *would* make porting certain things easier.  It makes sense to do your encoding first, then interpolate.

> Code running on both 2.x and 3.x will *by construction* have some
> performance pessimizations inside it. It is inherent to that strategy.
> Not saying this is necessarily a problem, but you should be aware of it.

This is certainly true *now*, but it doesn't necessarily have to be.  Enhancements like this one could make this performance division go away.  In any case, the reason that ported code suffers from a performance penalty is because python 3 has no efficient way of doing this type of bytes construction; even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.
msg180442 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 01:03
On Jan 22, 2013, at 3:34 PM, Terry J. Reedy <report@bugs.python.org> wrote:

> I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:

I'm glad that this operation has been optimized, but treating blocks of protocol data as text is a hackish workaround that still doesn't perform as well (even on 3.3+) as bytes formatting in 2.7.

> [If speed is really an issue, we could make binary file/socket write methods unicode implementation aware. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]

Yes, speed is really an issue - this kind of message construction is on the critical path of many of the more popular protocols implemented with Twisted.  But trying to work around the performance issue by pretending that strings are bytes will just give new life to old bugs.  We've been loudly rejecting unicode from sockets I think for as long as Python has had unicode, and that's the way it should remain.
msg180445 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-23 07:11
Le mardi 22 janvier 2013 à 23:34 +0000, Terry J. Reedy a écrit :
> Terry J. Reedy added the comment:
> 
> >it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.
> 
> I presume this would mean adding 'if py3: out = out.encode()' after
> the formatting. As I said before, this works much better in 3.3+ than
> in 3.2-.

So what? We're discussing a feature that, at best, will be present in
3.4 and not before.
msg180446 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-23 07:27
> > What I know from Twisted is there are many specific cases where, indeed,
> > binary protocol strings are formed by string formatting, e.g. in the FTP
> > implementation (and for good reason since those protocols are either ASCII
> > or an ASCII superset).
> 
> These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor
> are they an "ASCII superset"); they are ASCII commands interspersed
> with binary data.

The "ASCII superset commands" part is clearly separated from the "binary
data" part. Your own LineReceiver is able to switch between "raw mode"
and "line mode"; one is text and the other is binary.

> In many cases - such as when expressing a length, or a checksum - you
> _must_ treat them as bytes, or you will emit incorrect data on the
> wire.

This is a non-sequitur. You can fully well take the len() of some
*binary* data, format it using "%d" in a *string* Content-Length header,
then encode the headers using utf-8 (or whatever encoding scheme the
protocol mandates). Then at the end you concatenate the encoded headers
and the body. I'm sure you're already doing the moral equivalent of
this, except that the encoding step is absent.

So, yes, it is reasonably possible, and it even makes sense.
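
A minimal sketch of the approach described above, assuming an HTTP-like response whose headers are pure ASCII; the function name build_response and the header layout are illustrative only:

```python
def build_response(body: bytes) -> bytes:
    """Format the headers as text, encode them, then append the raw body."""
    # len() is taken on the *binary* body, so the count is in bytes.
    headers = "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body)
    # Assumption for this sketch: headers contain only ASCII characters.
    return headers.encode("ascii") + body


response = build_response(b"\x1f\x8b\x08 arbitrary binary payload")
```

The body is never decoded, so 8-bit data in it passes through untouched.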

> This is exactly why I do not particularly want bytes.format() to allow
> the presence of strs as formatted values, although that *would* make
> porting certain things easier.

At this point, I would remind you that I'm not against bytes.format(),
but I'd like it to be discussed in the open rather than on the bug tracker.

And, yes, starting that discussion is, IMO, the proponents' job :-)

> even disregarding compatibility with a 2.x codebase, b''.join() and
> b'' + b'' and (''.format()).encode('charmap') are all slower _and_
> more awkward than simply b''.format() or b''%.

How can existing constructions be slower than non-existing constructions
that don't have performance numbers at all?

Besides, if b''.join() is too slow, it deserves to be improved. Or
perhaps you should try bytearray instead, or even io.BytesIO.
msg180447 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-01-23 07:29
After re-reading everything, I have somewhat changed my mind on this proposal. Perhaps 3.0 threw out too much, making it overly difficult to do some things that were easy in 2.x and to write cross-version code.

String formatting converts all arguments to strings, using str as the default converter, but gives particular attention to formatting ints and floats. It then interpolates the resulting strings into the template string. Until msg180430, posted just half a day ago, I did not see a coherent idea of what bytes.format should be. The main problem is that there is no general bytes converter equivalent to str. I believe this is the core reason bytes.format was eliminated in 3.0.

Much of the discussion here and elsewhere has been about str.format + additions, where the additions would accommodate various possible conversions. But I now see that this was trying to do too much. Guido's subset proposal cuts this all out by proposing to only convert ints and floats as done in 2.x. So bytes.format would only convert ints and floats and otherwise would interpolate bytes into a bytes template. This should cover a large fraction of use cases. The user would be responsible for converting anything else, or converting ints and floats otherwise, with explicit calls to bytes, str.encode, struct.pack, or custom functions*.

I believe only two changes are needed to the specification of str.format, other than the obvious things like prefixing strings with 'b' and changing 'fill character' to 'fill byte'.  Since general conversion would not be done, the '! conversion' field would be eliminated. In the format specifier, the default 's' would mean that the corresponding argument must be a bytes object, rather than any object converted by str.

# possible portability function for 'other' classes:

if py2: strb = str
else:
  def strb(ob): return str(ob).encode()
msg180448 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-01-23 07:31
I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)
msg180449 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-23 08:02
I don't believe it either. I find join consistently faster than format:

python2.7 -m timeit -s 'x = [b"x"*1000]*10' 'b"".join(x)'
1000000 loops, best of 3: 0.686 usec per loop

python2.7 -m timeit -s 'x = b"x"*1000' \
  '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.37 usec per loop

Try longer strings, same results (though less pronounced):

python2.7 -m timeit -s 'x = [b"x"*10000]*10' 'b"".join(x)'
100000 loops, best of 3: 3.54 usec per loop

python2.7 -m timeit -s 'x = b"x"*10000' \
  '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 7.35 usec per loop

I'm guessing the advantage of format() is that it allows the
occasional formatting of a float or int.

And % is not significantly faster:

python2.7 -m timeit -s 'x = b"x"*1000' \
  '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.31 usec per loop

python2.7 -m timeit -s 'x = b"x"*10000' \
  '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 6.81 usec per loop

python2.7 -m timeit -s 'x = b"x"*100000' \
  '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
1000 loops, best of 3: 565 usec per loop
msg180452 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-01-23 09:36
I think ''.join() will always be faster than ''.format(), for a number of reasons (some already stated):
- it doesn't have to parse the format string
- it doesn't have to do the __format__ lookup and call the resulting function (although I believe there's an optimization for str)
- it doesn't have to consider the conversion and formatting steps

Whether b''.format() would have to lookup and call __format__ remains to be seen. From what I've read, maybe baking in knowledge of bytes, float, and int would be good enough. I suspect there might be some need for datetimes, but I could be wrong.

The above said, code using b''.format() would definitely be easier to write and understand than a lot of individual field formatting followed by a .join().
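
To illustrate the readability point, here is a rough sketch of what "individual field formatting followed by a .join()" looks like on Python 3 today. The function name chunk_header and the message layout are made up for the example:

```python
def chunk_header(name: bytes, size: int, ratio: float) -> bytes:
    """Assemble b'<name> <size> <ratio>\\r\\n' without bytes formatting."""
    parts = [
        name,                              # bytes pass through unchanged
        b" ",
        str(size).encode("ascii"),         # int rendered as decimal digits
        b" ",
        ("%.2f" % ratio).encode("ascii"),  # float formatted as text first
        b"\r\n",
    ]
    return b"".join(parts)


# The 2.7 equivalent is a single expression:
#     "%s %d %.2f\r\n" % (name, size, ratio)
```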
msg180453 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-23 09:58
> Whether b''.format() would have to lookup and call __format__ remains
> to be seen. From what I've read, maybe baking in knowledge of bytes,
> float, and int would be good enough. I suspect there might be some
> need for datetimes, but I could be wrong.

The __bytes__ method (and/or tp_buffer) may be a better discriminator than
__format__. It would also allow combining arbitrary buffer objects without
making tons of copies.
What it also means is that "format()" may not be the best method name for
this. It is less about formatting than about combining.

Also, it's not obvious what "formatting" a number as bytes should do.
Should it mimic the bytes constructor:

>>> bytes(5)
b'\x00\x00\x00\x00\x00'

Should it mimic the int to_bytes() method:

>>> (5).to_bytes(4, 'little')
b'\x05\x00\x00\x00'

Numbers currently don't have a __bytes__ method:

>>> (5).__bytes__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '__bytes__'
msg180454 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-01-23 10:00
I retract the datetime comment. Given what we're trying to accomplish, I think we only need to support types that are supported by 2.7's %-formatting.
msg180466 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-01-23 15:48
Remember, the only reason to add this would be to enable writing code
that works in both 2.7 and 3.4. So it has to be called .format() and
it has to format numbers as decimal strings by default.
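
For comparison, a small sketch of the three candidate renderings of an int discussed above, as they can be spelled on Python 3 today; only the first matches what 2.7's "%d" or "{}" produces:

```python
n = 5

decimal = str(n).encode("ascii")   # ASCII decimal digits, as 2.7 formatting gives
zeros = bytes(n)                   # n zero bytes (the bytes constructor)
packed = n.to_bytes(4, "little")   # fixed-width binary representation

print(decimal, zeros, packed)
```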
msg180489 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 18:57
On Jan 22, 2013, at 11:27 PM, Antoine Pitrou <report@bugs.python.org> wrote:

> Antoine Pitrou added the comment:
> 
> The "ASCII superset commands" part is clearly separated from the "binary
> data" part. Your own LineReceiver is able to switch between "raw mode"
> and "line mode"; one is text and the other is binary.

This is incorrect.  "Lines" are just CRLF (0x0D0A) separated chunks of data.  For example, SMTP is always in line-mode, but messages ("data lines") may contain arbitrary 8-bit data.

> This is a non-sequitur. You can fully well (...)
> So, yes, it is reasonably possible, and it even makes sense.

I concede it is possible to implement what you're talking about, but it still requires encoding things which are potentially 8-bit data.  Yes, there are many corners of protocols where said data looks like text, but it is an optical illusion.

>> even disregarding compatibility with a 2.x codebase, b''.join() and
>> b'' + b'' and (''.format()).encode('charmap') are all slower _and_
>> more awkward than simply b''.format() or b''%.
> 
> How can existing constructions be slower than non-existing constructions
> that don't have performance numbers at all?

Sorry, "in 2.x" :).

> Besides, if b''.join() is too slow, it deserves to be improved. Or
> perhaps you should try bytearray instead, or even io.BytesIO.

As others have noted, b''.join is *not* slower than b''.format for simply assembling strings; b''.join is indeed faster at that and I didn't mean to say it wasn't.  The performance improvement shows up when you are assembling complex messages that contain a smattering of ints, floats, and other chunks of bytes; mostly in that you can avoid a bunch of python code execution and python function calls when formatting those values.  The trouble with cooking up an example of this is that it starts to involve a bunch of additional code complexity and it requires careful framing to make sure the other complexity isn't what's getting in the way.  I will try to come up with one, maybe doing so will prove even this contention wrong.

But, the main issue here is expressiveness, not performance.
msg180490 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 18:58
On Jan 22, 2013, at 11:31 PM, Martin v. Löwis <report@bugs.python.org> wrote:

> I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)

You're correct about this; see my previous comment.
msg180491 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 19:00
On Jan 23, 2013, at 1:58 AM, Antoine Pitrou <report@bugs.python.org> wrote:

> Numbers currently don't have a __bytes__ method:
> 
>>>> (5).__bytes__()
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AttributeError: 'int' object has no attribute '__bytes__'

They do have some rather odd behavior when passed to the builtin though:

>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

It would be much more convenient for me if bytes(int) returned the ASCIIfication of that int; but honestly, even an error would be better than this behavior.  (If I wanted this behavior - which I never have - I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)
msg180492 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-01-23 19:02
> They do have some rather odd behavior when passed to the builtin
> though:
> 
> >>> bytes(10)
> b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> 
> It would be much more convenient for me if bytes(int) returned the
> ASCIIfication of that int; but honestly, even an error would be better
> than this behavior.  (If I wanted this behavior - which I never have -
> I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)

I would agree with you, but it's probably too late to change...
msg180493 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-01-23 19:04
On Jan 23, 2013, at 11:02 AM, Antoine Pitrou <report@bugs.python.org> wrote:

> I would agree with you, but it's probably too late to change...

Understandable, and, in any case, out of scope for this ticket.
msg180500 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-01-23 23:55
So it sounds like the use case is (as Glyph said in msg180432):

- Provide a transition for users of 2.7's str %-formatting into a style that's compatible with both str in 2.7 and bytes in 3.4.

In that case the only options I see are to implement __mod__ or .format for bytes in 3.4. I'd of course prefer to use .format, although __mod__ would probably make the transition easier (no need to move to .format first). It would probably also make the implementation easier, since there's so much less code in str.__mod__. But let's assume we're using .format [1].

Given the restricted use case, and assuming we're using .format, the implementation would not need to support:
- Types other than bytes, int, float.
- Subclasses of these types with custom formatting.
- !s, !r, or !a (none of the ! conversions). [2]

But it would support all of the specifiers for formatting strs (except now for bytes), floats, and ints.

I haven't looked through the str.format or {str,int,float}.__format__ code since the PEP 393 work, so I'm not really sure if we could stringlib-ify the code again, or if it would just be easier to reimplement it as separate bytes-only code.

[1] It's open for debate whether .format or .__mod__ is preferable.
[2] Since %-formatting supports %r and %s, this point is arguable.
msg198112 - (view) Author: Augie Fackler (durin42) Date: 2013-09-19 18:41
I'd like to put a nudge towards supporting the __mod__ interface on bytes - for Mercurial this is the single biggest impediment to even getting our testrunner working, much less starting the porting process.
msg199181 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-08 08:53
> I'd like to put a nudge towards supporting the __mod__ interface on bytes - 
> for Mercurial this is the single biggest impediment to even getting our
> testrunner working, much less starting the porting process.

Given a spec hasn't been written (bytes.__mod__ can't support the same things as str.__mod__), and nobody seems to step up to write it, I'd say this is unlikely to appear in 3.4.
msg199199 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 12:55
Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.

Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
msg199203 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-10-08 13:35
If you could write up a concrete proposal, including which format specifiers would be supported, that would be helpful.

Would it be extensible with something like __bformat__?

There's really quite a bit of work to be done to specify how this would work.
msg199204 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-10-08 13:38
Also, with the PEP 393 changes, the implementation will be much more difficult. Sharing code with str (unicode) will likely be impossible, or require much refactoring of the existing code.
msg199206 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-08 15:08
> Is there any chance we could just have it work for bytes, ints, and
> floats? That'd solve the immediate need, and it'd be obviously
> correct how to have those behave.

You mean "%s" and "%d"? 

> Punting this to 3.5 basically means we'll have to either wait for
> 3.5, or do something awful like use cffi to grab sprintf to port
> Mercurial.

Or write a pure Python implementation.
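
As a rough idea of what such a pure Python implementation could look like, here is a sketch limited to %s (bytes only), %d (ints only) and %%. The name bytes_mod is hypothetical, flags, widths and mapping keys such as %(name)s are not handled, and surplus arguments are not checked:

```python
import re

# Matches %s, %d and the literal-percent escape %%; anything else is left alone.
_SPEC = re.compile(rb"%([sd%])")


def bytes_mod(template: bytes, args: tuple) -> bytes:
    """Interpolate %s (bytes) and %d (int) fields into a bytes template."""
    values = iter(args)

    def replace(match):
        kind = match.group(1)
        if kind == b"%":
            return b"%"
        try:
            value = next(values)
        except StopIteration:
            raise TypeError("not enough arguments for format string")
        if kind == b"s":
            if not isinstance(value, (bytes, bytearray)):
                raise TypeError("%s requires a bytes-like argument")
            return bytes(value)
        if not isinstance(value, int):
            raise TypeError("%d requires an int argument")
        return str(value).encode("ascii")

    return _SPEC.sub(replace, template)


line = bytes_mod(b"%s %d 100%%\n", (b"M", 42))
```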
msg199207 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 15:10
On Tue, Oct 8, 2013 at 11:08 AM, Antoine Pitrou <report@bugs.python.org>wrote:

> > Is there any chance we could just have it work for bytes, ints, and
> > floats? That'd solve the immediate need, and it'd be obviously
> > correct how to have those behave.
>
> You mean "%s" and "%d"?
>

Basically, yes.

>
> > Punting this to 3.5 basically means we'll have to either wait for
> > 3.5, or do something awful like use cffi to grab sprintf to port
> > Mercurial.
>
> Or write a pure Python implementation.

Hah. Probably too slow for anything beyond a proof of concept, no?
msg199251 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-10-08 21:10
On Oct 8, 2013, at 8:10 AM, Augie Fackler <report@bugs.python.org> wrote:

> Hah. Probably too slow for anything beyond a proof of concept, no?

It should perform acceptably on PyPy ;-).
msg199253 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-08 21:11
> > > Punting this to 3.5 basically means we'll have to either wait for
> > > 3.5, or do something awful like use cffi to grab sprintf to port
> > > Mercurial.
> >
> > Or write a pure Python implementation.
> 
> Hah. Probably too slow for anything beyond a proof of concept, no?

If it's only for the Mercurial test suite, that shouldn't be a problem?
msg199254 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 21:17
On Tue, Oct 8, 2013 at 5:11 PM, Antoine Pitrou <report@bugs.python.org>wrote:

>
> Antoine Pitrou added the comment:
>
> > > > Punting this to 3.5 basically means we'll have to either wait for
> > > > 3.5, or do something awful like use cffi to grab sprintf to port
> > > > Mercurial.
> > >
> > > Or write a pure Python implementation.
> >
> > Hah. Probably too slow for anything beyond a proof of concept, no?
>
> If it's only for the Mercurial test suite, that shouldn't be a problem?

It's not just the testsuite though: we do this _all over_ hg itself. For
example, status needs to do something like this:

sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so we
have to treat the whole thing as opaque bytes. It's even more fun for
'log', because then it's got localized strings in it as well.
msg199258 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-10-08 21:24
2013/10/8 Augie Fackler <report@bugs.python.org>:
> sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
> 'some/filesystem/path'})
>
> except we don't know the encoding of the filesystem path (Hi unix!) so we
> have to treat the whole thing as opaque bytes.

You are doing it wrong. In Python 3, you "should" store filenames as
Unicode (str type). If Python fails to decode a filename, undecodable
bytes are stored as surrogate characters (see the PEP 383).

The Unicode type became natural in Python 3, as byte string (old "str"
type) was natural in Python 2.

sys.stdout.write() expects a Unicode string, not a byte string.

Does it mean that Mercurial is moving to Python 3? Cool :-)
msg199260 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2013-10-08 21:35
I've lost track what we were talking about. I thought we were trying to support b'<something>'.format() in 3.4, for a restricted set of arguments.

I don't see how a third-party package is going to help, if the goal is to allow 3.4 to be source compatible with 2.7. And the recent example uses %-formatting, which is not the subject of this ticket.

What proposal is actually on the table here?
msg199264 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-10-08 22:19
On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:

> What proposal is actually on the table here?

Sorry Eric, you're right, there is too much discussion here.  This issue ought to be about .format, like the title says.  There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task.  While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.
msg199265 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 22:19
On Oct 8, 2013, at 5:24 PM, STINNER Victor <report@bugs.python.org> wrote:

> 
> STINNER Victor added the comment:
> 
> 2013/10/8 Augie Fackler <report@bugs.python.org>:
>> sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
>> 'some/filesystem/path'})
>> 
>> except we don't know the encoding of the filesystem path (Hi unix!) so we
>> have to treat the whole thing as opaque bytes.
> 
> You are doing it wrong. In Python 3, you "should" store filenames as
> Unicode (str type). If Python fails to decode a filename, undecodable
> bytes are stored as surrogate characters (see the PEP 383).

No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way. We're not able to change either our on-disk data format OR our stdout format, even to support a newer version of Python. I don't know the encoding of the filename's bytes, but I _must_ faithfully reproduce them exactly as they are or I'll break tools like make(1) and patch(1). Similarly, if a file goes from ISO-8859-1 to UTF-8, I have to emit a diff that has some ISO bytes and some UTF bytes - it's not in *any* valid encoding. Changing that is a showstopper regression.

> The Unicode type became natural in Python 3, as byte string (old "str"
> type) was natural in Python 2.
> 
> sys.stdout.write() expects a Unicode string, not a byte string.

Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

> Does it mean that Mercurial is moving to Python 3? Cool :-)

Not likely, honestly. I tackle this when I've got some spare cycles and my ability to handle pain is high. As it stands, I have the test-runner barely working, but it's making wrong assumptions to get there. The best estimate is that it's a year of work to upgrade to Python 3.

msg199266 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 22:20
On Oct 8, 2013, at 6:19 PM, Glyph Lefkowitz <report@bugs.python.org> wrote:

> Glyph Lefkowitz added the comment:
> 
> On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:
> 
>> What proposal is actually on the table here?
> 
> Sorry Eric, you're right, there is too much discussion here.  This issue ought to be about .format, like the title says.  There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task.  While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.

Yeah, my bad too. I suppose I should add a new bug for %-formatting on bytes objects?

Note that for hg, we can't drop Python 2.6 or so (we'll only drop *2.4* if we can do 2.6 and some 3.x from a single source tree) for a while, due to supporting the system interpreter on a variety of LTS platforms.
msg199267 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-10-08 22:28
Augie, to understand what Victor meant, I suggest reading
http://www.python.org/dev/peps/pep-0383/
One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.
msg199268 - (view) Author: Augie Fackler (durin42) Date: 2013-10-08 22:31
On Oct 8, 2013, at 6:28 PM, "Terry J. Reedy" <report@bugs.python.org> wrote:

> http://www.python.org/dev/peps/pep-0383/
> One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.

At a quick skim, likely not good enough, because http://en.wikipedia.org/wiki/Shift_JIS isn't completely ASCII-compatible, and we've got a fair number of users on weird Shift-JIS using platforms.
msg199270 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-10-08 22:45
On Oct 8, 2013, at 3:19 PM, Augie Fackler wrote:

> No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way.

The PEP 383 technique for handling file names is completely capable of round-tripping exact bytes, given one encoding for both input and output.  You can still handle file names this way internally in Mercurial and not risk disturbing any observable output.  You do not need to change that in order to do what Victor suggests.
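
A short sketch of that round trip, assuming UTF-8 is used on both sides; the sample path is made up and deliberately contains bytes that are not valid UTF-8:

```python
raw = b"caf\xe9/\xff.txt"  # not decodable as UTF-8

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of errors.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the exact bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)
```

On POSIX systems os.fsdecode() and os.fsencode() apply the same error handler using the filesystem encoding.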

We should get together in some other forum and discuss file-name handling though, since you can't actually round-trip "opaque bytes" through a *filesystem* and not disturb your output.

> Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

You can use sys.stdout.buffer.write.
msg199271 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-10-09 00:13
Here is a proof of concept Python function, with a minimal test. It is similar to how str.format could be coded in Python, with re.split and ''.join, except that it does not allow anything before : in the format specification. By default (no format spec given), it copies bytes objects without change. If a format specification *is* given, it does not restrict the object, as this code simply uses builtin format sandwiched between decode and encode.
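
The following is not the attached byte_format.py, only an independent sketch along the lines described above, under stated assumptions: fields are either {} or {:spec}, bare fields must be bytes, and any field with a spec goes through the builtin format() and is then encoded as ASCII. Escaped braces ({{ and }}) are not handled.

```python
import re

# A field is "{}" or "{:spec}"; the optional group captures the spec.
_FIELD = re.compile(rb"\{(?::([^{}]*))?\}")


def byte_format(template: bytes, *args) -> bytes:
    """Imitate a small subset of str.format for a bytes template."""
    # re.split with one group alternates: literal, spec, literal, spec, ...
    pieces = _FIELD.split(template)
    literals, specs = pieces[0::2], pieces[1::2]
    if len(specs) != len(args):
        raise ValueError("expected %d arguments, got %d"
                         % (len(specs), len(args)))
    out = [literals[0]]
    for literal, spec, arg in zip(literals[1:], specs, args):
        if not spec:
            # No spec: copy a bytes object without change.
            if not isinstance(arg, bytes):
                raise TypeError("a bare {} field requires a bytes argument")
            out.append(arg)
        else:
            # Spec given: format as text, then encode the result.
            out.append(format(arg, spec.decode("ascii")).encode("ascii"))
        out.append(literal)
    return b"".join(out)


request = byte_format(b"{} {:d} {:.2f}\r\n", b"GET", 42, 3.14159)
```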
msg199432 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-10-11 01:18
> You can use sys.stdout.buffer.write.

Note that there's no guarantee that sys.stdout.buffer exists, e.g. if sys.stdout has been replaced with a StringIO.
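
A defensive sketch of how a caller might cope with that; write_bytes is a hypothetical helper, and the text fallback (an ASCII decode with surrogateescape) is an assumption of this example rather than an established convention:

```python
import io
import sys


def write_bytes(stream, data: bytes) -> None:
    """Write raw bytes to a text stream, using .buffer when it exists."""
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        stream.flush()  # keep text already written ahead of the raw bytes
        buffer.write(data)
    else:
        # e.g. io.StringIO: there is no binary layer, so store escaped text.
        stream.write(data.decode("ascii", "surrogateescape"))


write_bytes(sys.stdout, b"M some/filesystem/path\n")
```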
msg199438 - (view) Author: Glyph Lefkowitz (glyph) Date: 2013-10-11 02:01
Tempting as it is to reply to the comment about 'buffer' not existing, we're way off topic here.  Let's please keep further comments on this bug to issues about a 'format' method on the 'bytes' object.
History
Date User Action Args
2014-01-04 06:45:58 tshepang set nosy: + tshepang
2013-12-31 22:03:22 brett.cannon set versions: + Python 3.5, - Python 3.4
2013-12-31 22:03:12 brett.cannon set nosy: + brett.cannon
2013-10-11 14:27:44 gvanrossum set nosy: - gvanrossum
2013-10-11 07:09:17 Arfrever set nosy: + Arfrever
2013-10-11 02:01:25 glyph set messages: + msg199438
2013-10-11 01:18:50 ezio.melotti set messages: + msg199432
2013-10-09 00:14:41 terry.reedy set files: - byte_format.py
2013-10-09 00:13:57 terry.reedy set files: + byte_format.py; messages: + msg199271
2013-10-08 23:57:04 terry.reedy set files: + byte_format.py
2013-10-08 22:45:39 glyph set messages: + msg199270
2013-10-08 22:31:18 durin42 set messages: + msg199268
2013-10-08 22:28:02 terry.reedy set messages: + msg199267
2013-10-08 22:20:50 durin42 set messages: + msg199266
2013-10-08 22:19:42 durin42 set messages: + msg199265
2013-10-08 22:19:14 glyph set messages: + msg199264
2013-10-08 21:35:38 eric.smith set messages: + msg199260
2013-10-08 21:24:53 haypo set messages: + msg199258
2013-10-08 21:17:11 durin42 set messages: + msg199254
2013-10-08 21:11:40 pitrou set messages: + msg199253
2013-10-08 21:10:13 glyph set messages: + msg199251
2013-10-08 15:10:00 durin42 set messages: + msg199207
2013-10-08 15:08:36 pitrou set messages: + msg199206
2013-10-08 13:38:09 eric.smith set messages: + msg199204
2013-10-08 13:35:50 eric.smith set messages: + msg199203
2013-10-08 12:55:54 durin42 set messages: + msg199199
2013-10-08 08:53:47 pitrou set messages: + msg199181
2013-10-06 06:49:36 stendec set nosy: + stendec
2013-09-27 01:13:46 nlevitt@gmail.com set nosy: + nlevitt@gmail.com
2013-09-19 18:41:42 durin42 set nosy: + durin42; messages: + msg198112
2013-05-06 18:46:27 ecir.hana set nosy: + ecir.hana
2013-03-17 05:16:47 gregory.p.smith set nosy: + gregory.p.smith
2013-03-17 04:51:24 barry set nosy: + barry
2013-01-23 23:55:19 eric.smith set messages: + msg180500
2013-01-23 19:04:05 glyph set messages: + msg180493
2013-01-23 19:02:44 pitrou set messages: + msg180492
2013-01-23 19:00:46 glyph set messages: + msg180491
2013-01-23 18:58:20 glyph set messages: + msg180490
2013-01-23 18:57:51 glyph set messages: + msg180489
2013-01-23 16:25:21 flox set nosy: + flox
2013-01-23 15:48:17 gvanrossum set messages: + msg180466
2013-01-23 10:00:09 eric.smith set messages: + msg180454
2013-01-23 09:58:10 pitrou set messages: + msg180453
2013-01-23 09:36:15 eric.smith set messages: + msg180452
2013-01-23 08:02:35 gvanrossum set messages: + msg180449
2013-01-23 07:31:19 loewis set messages: + msg180448
2013-01-23 07:29:08 terry.reedy set messages: + msg180447
2013-01-23 07:27:37 pitrou set messages: + msg180446
2013-01-23 07:11:30 pitrou set messages: + msg180445
2013-01-23 01:03:22 glyph set messages: + msg180442
2013-01-23 00:59:30 glyph set messages: + msg180441
2013-01-22 23:34:32 terry.reedy set messages: + msg180439
2013-01-22 22:51:03 glyph set messages: + msg180437
2013-01-22 21:46:29 haypo set messages: + msg180436
2013-01-22 20:13:26 pitrou set messages: + msg180433
2013-01-22 19:59:51 glyph set messages: + msg180432
2013-01-22 19:39:58 pitrou set messages: + msg180431
2013-01-22 19:37:17 glyph set nosy: + glyph
2013-01-22 19:32:28 gvanrossum set messages: + msg180430
2013-01-22 19:17:23 ezio.melotti set messages: + msg180427
2013-01-22 19:16:09 pitrou set nosy: + pitrou; messages: + msg180426
2013-01-22 19:11:45 gvanrossum set messages: + msg180423
2013-01-22 18:48:21 benjamin.peterson set messages: + msg180420
2013-01-22 18:47:32 christian.heimes set messages: + msg180419
2013-01-22 18:27:55 gvanrossum set messages: + msg180416
2013-01-22 18:24:22 benjamin.peterson set messages: + msg180415
2013-01-22 18:05:52 gvanrossum set nosy: + gvanrossum; messages: + msg180414
2012-10-02 18:49:55 serhiy.storchaka set messages: + msg171824
2012-10-02 18:31:58 terry.reedy set messages: + msg171821
2012-10-02 16:48:58 benjamin.peterson set messages: + msg171816
2012-10-02 16:47:54 terry.reedy set messages: + msg171815
2012-10-02 14:52:22 benjamin.peterson set messages: + msg171806
2012-10-02 13:55:45 serhiy.storchaka set messages: + msg171804
2012-10-02 13:38:48 exarkun set messages: + msg171803
2012-10-02 13:22:00 eric.smith set messages: + msg171801
2012-10-02 13:18:35 exarkun set messages: + msg171800
2012-10-02 13:16:12 eric.smith set messages: + msg171799
2012-10-02 13:08:50 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg171796
2012-10-02 12:40:11 christian.heimes set nosy: + christian.heimes; messages: + msg171795; versions: + Python 3.4, - Python 3.1
2012-10-02 12:05:43 exarkun set status: closed -> open; nosy: + exarkun; messages: + msg171791; resolution: rejected ->
2012-06-21 23:41:28 terry.reedy set messages: + msg163379
2012-06-21 21:21:06 uau set messages: + msg163369
2012-04-15 06:52:43 vadmium set nosy: + vadmium
2011-10-25 05:33:27 ezio.melotti set nosy: + ezio.melotti
2011-03-07 19:09:19 terry.reedy set nosy: loewis, terry.reedy, haypo, eric.smith, benjamin.peterson, arjennienhuis, uau; messages: + msg130284
2011-03-07 12:34:54 arjennienhuis set nosy: loewis, terry.reedy, haypo, eric.smith, benjamin.peterson, arjennienhuis, uau; messages: + msg130253
2011-03-07 00:47:09 terry.reedy set nosy: + terry.reedy; messages: + msg130215
2011-01-27 18:54:02 uau set nosy: + uau; messages: + msg127210
2010-02-19 00:45:39 benjamin.peterson set status: open -> closed; resolution: rejected
2009-07-11 16:47:34 loewis set messages: + msg90428
2009-07-11 16:28:45 arjennienhuis set messages: + msg90425
2009-07-11 15:52:16 loewis set messages: + msg90423
2009-07-11 13:54:38 arjennienhuis set nosy: + arjennienhuis; messages: + msg90421
2009-03-24 23:37:06 eric.smith set messages: + msg84123
2009-03-24 23:28:28 haypo set messages: + msg84121
2008-09-29 21:33:09 loewis set messages: + msg74050
2008-09-29 10:56:38 haypo set messages: + msg74022
2008-09-29 10:50:58 eric.smith set messages: + msg74021
2008-09-29 10:22:20 haypo set nosy: + haypo; messages: + msg74019
2008-09-27 18:42:03 loewis set messages: + msg73939
2008-09-27 17:39:01 benjamin.peterson set messages: + msg73938
2008-09-27 17:35:36 loewis set nosy: + loewis; messages: + msg73937
2008-09-27 17:35:10 benjamin.peterson set messages: + msg73936
2008-09-27 17:33:52 eric.smith set messages: + msg73935
2008-09-27 15:50:40 benjamin.peterson create