Abstract ======== This PEP proposes adding %-interpolation to the bytes object. Rational ======== A distruptive but useful change introduced in Python 3.0 was the clean separation of byte strings (i.e. the "bytes" object) from character strings (i.e. the "str" object). The benefit is that character encodings must be explicitly specified and the risk of corrupting character data is reduced. Unfortunately, this separation has made writing certain types of programs more complicated and verbose. For example, programs that deal with network protocols often manipulate ASCII encoded strings. Since the "bytes" type does not support string formatting, extra encoding and decoding between the "str" type is required. For simplicity and convenience it is desireable to introduce formatting methods to "bytes" that allow formatting of ASCII-encoded character data. This change would blur the clean separation of byte strings and character strings. However, it is felt that the practical benefits outweigh the purity costs. The implicit assumption of ASCII-encoding would be limited to formatting methods. One source of many problems with the Python 2 Unicode implementation is the implicit coercion of Unicode character strings into byte strings using the "ascii" codec. If the character strings contain only ASCII characters, all was well. However, if the string contains a non-ASCII character then coercion causes an exception. The combination of implicit coercion and value dependent failures has proven to be a recipe for hard to debug errors. A program may seem to work correctly when tested (e.g. string input that happened to be ASCII only) but later would fail, often with a traceback far from the source of the real error. The formatting methods for bytes() should avoid this problem by not implicitly encoding data that might fail based on the content of the data. Proposed semantics for bytes formatting ======================================= Special method __ascii__ ------------------------ A new special method, analogous to __str__, is introduced. This method takes no arguments and returns a str object. The return value should contain only ASCII characters. Only objects with obvious ASCII representations should implement __ascii__. Objects with natural byte representations should implement __bytes__ or the Py_buffer API. Objects containing non-ASCII character data should implement encode(). %-interpolation --------------- All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers. Examples:: >>> b'%4x' % 10 b' a' >>> b'%-4.2d' % -1 b'-01 ' >>> b'%d' % 1.2 b'1' %c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1. Examples:: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' %s will accept any object that can be coerced into bytes or implements the __ascii__ method, otherwise a TypeError is raised. Examples:: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 b'3.14' >>> b'%4s' % 12 b' 12' >>> b'%s' % 'hi' Traceback (most recent call last): ... TypeError: 'hi' has no __ascii__ method, perhaps you need to encode it? %a will accept any object, call ascii() on it and insert the result. Examples:: >>> b'%a' % 1.2 b'1.2' >>> b'%a' % 'hi' b"'hi'" To ease porting of Python 2 code, %r is an alias for %a. format ------ Implementing a format() method is outside the scope of this proposal. Proposed variations =================== Instead of introducing a new special method, have numeric types implement __bytes__. - Adding __bytes__ to the int object is not backwards compatible. bytes() already has an incompatible meaning. It has been suggested to use %b for bytes instead of %s. - Rejected, using %s will making porting code from Python 2 easier. It was suggested to disallow %s from accepting numbers. - Rejected, to ease porting of Python 2 code, %s should accept number operands. It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s. - Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed. It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'"). - Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed. Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: