PEP: XXX
Title: Adding peek to io ABCs
Version: $Revision$
Last-Modified: $Date$
Author: Andrew Barnert
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 19-Jul-2014
Python-Version: 3.5
Post-History:


Abstract
========

This PEP proposes an extension to the stream protocol, and its
formalization in the ``io`` module ABCs, to allow file objects to
expose their buffering sufficiently to make it easier to write
wrappers.


Rationale
=========

In the 2014 discussion on adding support for alternative newlines to
the ``io`` classes [#thread2014]_, the biggest objection was that it
would be better for users to just wrap the file objects to add the
functionality they need than to change the stdlib to add that
functionality. However, as explained in the related PEP
[#pep-newline]_, the current design of the ``io`` module makes this
impossible.

The problem
-----------

Briefly, some of the functionality provided by the ``io`` classes
requires access to the internal buffers, which subclasses, delegators,
and other users of the ``io`` classes do not have.

In the context of the alternative-newlines discussion, the problem is
that ``TextIOWrapper.readline`` calls private methods that manipulate
its buffer. There's no way for an alternative implementation to do the
same thing.

You can, of course, write an alternative implementation that reads
character by character, but that is much slower.

Alternatively, you can write a wrapper that does its own buffering.
That's not as slow as going character by character, but it's still
nowhere near as fast as ``readline``, and it has some added problems.
First, it means you have two iterators that share the same file but
have separate buffers and out-of-sync file pointers. Second, the
object won't be a file-like object. Making it support the entire
``TextIOBase`` API is obviously not impossible, but it's a pretty
significant amount of work.
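
As a rough illustration of the self-buffering workaround and its
out-of-sync problem, here is a minimal sketch. The helper name
``iter_delimited`` and its parameters are hypothetical, invented for
this example; only ``io.StringIO`` and ``read`` are real stdlib API.

```python
import io


def iter_delimited(stream, delimiter="\r", chunk_size=8192):
    """Yield delimiter-terminated pieces of a text stream.

    Hypothetical helper (not stdlib): it keeps a private buffer,
    so it reads ahead of whatever it has yielded so far.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last delimiter is complete; the tail
        # stays in the private buffer until more data arrives.
        *complete, pending = pending.split(delimiter)
        for piece in complete:
            yield piece + delimiter
    if pending:
        yield pending


stream = io.StringIO("one\rtwo\rthree\r")
lines = iter_delimited(stream)
first = next(lines)   # only one "line" has been handed out...
rest = stream.read()  # ...but the underlying stream is already drained
remaining = list(lines)
```

Here ``first`` is ``"one\r"``, yet ``rest`` is empty: the other two
lines live only in the generator's private buffer, so any other code
that reads from ``stream`` directly will never see them.
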
A relatively small change would fix that, making not only that PEP,
but other potential use cases (not that anyone has identified any
yet...), much easier for end users or PyPI projects to solve on their
own.

Other languages
---------------

In a low-level language like C, iterating character by character over
a buffered stream is not much of a problem; if the iteration is
written as a macro or inline function (like C's ``getc``), it comes
down to checking, dereferencing, and incrementing a pointer in a
structure (and of course reading more data when the buffer is
exhausted). There's basically no overhead, so you can write anything
you can think of as efficiently as the stdlib functions.

This means the stdlib functions don't have to handle every case you
could ever think of. ``fgets`` is hardcoded to break on ``'\n'``; if
you want to break on ``'\r'`` or ``'\0'`` instead, you write your own
``fgets``-like function.

In higher-level languages, this may not be acceptable. Perl, Ruby,
PHP, and other languages effectively try to provide everything you
might ever want in their file classes.

But Java takes a different approach. The ``java.io`` package
[#java-io]_ is roughly equivalent to Python's ``io`` module. However,
it has two major differences that are relevant here.

First, it's built around the notion of composable filters. There are
filters to wrap a raw binary stream in a text stream or a struct-like
data stream, to add a buffer to any binary or text stream, to add line
buffering to any buffered stream, to add line numbering to any
line-buffered stream, etc.

Second, each one of these pieces provides as little functionality as
possible, instead exposing just enough to make it easy to build
anything else you need. For example, ``BufferedInputStream`` adds
methods ``mark`` and ``reset`` that allow anyone to peek ahead in the
file. ``BufferedReader`` uses this to build its ``readLine`` method
[#java-readline]_.
If you want different functionality than ``readLine`` provides, you
can just build your own method that does something similar, and it can
be just as simple and efficient as the standard version. (And of
course you can still layer line numbering, etc. on top of it, if you
want.)

As usual, Java may have gone a little overboard with the design, and
gotten stuck with some early ideas that didn't pan out. For example,
the filter concept can't be used quite as generally as intended (or at
least couldn't in Java 1.1), so the interfaces turn out to be not all
that useful and you end up just directly using ``BufferedReader`` or
one of its subclasses, etc. But the basic idea is sound.

The design of C++ ``iostreams`` is roughly similar to Java's ``io``
package, although the details are pretty different. A stream is a
pretty simple thing, and it's relatively easy to layer new streams on
top of old ones any way you want to. Unfortunately, almost nobody
actually understands ``iostreams`` well enough to do that;
fortunately, ``Boost.Iostreams`` provides a bunch of wrappers that let
you write things in terms of chains of filters that are a lot easier
to understand (and avoids some of the mistakes Java made--for example,
decoding is done by a ``code_converter``, which is not chained in the
same way as filters).

Haskell takes a completely different approach. It has lazy strings
(which are basically just lazy lists of characters), and string
functions like ``split`` are inherently lazy. So, all you really need
is ``readFile`` [#haskell-readFile]_ to lazily read the entire file.
There's no need to explicitly expose the buffer; it's just part of the
string that's already been evaluated. So, if you want to do the
equivalent of ``mark`` and ``split``, you just need to hold onto the
marked position to keep everything from that point on alive.
Very clever, but of course it relies on some complicated compiler
optimization to avoid actually generating a new cons, etc., for each
character as you iterate.

Anyway, Python already has effectively a radically stripped-down
version of the Java design. While some pieces of that design might be
nice to import (with changes) into Python, there's no need to go too
far. Really, the only thing that's necessary is having some consistent
way to access the buffer.


Specification
=============

``IOBase`` will gain a new optional method, ``peek``, defined as:

``peek([size])``
    Return bytes from the stream without advancing the position. At
    most one single read on the raw stream is done to satisfy the
    call. The number of bytes returned may be less or more than
    requested. This is not part of the IOBase API and may not exist on
    some implementations.

This is effectively already there today, just not documented as such.
``BufferedReader`` and various other stdlib binary file types such as
``bz2.BZ2File`` provide this method, and ``IOBase.readline`` will use
it if present (and will fall back to ``read(1)`` byte by byte
otherwise).

``TextIOBase`` will gain a corresponding optional method that works
with characters instead of bytes (exactly as it does for most of the
other ``IOBase`` methods):

``peek([size])``
    Return characters from the stream without advancing the position.
    At most one single read on the raw stream is done to satisfy the
    call. The number of characters returned may be less or more than
    requested. This is not part of the TextIOBase API and may not
    exist on some implementations.

``TextIOBase.readline`` will, instead of raising
``UnsupportedOperation``, implement the ``readline`` behavior on top
of ``peek`` (if present) or ``read(1)`` (if not), in roughly the same
way ``IOBase`` does. Of course it has to handle universal newline
behavior, so it won't be *quite* as simple as the binary version, but
it's not too difficult.
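
To make the "``peek`` if present, ``read(1)`` if not" strategy
concrete, here is a minimal sketch for binary streams, where ``peek``
already exists on ``io.BufferedReader``. The function name
``readline_until`` is hypothetical, and the restriction to one-byte
delimiters is a simplifying assumption of this sketch rather than
anything the PEP requires.

```python
import io


def readline_until(stream, delimiter=b"\0"):
    """Read up to and including `delimiter` from a binary stream.

    Hypothetical helper, not stdlib. Uses peek() when the stream has
    it and falls back to one-byte reads otherwise.
    """
    if len(delimiter) != 1:
        raise ValueError("this sketch only handles one-byte delimiters")
    peek = getattr(stream, "peek", None)
    chunks = []
    while True:
        if peek is not None:
            # peek() may return more or fewer bytes than requested,
            # so scan whatever came back.
            window = peek(1)
            if not window:
                break  # end of file
            end = window.find(delimiter)
            # Consume through the delimiter, or the whole window.
            n = len(window) if end < 0 else end + 1
        else:
            n = 1  # no buffer access: byte by byte
        data = stream.read(n)
        if not data:
            break
        chunks.append(data)
        if data.endswith(delimiter):
            break
    return b"".join(chunks)


buffered = io.BufferedReader(io.BytesIO(b"a\0bc\0d"))
records = [readline_until(buffered) for _ in range(4)]

# io.BytesIO has no peek(), so this exercises the read(1) fallback.
plain = io.BytesIO(b"x\ry\r")
fallback = [readline_until(plain, b"\r") for _ in range(2)]
```

Because only the bytes up to the delimiter are consumed, the stream's
own position stays in sync with what the caller has seen, which is the
property the self-buffering wrapper lacks. A text version would follow
the same shape once ``TextIOBase.peek`` exists, with extra handling
for universal newlines.
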
``TextIOWrapper`` will implement ``peek`` on top of
``self->decoded_chars`` and ``textiowrapper_read_chunk`` (or, in
``_pyio``, ``self._get_decoded_chars()``,
``self._set_decoded_chars()``, and ``self._read_chunk()``).

``TextIOWrapper.readline`` could either stay there unchanged, or be
removed; it comes down to whether the generic version can be as
efficient as the version that uses private members.


Performance
===========

From a quick test, here's the time for different alternative
``readline`` implementations, as a multiplier of the time taken by
``TextIOWrapper.readline``:

* 6.2x: ``_pyio.TextIOWrapper.readline``.
* 2.7x: C-implemented character by character.
* 27.x: Python-implemented character by character.
* 1.2x: new ``TextIOBase.readline``.
* 1.9x: new ``_pyio.TextIOBase.readline`` on ``TextIOWrapper``.
* 9.2x: new ``_pyio.TextIOBase.readline`` on ``_pyio.TextIOWrapper``.
* 3.4x: ``resplit`` around ``TextIOWrapper.read``

Take these with a grain of salt, as it's an unscientific single
``%timeit`` around an incomplete patch... But assuming these numbers
hold up, I think that means:

* We need to keep ``TextIOWrapper.readline`` as an optimization; a
  20% slowdown on all text processing is not acceptable.
* While a 90% slowdown for adding your own ``readline`` style method
  isn't wonderful, it's a whole lot faster than any of the
  alternatives that work today, as well as being cleaner than any
  (except the character-by-character read, which is too slow to be
  considered).


The shed
========

Why ``peek``?
-------------

There are other ways to provide access to the buffer besides ``peek``.

In Java, you ``mark`` the file, and ``reset`` to a previous mark. This
would be easier for some file-like types to implement than ``peek``,
harder for others.
(Also, in some cases, it might be easier, but not necessarily
desirable--e.g., a non-buffered but seekable file could do it with
``tell`` and ``seek``, but that could be a lot slower in some
cases--or, worse, fast on some filesystems and slow on others.)

In C++, you read whatever you want, then ``putback`` whatever you
didn't want. It's a little more complicated to implement, but a little
easier to understand. It's closer to what ``TextIOWrapper`` does under
the covers. It has the advantage that you can ``putback`` things you
didn't actually read--although that might actually be a disadvantage.

So, why is ``peek`` better than ``mark`` or ``putback``? Really, it's
only because a lot of code outside the ``io`` module (maybe even
outside the stdlib) already provides ``peek`` for binary files. Unless
there's a compelling reason to do otherwise, it's a lot better to just
document and standardize existing practice than to replace it.

Why stop here?
--------------

Fundamentally changing the API of the ``io`` module, even in a
relatively small way, is obviously not something to be undertaken
lightly. If it's worth doing at all, isn't it worth re-evaluating the
design to make sure ``peek`` is really sufficient for everything that
might come up in the future, rather than just solving this one use
case and possibly having to change the whole design again two years
from now?

First, this should be a very painless change, which will not affect
any existing third-party code, unless someone happens to have added a
method named ``peek`` to a ``TextIOBase`` that does something
different (which seems unlikely). It's hard to imagine the same being
true for any more radical change. So, there's no reason to hold this
change in abeyance until there's a more radical change worth doing.

Second, there's nothing fundamentally wrong with the ``io`` module
design in the first place; it's missing one method, but the
organization of the classes and how they interact is perfectly fine.
But there is one idea that's at least defensible, hence the next
section.

Why not instead make it easier to wrap streams?
-----------------------------------------------

In Python, unlike C++ and Java, there's no natural way to write a
``TextIOBase`` that acts like a filter on another ``TextIOBase``.
Adding a ``peek`` method makes it easier to implement whatever methods
you want relatively efficiently and simply, but it's still not nearly
as easy as it could be to implement the full API.

But other features in Python make this less necessary than in Java.
It's trivial to wrap an object and delegate all methods to it, either
dynamically via ``__getattr__`` or by reading its dict and building
wrappers at class or object construction time.

Also, in many cases, this isn't even really necessary; you can just
subclass ``TextIOWrapper`` or ``BufferedReader``, then construct an
instance of your class out of the other instance's ``buffer`` or
``raw`` member. This isn't quite as easy as it could be (see issue
14017 [#issue14017]_), but that's a problem that could be fixed easily
if anyone needed it badly enough.

And of course there's nothing stopping you from writing your
``readline`` style method as a free function, because it's only
accessing public members; Python doesn't even approach a "pure OO"
language where free functions are forbidden.

If worst comes to worst, you can always just monkeypatch the instance.

So, it's not that this is a bad idea, but that it's probably just not
worth the effort.


References
==========

.. [#thread2014] http://thread.gmane.org/gmane.comp.python.ideas/28310

.. [#pep-newline] At this point, still an unsubmitted draft, but it
   can be found in [#thread2014]_.

.. [#java-io] http://docs.oracle.com/javase/7/docs/api/java/io/package-summary.html

.. [#java-readline] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()

.. [#haskell-readFile] https://hackage.haskell.org/package/base-4.7.0.0/docs/System-IO.html

.. [#issue14017] http://bugs.python.org/issue14017


Copyright
=========

This document has been placed in the public domain.