PEP: XXX
Title: Adding peek to io ABCs
Version: $Revision$
Last-Modified: $Date$
Author: Andrew Barnert <abarnert AT yahoo.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 19-Jul-2014
Python-Version: 3.5
Post-History: 

Abstract
========

This PEP proposes an extension to the stream protocol, and its
formalization in the ``io`` module ABCs, to allow file objects to
expose their buffering sufficiently to make it easier to write
wrappers.

Rationale
=========

In the 2014 discussion on adding support for alternative newlines to
the ``io`` classes [#thread2014]_, the biggest objection was that it
would be better for users to just wrap the file objects to add the
functionality they need than to change the stdlib to add that
functionality.

However, as explained in the related PEP [#pep-newline]_, the current
design of the ``io`` module makes this impossible. 

The problem
-----------

Briefly, some of the functionality provided by the ``io`` classes
requires access to the internal buffers, which subclasses, delegators,
and other users of the ``io`` classes do not have.

In the context of the alternative-newlines discussion, the problem is
that ``TextIOWrapper.readline`` calls private methods that manipulate
its buffer. There's no way for an alternative implementation to do the
same thing.

You can, of course, write an alternative implementation that reads
character by character, but that is much slower.
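As a rough illustration of that approach (a sketch only; the helper name
``readline_by_char`` and the choice of ``'\r'`` as terminator are
assumptions made for this example), a character-by-character
``readline`` might look like this:

```python
import io


def readline_by_char(f, terminator="\r"):
    """Read up to and including `terminator`, one character at a time."""
    chars = []
    while True:
        c = f.read(1)      # one method call per character: this is the slow part
        if not c:          # empty string means EOF
            break
        chars.append(c)
        if c == terminator:
            break
    return "".join(chars)


# StringIO's default newline="\n" performs no newline translation,
# so the "\r" characters reach the helper unchanged.
f = io.StringIO("spam\reggs\rham")
lines = []
while True:
    line = readline_by_char(f)
    if not line:
        break
    lines.append(line)
```

It only uses public methods, so it works on any text file object, but
every character costs a full Python-level method call.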

Alternatively, you can write a wrapper that does its own
buffering. That's not as slow as going character by character, but
it's still nowhere near as fast as ``readline``, and it has some added
problems. First, it means you have two iterators that share the same
file but have separate buffers and out-of-sync file pointers. Second,
the object won't be a file-like object. Making it support the entire
``TextIOBase`` API is certainly possible, but it's a significant
amount of work.

A relatively small change would fix that, making not only the
alternative-newlines feature but also other potential use cases (not
that anyone has identified any yet...) much easier for end users or
PyPI projects to implement on their own.

Other languages
---------------

In a low-level language like C, iterating character by character over
a buffered stream is not much of a problem; if the iteration is
written as a macro or inline function (like C's ``getc``), it comes
down to checking, dereferencing, and incrementing a pointer in a
structure (and of course reading more data when the buffer is
exhausted). There's basically no overhead, so you can write anything
you can think of as efficiently as the stdlib functions. This means
the stdlib functions don't have to handle every case you could ever
think of. ``fgets`` is hardcoded to break on ``'\n'``; if you want to
break on ``'\r'`` or ``'\0'`` instead, you write your own
``fgets``-like function.

In higher-level languages, this may not be acceptable. Perl, Ruby,
PHP, and other languages effectively try to provide everything you
might ever want in their file classes.

But Java takes a different approach. The ``java.io`` package
[#java-io]_ is roughly equivalent to Python's ``io`` module. However,
it has two major differences that are relevant here.

First, it's built around the notion of composable filters. There are
filters to wrap a raw binary stream in a text stream or a struct-like
data stream, to add a buffer to any binary or text stream, to add line
buffering to any buffered stream, to add line numbering to any
line-buffered stream, etc.

Second, each one of these pieces provides as little functionality as
possible, instead exposing just enough to make it easy to build
anything else you need. For example, ``BufferedInputStream`` adds
methods ``mark`` and ``reset`` that allow anyone to peek ahead in the
file. ``BufferedReader`` uses this to build its ``readLine`` method
[#java-readline]_.
If you want different functionality than ``readLine`` provides, you
can just build your own method that does something similar, and it can
be just as simple and efficient as the standard version. (And of
course you can still layer line numbering, etc. on top of it, if you
want.)

As usual, Java may have gone a little overboard with the design, and
gotten stuck with some early ideas that didn't pan out. For example,
the filter concept can't be used quite as generally as intended (or at
least couldn't in Java 1.1), so the interfaces turn out to be not all
that useful and you end up just directly using ``BufferedReader`` or
one of its subclasses, etc. But the basic idea is sound.

The design of C++ ``iostreams`` is roughly similar to Java's ``io``
package, although the details are pretty different. A stream is a
pretty simple thing, and it's relatively easy to layer new streams on
top of old ones any way you want to. Unfortunately, almost nobody
actually understands ``iostreams`` well enough to do that;
fortunately, Boost.Iostreams provides a bunch of wrappers that let
you write things in terms of chains of filters that are a lot easier
to understand (and that avoid some of the mistakes Java made--for
example, decoding is done by a ``code_converter``, which is not
chained in the same way as filters).

Haskell takes a completely different approach. It has lazy strings
(which are basically just lazy lists of characters), and string
functions like ``split`` are inherently lazy. So, all you really need
is ``readFile`` [#haskell-readFile]_ to lazily read the entire
file. There's no need to explicitly expose the buffer; it's just part
of the string that's already been evaluated. So, if you want to do the
equivalent of ``mark`` and ``split``, you just need to hold onto the
marked position to keep everything from that point on alive. Very
clever, but of course it relies on some complicated compiler
optimization to avoid actually generating a new cons, etc., for each
character as you iterate.

Anyway, Python effectively already has a radically stripped-down
version of the Java design. While some pieces of that design might be
nice to import (with changes) into Python, there's no need to go too
far. Really, the only thing that's necessary is having some consistent
way to access the buffer.

Specification
=============

``IOBase`` will gain a new optional method, ``peek``, defined as:

    ``peek([size])``

        Return bytes from the stream without advancing the
        position. At most one single read on the raw stream is done to
        satisfy the call. The number of bytes returned may be less or
        more than requested.

        This is not part of the IOBase API and may not exist on some
        implementations.

This is effectively the status quo, just not documented as
such. ``BufferedReader`` and various other stdlib binary file types
such as ``bz2.BZ2File`` already provide this method, and
``IOBase.readline`` uses it if present (falling back to reading byte
by byte with ``read(1)`` otherwise).
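To show how far the existing binary ``peek`` already gets you, here is
a minimal sketch of a ``readline``-style free function that breaks on
a custom single-byte terminator. The name ``readline_peek`` and the
NUL terminator are assumptions for illustration; only
``BufferedReader.peek`` and ``read`` come from the stdlib:

```python
import io


def readline_peek(f, terminator=b"\0"):
    """Read up to and including a single-byte `terminator` using peek()."""
    chunks = []
    while True:
        buffered = f.peek(1)    # may return more or fewer bytes than requested
        if not buffered:        # empty result: end of file
            break
        end = buffered.find(terminator)
        if end >= 0:
            # Consume only up to and including the terminator.
            chunks.append(f.read(end + 1))
            break
        # No terminator in sight: consume everything we were shown.
        chunks.append(f.read(len(buffered)))
    return b"".join(chunks)


# A tiny buffer_size forces records to straddle buffer refills.
stream = io.BufferedReader(io.BytesIO(b"spam\0eggs\0ham"), buffer_size=4)
records = []
while True:
    record = readline_peek(stream)
    if not record:
        break
    records.append(record)
```

Each loop iteration handles a whole buffer's worth of data instead of
a single byte, which is where the speed-up over ``read(1)`` comes from.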

``TextIOBase`` will gain a corresponding optional method that works
with characters instead of bytes (exactly as it does for most of the
other ``IOBase`` methods):

    ``peek([size])``

        Return characters from the stream without advancing the
        position. At most one single read on the raw stream is done to
        satisfy the call. The number of characters returned may be
        less or more than requested.

        This is not part of the TextIOBase API and may not exist on some
        implementations.

``TextIOBase.readline`` will, instead of raising
``UnsupportedOperation``, implement the ``readline`` behavior on top
of ``peek`` (if present) or ``read(1)`` (if not), in roughly the same
way ``IOBase`` does. Of course it has to handle universal newline
behavior, so it won't be *quite* as simple as the binary version, but
it's not too difficult.
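A rough sketch of what such a ``peek``-based text ``readline`` could
look like follows. Everything here is hypothetical: ``PeekableStringIO``
merely emulates the proposed text ``peek`` with ``tell`` and ``seek``,
and ``readline_universal`` recognizes the three universal line endings
but, unlike the real universal-newlines mode, does not translate them:

```python
import io


class PeekableStringIO(io.StringIO):
    """Hypothetical stand-in for a text stream with the proposed peek()."""

    def peek(self, size=1):
        # Emulated with tell/seek; a real implementation would expose
        # its internal buffer instead of re-reading.
        pos = self.tell()
        data = self.read(size)
        self.seek(pos)
        return data


def readline_universal(f, hint=64):
    """Read one line ending in LF, CR or CRLF (endings are not translated)."""
    parts = []
    while True:
        chunk = f.peek(hint)
        if not chunk:                       # EOF
            break
        idx = next((i for i, c in enumerate(chunk) if c in "\r\n"), -1)
        if idx < 0:
            parts.append(f.read(len(chunk)))
            continue
        parts.append(f.read(idx + 1))
        # A "\n" right after "\r" belongs to the same line ending; it may
        # only become visible with a second peek.
        if chunk[idx] == "\r" and f.peek(1)[:1] == "\n":
            parts.append(f.read(1))
        break
    return "".join(parts)


f = PeekableStringIO("a\r\nb\rc\nd")
lines = [readline_universal(f) for _ in range(4)]
```

The only subtle part is the ``'\r\n'`` pair, which can straddle two
``peek`` results; the second ``peek(1)`` handles that case.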

``TextIOWrapper`` will implement ``peek`` on top of
``self->decoded_chars`` and ``textiowrapper_read_chunk`` (or, in
``_pyio``, ``self._get_decoded_chars()``,
``self._set_decoded_chars()``, and ``self._read_chunk()``).

``TextIOWrapper.readline`` could either stay there unchanged, or be
removed; it comes down to whether the generic version can be as
efficient as the version that uses private members.

Performance
===========

From a quick test, here's the time for different alternative
``readline`` implementations, as a multiplier of the time taken by
``TextIOWrapper.readline``:

* 6.2x: ``_pyio.TextIOWrapper.readline``.
* 2.7x: C-implemented character by character.
* 27x: Python-implemented character by character.
* 1.2x: new ``TextIOBase.readline``.
* 1.9x: new ``_pyio.TextIOBase.readline`` on ``TextIOWrapper``.
* 9.2x: new ``_pyio.TextIOBase.readline`` on ``_pyio.TextIOWrapper``.
* 3.4x: ``resplit`` around ``TextIOWrapper.read``.

Take these with a grain of salt, as they come from a single,
unscientific ``%timeit`` run against an incomplete patch. But assuming
these numbers hold up, I think that means:

* We need to keep ``TextIOWrapper.readline`` as an optimization; a 20%
  slowdown on all text processing is not acceptable.
* While a 90% slowdown for adding your own ``readline`` style method
  isn't wonderful, it's a whole lot faster than any of the
  alternatives that work today, as well as being cleaner than any
  (except the character-by-character read, which is too slow to be
  considered).

The shed
========

Why ``peek``?
-------------

There are other ways to provide access to the buffer besides ``peek``.

In Java, you ``mark`` the file, and ``reset`` to a previous mark. This
would be easier for some file-like types to implement than ``peek``,
harder for others. (Also, in some cases, it might be easier, but not
necessarily desirable--e.g., a non-buffered but seekable file could do
it with ``tell`` and ``seek``, but that could be a lot slower in some
cases--or, worse, fast on some filesystems and slow on others.)

In C++, you read whatever you want, then ``putback`` whatever you
didn't want. It's a little more complicated to implement, but a little
easier to understand. It's closer to what ``TextIOWrapper`` does under
the covers. It has the advantage that you can ``putback`` things you
didn't actually read--although that might actually be a disadvantage.

So, why is ``peek`` better than ``mark`` or ``putback``?

Really, it's only because a lot of code outside the ``io`` module
(maybe even outside the stdlib) already provides ``peek`` for binary
files. Unless there's a compelling reason to do otherwise, it's a lot
better to just document and standardize existing practice than to
replace it.

Why stop here?
--------------

Fundamentally changing the API of the ``io`` module, even in a
relatively small way, is obviously not something to be undertaken
lightly. If it's worth doing at all, isn't it worth re-evaluating
the design to make sure ``peek`` is really sufficient for everything
that might come up in the future, rather than just solving this one
use case and possibly having to change the whole design again two
years from now?

First, this should be a very painless change, which will not affect
any existing third-party code, unless someone happens to have added a
method named ``peek`` to a ``TextIOBase`` that does something
different (which seems unlikely). It's hard to imagine the same being
true for any more radical change. So, there's no reason to hold this
change in abeyance until there's a more radical change worth doing.

Second, there's nothing fundamentally wrong with the ``io`` module
design in the first place; it's missing one method, but the
organization of the classes and how they interact is perfectly fine.

But there is one idea that's at least defensible, hence the next
section.

Why not instead make it easier to wrap streams?
-----------------------------------------------

In Python, unlike C++ and Java, there's no natural way to write a
``TextIOBase`` that acts like a filter on another ``TextIOBase``.

Adding a ``peek`` method makes it easier to implement whatever methods
you want relatively efficiently and simply, but it's still not nearly
as easy as it could be to implement the full API.

But other features in Python make this less necessary than in Java.

It's trivial to wrap an object and delegate all methods to it,
either dynamically via ``__getattr__``, by reading its dict and
building wrappers dynamically at either class or object construction
time, etc.
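For instance, dynamic delegation via ``__getattr__`` can be sketched in
a few lines. The class name ``UpperCaseReader`` and its behavior are
made up purely for this example:

```python
import io


class UpperCaseReader:
    """Hypothetical wrapper: overrides read(), delegates everything else."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def read(self, size=-1):
        return self._wrapped.read(size).upper()

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so the read()
        # defined above is never routed through here.
        # Note: special methods such as __iter__ or __enter__ are looked
        # up on the type and are NOT delegated by this mechanism.
        return getattr(self._wrapped, name)


reader = UpperCaseReader(io.StringIO("spam\neggs"))
first = reader.read(4)        # handled by the wrapper
position = reader.tell()      # delegated to the StringIO
rest = reader.readline()      # delegated, so not upper-cased
```

The wrapper is not a ``TextIOBase`` as far as ``isinstance`` is
concerned, which is one reason this is easy but not quite the same as
implementing the full API.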

Also, in many cases, this isn't even really necessary; you can just
subclass ``TextIOWrapper`` or ``BufferedReader``, then construct an
instance of your class out of the other instance's ``buffer`` or
``raw`` member. This isn't quite as easy as it could be (see issue
14017 [#issue14017]_), but that's a problem that could be fixed easily
if anyone needed it badly enough.
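A minimal sketch of that subclass-and-rewrap approach is shown below.
The subclass ``ShoutingWrapper`` is hypothetical, and the sketch uses
``detach()`` rather than reading the ``buffer`` attribute directly, so
that only one wrapper owns the underlying buffer; that choice is an
assumption of this example, not something this PEP prescribes:

```python
import io


class ShoutingWrapper(io.TextIOWrapper):
    """Hypothetical subclass that customizes readline() only."""

    def readline(self, size=-1):
        return super().readline(size).upper()


original = io.TextIOWrapper(io.BytesIO(b"spam\neggs\n"), encoding="ascii")

# detach() hands back the underlying binary buffer and leaves `original`
# unusable. The encoding has to be passed again by hand.
rewrapped = ShoutingWrapper(original.detach(), encoding="ascii")

first = rewrapped.readline()   # customized
rest = rewrapped.read()        # inherited, unchanged
```

Having to repeat the constructor arguments by hand is part of the
"not quite as easy as it could be" mentioned above.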

And of course there's nothing stopping you from writing your
``readline`` style method as a free function, because it's only
accessing public members; Python doesn't even approach a "pure OO"
language where free functions are forbidden.

If worst comes to worst, you can always just monkeypatch the instance.

So, it's not that this is a bad idea, but that it's probably just not worth
the effort.

References
==========

.. [#thread2014] http://thread.gmane.org/gmane.comp.python.ideas/28310
.. [#pep-newline] At this point, still an unsubmitted draft, but it
   can be found in [#thread2014]_.
.. [#java-io] http://docs.oracle.com/javase/7/docs/api/java/io/package-summary.html
.. [#java-readline] http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()
.. [#haskell-readFile] https://hackage.haskell.org/package/base-4.7.0.0/docs/System-IO.html
.. [#issue14017] http://bugs.python.org/issue14017

Copyright
=========

This document has been placed in the public domain.