PEP: XXX
Title: Alternative newlines for file objects
Version: $Revision$
Last-Modified: $Date$
Author: Andrew Barnert <abarnert AT yahoo.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 19-Jul-2014
Python-Version: 3.5
Post-History: 

Abstract
========

This PEP proposes a means to use alternative newlines in both text and
binary files, in place of the usual ``'\n'``, ``'\r'``, and ``'\r\n'``.

Rationale
=========

This idea has previously been discussed in (at least) a 2005 thread on
python-list [#thread2005]_ (not properly threaded; search for
"Canonical way of dealing with null-separated lines?"), issue #1152248
[#issue1152248]_ on the bug tracker, and a 2014 thread on python-ideas
[#thread2014]_.

tl;dr
-----

Python makes it very easy to iterate through lines or other records in
a binary or text file if they're delimited by one of the standard
newline strings, but very hard to do so if they're delimited by any
other string.

Adding a way to override the default newline would make this just as
easy as in other languages (Perl, awk, etc.) that have similar
features.

Use cases
---------

The Unix ``find`` tool, by default, prints a list of filenames, each
on its own line. However, because Unix filenames may contain newline
characters, this can be ambiguous. Therefore, it has an option,
``-print0``, that uses the ASCII null character (``'\0'``) in place of
the newline character (``\n``). While newlines in filenames are rare
on Unix, they may be used for files from other platforms--e.g.,
newlines were commonplace on classic Mac and Palm OS. Also, even when
there's no risk of dealing with newlines in a filename, ``-print0`` is
often useful because ``find`` has no way to escape or quote spaces or
other special characters that many other tools (like ``xargs`` or
``sh``) may treat as separators. Therefore, both the GNU [#gnufind]_
and BSD [#bsdfind]_ manpages recommend using ``-print0`` in place of
the default ``-print`` whenever possible.

Similarly, there are Windows and Mac file formats that use a Unicode
null (``U+0000``) to separate UTF-16 strings, and l10n message catalog
formats that use either 8-bit or Unicode nulls for a similar purpose.

Some file formats use a single newline as a field separator, and a
blank line as a record separator. For example mail-merge address lists
are often stored this way. This is also often used as an example in
awk tutorials like [#awktut]_, to show off awk's special handling for
this case, although it's not clear how often it comes up in real
life. While it is of course relatively easy to write code that translates
lines into records by splitting the iterable on blank lines, it's even
easier to set ``'\n\n'`` as the line separator and just iterate the
records directly.
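
For comparison, the "relatively easy" translation mentioned above might
look something like the following sketch. The helper name
``iter_records`` and the sample data are made up for illustration, and
the sketch treats any run of blank lines as a single separator (closer
to ``'\n\n+'`` than to a strict ``'\n\n'``):

```python
import io
from itertools import groupby

def iter_records(lines):
    """Yield each run of non-blank lines as a list of fields."""
    # Drop the trailing newline so a blank line becomes ''.
    stripped = (line.rstrip("\n") for line in lines)
    # Group consecutive lines by whether they are blank.
    for is_blank, group in groupby(stripped, key=lambda s: s == ""):
        if not is_blank:
            yield list(group)

sample = io.StringIO("Ann\n1 Main St\n\nBob\n2 Oak Ave\n")
records = list(iter_records(sample))
```

This works, but it is one more helper that every such program has to
carry around.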

The gawk manual [#gawkman]_ suggests using formfeed (``\f``) in file
formats that need multiline records that may contain blank lines,
although it's not clear how often people actually do this.

Tools reading files meant for older systems (e.g., emulator wrapper
GUIs) often have to deal with other newlines. For example, early
Acorn/BBC systems used ``'\n\r'``, while EBCDIC used ``'\x15'`` (or,
in text mode, ``'\x85'``).

On issue #1152248, ysj.ray said that he had files that use ``'\t'`` as
a record separator. While this probably isn't a common format, it
serves as an example that shows that there may be many uncommon and
unlikely-seeming formats that users may have to deal with, so we want
a solution general enough to fit all of them (rather than, e.g., just
adding ``'\0'`` to the list of universal newlines).

Can't you just wrap the file?
-----------------------------

Sure. In fact, sometimes you *need* to wrap the file, because you're,
e.g., getting the output of ``find -print0`` on ``sys.stdin``.

The question is, wrap it with what?

The obvious solution is something like this:

::

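    from functools import partial

    def resplit(strings, separator):
        # Holds the trailing piece of the previous chunk, which may be
        # an incomplete line.
        partialLine = None
        for s in strings:
            if partialLine:
                partialLine += s
            else:
                partialLine = s
            if not s:
                break
            lines = partialLine.split(separator)
            # The last piece has no terminator yet; keep it for later.
            partialLine = lines.pop()
            yield from lines
        if partialLine:
            yield partialLine

    with open(path) as f:
        # Read fixed-size chunks until read() returns ''.
        chunks = iter(partial(f.read, 4096), '')
        lines = resplit(chunks, '\0')
        # Re-append the terminator that split() removed.
        lines = (line + '\0' for line in lines)
        for line in lines:
            process(line)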

(You can of course wrap up the file reading with the splitting, wrap
up the terminator-appending with the splitting, skip the
terminator-appending if you're just going to strip it anyway, etc.)

Needless to say, this is not exactly trivial. (In the 2005 thread, it
took three tries to get it right, and the final version there still
doesn't work for binary files.) It also doesn't give you a file
object, it's just a plain old generator that doesn't support the
stream protocol. It also means you now have two iterators (the
resplit, and the file itself) that are both alive and referencing the
same file, but with separate buffers and out-of-sync implicit
positions. And it's slow. (From a quick, unscientific test, an
``_io.TextIOWrapper`` wrapped in this iterator is slower than a
``_pyio.TextIOWrapper``...)

The first of these problems could be solved by just putting something
like ``resplit`` in the stdlib (maybe in ``itertools``?), and that
could conceivably solve the last as well (by coding it in C), but it's
not going to help the other problems.

What about writing the wrapper as a subclass of ``TextIOWrapper`` (or
as a ``TextIOBase`` implementation that delegates to a
``TextIOWrapper``), with its own ``readline`` override? That would
solve all of the problems--but unfortunately, it's at the very least
difficult, if not impossible, to do this well. See the bikeshedding
section and the separate PEP on adding a ``peek`` method
([#pep-peek]_) for further details.

It's worth mentioning that wrapping an already-open file is already
necessary today, and will be exactly as necessary to handle
alternative newlines as it is to handle, e.g., changing encodings
today.

As a side note, it's not as easy as it should be to move the
``buffer`` from one ``TextIOWrapper`` to another (see issue
#14017), but that's a less serious issue, and not a new one raised
by this proposal. (In fact, exposing the *newline* argument as an
attribute, as proposed below, partly solves that existing problem.)

What about files that don't come from ``open`` or the ``io`` module?
--------------------------------------------------------------------

Most people who ask this are asking about something like this:

    I see another problem with doing this by modifying the ``open()``
    call: it does not work for filehandles creates using other methods
    such as ``pipe()``

This is just a misconception of how files work in Python. The file
descriptors you get back from ``os.pipe()`` are not file objects,
they're just integers. These file descriptors don't have a
``readline`` method, or any notion of lines at all; the only way to
read from them is to call ``os.read``. If you want to use such a file
descriptor as a file object, the way you do it is to call
``open(fd)``. In other words, it *does* work for file handles created
using methods such as ``os.pipe()``.
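
As a minimal illustration of that point, the following sketch wraps
both ends of a pipe with ``open`` and then uses the ordinary line
methods. The variable names are arbitrary, and the data is kept small
enough to fit in the pipe buffer so that writing before reading does
not block:

```python
import os

# os.pipe() returns two plain integers, not file objects.
read_fd, write_fd = os.pipe()

# Passing a descriptor to open() gives a real file object.
with open(write_fd, "wb") as writer:
    writer.write(b"first\nsecond\n")

with open(read_fd, "rb") as reader:
    lines = reader.readlines()
```

Anything that is added to ``open`` is therefore available here too.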

However, there are cases where you've got an actual file-like object
handed to you by some other module, something that supports much of
the file protocol but doesn't actually implement any of the ABCs or
otherwise use the ``io`` module. And often, that something won't be
sufficiently close to the ABCs to allow wrapping it up. So (assuming
something like ``resplit`` isn't good enough for your use case), how
does this proposal solve that problem?

The answer is simple: it doesn't, and there's no possible way any
proposal reasonably could. The module has to be rewritten to give you
objects that actually implement one of the ``io`` ABCs, or at least
duck type it closely enough, if you want the ``io`` module to help you
at all. The good news is that such a change will almost trivially add
support for alternative newlines, so there won't be any need to wrap
up the result once the change is made.

Other languages
---------------

Many languages provide a way to change the separator to a single
character, an arbitrary string, or a regular expression, either
globally, for a specific file, or for a single line.

* In AWK [#gawkman]_, setting the ``RS`` variable to any non-empty
  string or (in GNU awk) regular expression makes that string or
  regexp the line terminator for the current input file. Setting it to
  an empty string terminates on a blank line (that is, ``'\n\n'``) in
  some versions, or on any sequence of one or more blank lines (that
  is, ``'\n\n+'``) in others.
* In sed [#sedman]_, the line terminator is always ``'\n'``.
* In Perl [#perldoc]_, the ``$/`` variable, if set, has the exact same
  effect as gawk's ``RS``; if unset, ``'\n'`` is the line
  terminator. This variable can be set globally or locally, with the
  usual Perl scoping rules.
* In Ruby [#rubydoc]_, ``gets`` and similar methods take an optional
  argument ``sep``. If ``sep`` is a non-empty string, that string is
  used as a line terminator; if it's an empty string, any sequence of
  one or more blank lines is the terminator; if it's nil, the entire
  file is read in. The global variable ``$/`` provides a default value
  for ``sep``.
* C/POSIX [#posixdoc]_ ``gets`` uses the platform-specific default
  newline sequence for text files, ``'\n'`` for binary files. (C does
  not require ``'\n'`` to mean ``'\x0a'``, but POSIX does.) No
  alternatives are provided, but the ``getc`` macro is intended to
  be fast enough to loop character by character, and it's reasonably
  well known how to use ``fscanf`` to read up to an arbitrary
  character.
* PHP [#phpdoc]_ ``fgets`` works like C, but it also provides a
  ``stream_get_line`` function that takes an arbitrary string
  ``$ending`` argument.
* Node.js [#nodedoc]_ always uses ``'\n'`` as the line terminator.
* In C++ ``iostreams`` [#cppdoc]_, the ``getline`` function takes any
  single character (``widen``-able to the appropriate type) as a
  delimiter, defaulting to ``'\n'``. (There is no way to handle
  multiple-character separators, but you can use text-mode translation
  to deal with the special case of ``'\r\n'``.) However, the
  ``iostreams`` library was intentionally designed as a number of
  layered components that expose just enough information that it
  should be easy to write and plug in new functionality.
* Java [#javadoc]_ ``readLine`` always accepts any of ``'\n'``,
  ``'\r'``, or ``'\r\n'`` as the line terminator. However, as with
  C++, the ``java.io`` classes are intentionally designed to let you
  build and stack filters to add your own functionality.
* .NET [#dotnetdoc]_ ``ReadLine`` always accepts any of ``'\n'``,
  ``'\r'``, or ``'\r\n'`` as the line terminator.
* Haskell [#haskelldoc]_ ``mkFileHandle`` takes a ``NewlineMode`` at
  file construction time. However, you typically read the whole file as
  a lazy string, and use lazy split functions and the like rather than
  calling anything like ``readline``.
* Cocoa [#cocoadoc]_ has no line reading or buffering; you're expected
  to do it manually or use C stdio (or just read the whole file in at
  once, which there's a zillion ways of doing).

Specification
=============

Except for the ``open`` function [#open]_, all of the changes are
within the ``io`` module [#io]_. (In CPython, this is actually the ``_io``
package of C extension modules and the ``_pyio`` Python module; the
``io`` module itself is just a wrapper, which won't need any changes.)

``open``
--------

The ``open`` function currently takes a *newline* argument, but raises
if given a non-``None`` value for a binary file. This will change to
just pass the value along.
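
The current behavior can be seen with a short sketch like the one
below. The temporary file and its name are only there to make the
example self-contained; the point is the ``ValueError`` raised for a
non-``None`` *newline* in binary mode:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.bin")
    with open(path, "wb") as f:
        f.write(b"a\0b\0")
    try:
        # Binary mode plus an explicit newline is rejected today.
        open(path, "rb", newline="\n")
    except ValueError as exc:
        error = exc
    else:
        error = None
```

Under this proposal, the same call would instead hand the value to the
buffered reader.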

The documentation currently says:

    *newline* controls how *universal newlines* mode works (it only
    applies to text mode). It can be ``None``, ``''``, ``'\n'``,
    ``'\r'``, and ``'\r\n'``. It works as follows:

    When reading input from the stream, if *newline* is ``None``,
    universal newlines mode is enabled. Lines in the input can end in
    ``'\n'``, ``'\r'``, or ``'\r\n'``, and these are translated into
    ``'\n'`` before being returned to the caller. If it is ``''``,
    universal newlines mode is enabled, but line endings are returned
    to the caller untranslated. If it has any of the other legal
    values, input lines are only terminated by the given string, and
    the line ending is returned to the caller untranslated.

    When writing output to the stream, if *newline* is ``None``, any
    ``'\n'`` characters written are translated to the system default
    line separator, ``os.linesep``. If *newline* is ``''`` or
    ``'\n'``, no translation takes place. If *newline* is any of the
    other legal values, any ``'\n'`` characters written are translated
    to the given string.

It will instead say:

    *newline* controls the line separator, and (for text mode) how
    *universal newlines* mode works. It can be ``None``, or any string
    (for text mode) or byte string (for binary mode). It works as
    follows:

    When reading input from the stream, if *newline* is ``None``, the
    behavior depends on the mode. For text mode, universal newlines
    mode is enabled. Lines in the input can end in ``'\n'``, ``'\r'``,
    or ``'\r\n'``, and these are translated into ``'\n'`` before being
    returned to the caller. For binary mode, lines can only end in
    ``b'\n'``. If it is an empty string, in text mode, universal
    newlines mode is enabled, while in binary mode, only ``b'\n'`` is
    a line ending. In either mode, line endings are returned to the
    caller untranslated. If it has any of the other legal values,
    input lines are only terminated by the given string, and the line
    ending is returned to the caller untranslated.

    When writing output to the stream, for binary files, *newline* is
    ignored. For text files, it controls output translation. If
    *newline* is ``None``, any ``'\n'`` characters written are
    translated to the system default line separator,
    ``os.linesep``. If *newline* is ``''`` or ``'\n'``, no translation
    takes place. If *newline* is any of the other legal values, any
    ``'\n'`` characters written are translated to the given string.

In particular, for text modes, the *newline* argument is passed only
to the ``TextIOWrapper`` (as is already true), and for binary modes
it's passed only to the ``BufferedReader``, ``BufferedWriter``, or
``BufferedRandom`` (instead of raising).

``IOBase``
----------

``IOBase`` will grow a new attribute:

* ``newline`` (usually the *newline* value passed to ``open``) will be
  used for recognizing line terminators. This is not part of the
  ``IOBase`` API and may not exist in some implementations.

For all of the concrete classes in ``io``, ``newline`` will be
present, and read-only, and immutable (e.g., ``bytes`` or ``str``).

``IOBase`` is also where ``readline`` is documented, and it also
provides a default mixin implementation that's used by all of the
binary file types in the module (and many elsewhere). The
documentation currently says:

    The line terminator is always ``b'\n'`` for binary files; for text
    files, the *newline* argument to ``open()`` can be used to select
    the line terminator(s) recognized.

It will instead say:

    The ``newline`` attribute, if present, will select the line
    terminator(s) recognized, as explained in ``open()``. If not
    present, the default is ``b'\n'`` for binary files, or any of
    ``'\n'``, ``'\r'``, or ``'\r\n'`` for text files.

The changes to the implementation of this function are pretty obvious:
search for ``self.newline`` if present, ``b'\n'`` otherwise, instead of
always ``b'\n'``, and make sure to add ``len(newline)`` instead of
``1`` when found. (Both the Python and C implementations have clever
tricks that implicitly rely on the ``1``. The Python version uses
``(readahead.find(b"\n") + 1) or len(readahead)``, assuming that ``-1 +
1`` is falsey; the C version uses a ``++`` in the line that checks for
``'\n'`` so it doesn't have to increment if found. Although these
shortcuts do make the code a little briefer, they don't seem to have
any performance benefits, and make it a little harder to understand,
so there's no great loss in giving them up.)
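
As a rough sketch of the intended behavior (not the actual ``_pyio`` or
C code), a terminator-aware ``readline`` could look like the function
below. It is written as a free function with a hypothetical name,
``readline_with_newline``, and it deliberately reads one byte at a time
rather than using ``peek``, so that a multi-byte terminator straddling
a buffer boundary needs no special handling:

```python
import io

def readline_with_newline(stream, newline=b"\n", size=-1):
    """Read up to and including `newline` from a binary stream."""
    res = bytearray()
    while size < 0 or len(res) < size:
        b = stream.read(1)  # byte at a time: simple, not fast
        if not b:
            break  # end of stream
        res += b
        # endswith() handles terminators of any length.
        if res.endswith(newline):
            break
    return bytes(res)

stream = io.BytesIO(b"a\0bc\0d")
lines = list(iter(lambda: readline_with_newline(stream, b"\0"), b""))
```

A real implementation would keep the ``peek``-based readahead for
speed; this version only pins down what "line" means for an arbitrary
terminator.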

Binary file classes
-------------------

``RawIOBase`` and ``BufferedIOBase`` need no changes.

``FileIO``, ``BufferedReader``, and ``BytesIO`` will now take a
*newline* argument, convert it to ``bytes`` if necessary, store it in
struct member ``self->newline`` (C) or attribute ``self._newline``
(Python), and expose it as a read-only property ``self.newline``.

``BufferedWriter`` will take a *newline* argument and ignore it.

``BufferedRWPair`` will take a *newline* argument and pass it to the
reader and writer objects it constructs.

``BufferedRandom`` will take a *newline* argument and pass it to its
reader and writer supers.

Text file classes
-----------------

``TextIOBase`` needs no changes.

``TextIOWrapper`` currently does almost everything we want, except
that the constructor raises a ``ValueError`` if the *newline* argument
is a string, but ``not in ('', '\n', '\r', '\r\n')``. This check will
be removed.
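
The check in question is easy to observe today. The helper name
``try_newline`` below is made up for this sketch; it simply reports
whether the constructor accepts a given *newline* value:

```python
import io

def try_newline(newline):
    """Return True if TextIOWrapper accepts this newline value."""
    try:
        io.TextIOWrapper(io.BytesIO(b"a\tb\t"), encoding="ascii",
                         newline=newline)
    except ValueError:
        return False
    return True

# Map each candidate value to whether it is currently accepted.
accepted = {nl: try_newline(nl) for nl in ("", "\n", "\r", "\r\n", "\t")}
```

Only the last value, ``'\t'``, is rejected at present.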

For consistency, it will also expose ``self->readnl`` (C) or
``self._readnl`` (Python) as read-only property ``self.newline``.

``StringIO`` inherits all relevant behavior from ``TextIOWrapper``,
and needs no changes.

Outside the ``io`` module
-------------------------

There are many other classes, both in the stdlib and elsewhere, that
implement the stream API, and most of them would of course benefit
from adding the same *newline* parameter. But none of them are required
to do so.

Some classes may just work with the changes; others will not. But
for those that don't, it will be obvious that they either don't have
any way to provide a *newline* argument, or raise when given one.

At any rate, changes to the rest of the stdlib are outside the scope
of this proposal; if needed, they can be done as separate enhancements
later (just as they will be in third-party libraries).

The shed
========

Where to specify the alternative newline
----------------------------------------

This is the biggest area of contention.

There are a few places the newline could be specified:

* A magic global or local variable (as in awk, Perl, and Ruby).
* In each call to ``readline`` and friends (as in Ruby, PHP, and C++).
* In an attribute of the file object (as... nowhere?).
* At file object constructor time (as in Haskell and C++ with Boost).

The first is obviously a non-starter for Python.

Adding an optional parameter to the ``readline`` method seems
attractive at first, but has two serious problems. First, you can't
pass the same parameter to ``__iter__``. Second, this would mean that
the hundreds of existing file objects that either implement one of the
``IOBase`` ABCs or duck type the implicit stream protocol are no
longer valid file objects--and, worse, there's no good way for a
consumer to check whether a given file object can take a *newline*
parameter to its ``readline`` method.

Adding new methods--e.g., ``readrecord``, ``iterrecords``, and
``readrecords`` to parallel ``readline``, ``__iter__``, and
``readlines``--solves the first problem, and ameliorates the second
(at least you can check with ``hasattr`` or EAFP), but it's still a
serious break with the API to add new methods that existing file
objects don't support. (Also, while this is a much more minor problem,
for ``iterrecords`` to work, unlike ``__iter__``, it has to return a
new iterator object that references the file.)

Adding an attribute to the file object avoids some of these
problems. However, changing the line terminator on the fly seems to be
at least as potentially confusing to the user, and as limiting to the
implementation, as changing the encoding; Python doesn't allow the
latter, so it shouldn't allow the former.

Adding a parameter to the constructors solves all of these
problems. Construction is not part of the stream protocol; nobody
expects that opening a file on disk, wrapping a transport in a file,
opening a file within an archive, etc. will have the same interface,
and in general, they don't. The ``__iter__`` method will work exactly
as expected, with no changes.

The difference between adding a *newline* argument to ``readline``
instead of the constructor may not seem obvious, but consider where
you'd find the problem if you tried to use a file-like object that
didn't support it. With the constructor change, before even writing
your code, while looking up the construction syntax for creating the
object, you'd discover that it has no *newline* argument. With the
``readline`` change, you'd write the code and then, later, get an
error far away in your code (or in some library that you use, or in
some customer's code that uses yours) because you have an object that
looks like a file object, and claims to be a ``BufferedReader``, but
has the wrong signature on its ``readline`` method.

Text files or binary files?
---------------------------

During the previous discussions, some people have been convinced that
this feature is obviously needed only for binary files, and makes no
sense for text files--after all, files with ``'\0'`` characters are
obviously not text.

However, that doesn't really work. The output of ``find`` is a list of
filenames, and it makes perfect sense to open it with
``encoding=sys.getfilesystemencoding()`` as a text file. Adding the
``-print0`` argument shouldn't change that; it's still a list of
filenames in the same encoding, they're just separated by an encoded
null instead of an encoded newline. (It's true that ``find`` is
separating encoded filenames with the ``'\0'`` byte, not whatever that
happens to decode to, but it's no accident that the ``'\0'`` byte
always means null in any charset that can be used by a filesystem
under Unix.)

Also, consider the case of UTF-16-LE strings with ``U+0000``
terminators. If there's an ASCII character followed by a newline, the
second half of the ASCII character and the first half of the newline
will be ``'\0\0'``, so the line will be split one character early,
leaving half a character in each line. There's really no sensible way
to read such a file except by decoding before splitting lines--that
is, by reading it as a text file.
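
The problem is easy to reproduce with a couple of short strings. In
this sketch the sample text is arbitrary; the first split works on the
encoded bytes, the second on the decoded text:

```python
# Two null-terminated strings, encoded as UTF-16-LE.
data = "ab\0cd\0".encode("utf-16-le")

# Splitting the raw bytes on an encoded null goes wrong: the high
# byte of 'b' pairs up with the low byte of the terminator.
naive = data.split(b"\0\0")

# Decoding first and then splitting gives the intended result.
decoded = data.decode("utf-16-le").split("\0")
```

Here the first element of ``naive`` has an odd number of bytes, so it
cannot even be decoded as UTF-16.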

Conversely, a few people were convinced that this feature only makes
sense for text files, because binary files don't do newline
conversion--or, more fundamentally, they don't have a concept of
lines, so how can they have a concept of newlines?

The fact that binary files have a ``readline`` method in the first
place already implies that the concept makes sense.

Also, there are plenty of "binary" file formats that are basically
ASCII text.

Consider an HTTP response: it's an ASCII status line and headers with
``'\r\n'`` newlines, followed by a blank line, then an arbitrary body
(which is often a text file in a different encoding, and with
different newlines). While you usually can get away with parsing it as
a binary file with ``'\n'`` line endings and stripping off the
``'\r'`` (which many people do today in Python--and, for that matter,
in PHP), it's obviously *more* reasonable to use the proper line
ending, not *less*.
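
The workaround described above, as it is commonly written today, looks
roughly like this sketch. The response bytes are invented for the
example:

```python
import io

response = io.BytesIO(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"body bytes"
)

headers = []
# readline() splits on b'\n' only, so each line still carries b'\r'.
for raw in iter(response.readline, b""):
    line = raw.rstrip(b"\r\n")
    if not line:
        break  # blank line: end of headers
    headers.append(line)
body = response.read()
```

It works, but only because the ``'\r'`` is stripped after the fact
rather than being treated as part of the terminator.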

Output newline translation
--------------------------

Opening text files with a specific *newline* value not only changes
the line separator for input, it also causes any ``'\n'`` characters
to be translated to the given string for output. Should the same be
true for binary files?

On the one hand, it seems simpler to make the behavior identical for
binary and text files than to try to explain how they're different.

On the other hand, it doesn't seem like a good idea to add write
translation for binary files, which they have never had, and which
seems conceptually wrong.

This proposal chooses the latter, because it's a smaller and
hopefully less surprising change.

Input newline translation
-------------------------

Some people have suggested that getting strings or byte strings back
from ``readline`` that don't end in ``'\n'`` could be confusing. After
all, the fact that the file was opened with ``newline='\0'`` could be
very distant in the code from where the ``readline`` call happens.

The first thing to notice is that the same problem already exists. If
you open a file with ``newline='\r'``, your lines will end with
``'\r'``, not ``'\n'``. Code that can't deal with that will have the
exact same problem dealing with ``'\0'``. The fact that some code gets
away with it by sloppily calling ``rstrip()`` on each line (which
happens to work on ``'\r'``, along with truncating any other trailing
whitespace, which is probably a bug...) doesn't mean we want to
encourage that.
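
For instance, the following sketch (with made-up sample data) shows
what already happens with ``newline='\r'``, and how a bare ``rstrip()``
differs from removing just the terminator:

```python
import io

raw = io.BytesIO(b"one \rtwo\r")
text = io.TextIOWrapper(raw, encoding="ascii", newline="\r")

# Lines come back ending in '\r', not '\n'.
lines = list(text)

# rstrip() also eats the meaningful trailing space in the first line.
sloppy = [line.rstrip() for line in lines]

# Removing only the known terminator preserves the content.
careful = [line[:-1] if line.endswith("\r") else line for line in lines]
```

Code that handles ``'\r'`` correctly in this way would handle ``'\0'``
just as well.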

Multiple characters?
--------------------

A few people have raised the issue that multiple-character separators
are both harder to deal with in the implementation, and harder to
think about for the user.

However, ``'\r\n'`` pretty obviously needs to be handled.

Also, ``'\n\n'`` is a common enough use case for awk, Perl, and Ruby
all to have special-casing to deal with it.

Stream stacks
-------------

A typical file object is a stack of streams. For example, opening a
file in ``"r"`` mode constructs a ``FileIO`` (a ``RawIOBase`` instance),
wraps that in a ``BufferedReader`` (a ``BufferedIOBase`` instance),
then wraps that in a ``TextIOWrapper`` (a ``TextIOBase`` instance).
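
The layers can be inspected directly through the ``buffer`` and ``raw``
attributes, as in this sketch (the temporary file exists only to give
``open`` something to read):

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as f:
        f.write("hello\n")
    with open(path, "r") as f:
        # Outermost to innermost: text layer, buffer, raw file.
        layers = [type(f), type(f.buffer), type(f.buffer.raw)]
```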

So, when opening a file in ``"r"`` mode with a non-default *newline*
value, should the newline be passed to all three objects (or, rather,
``newline.encode(encoding)`` passed to the first two), so that
``f.buffer.readline()`` or ``f.buffer.raw.readline()`` will use the
same separator as ``f.readline()``?

While that seems attractive at first, it may not be a good
idea. Consider the example of null-separated UTF-16 again; passing
``b'\0\0'`` to the ``FileIO`` and ``BufferedReader`` is not going to
work as expected. Also, it's extra complexity that will probably
rarely if ever be used.

Making delegation/inheritance easier instead
--------------------------------------------

As Guido suggested twice, and others suggested as well, the ideal way
to solve this problem would be either:

* Rewrap the file in a subclass of ``TextIOWrapper`` or
  ``BufferedReader`` that adds the extra functionality (overridden
  ``readline`` with or without an extra argument, separate
  ``readuntil``, whatever you prefer).

* Wrap the file in a class that adds the extra functionality and
  delegates everything else to ``TextIOWrapper`` or
  ``BufferedReader``.

Unfortunately, the design of the ``io`` module makes that difficult.

Briefly, the main problem is that ``TextIOWrapper.readline`` needs
access to the internals of its buffer, and there doesn't seem to be
any remotely simple or efficient way to implement ``readline`` short
of reimplementing all of ``TextIOWrapper``. And of course in CPython,
you'd have to do at least a large part of that in C for performance;
if the pure-Python implementation of ``io.TextIOWrapper`` was too slow
to be acceptable, you're probably not going to do significantly
better.

See [#pep-peek]_ for a possible solution to this problem, and further
discussion.

Unfortunately, Guido seems to be strongly against this idea. (For
example: "I never meant to suggest anything that would require pushing
back data into the buffer.") So, I haven't tried to push it.

Also, even ignoring Guido, fundamentally changing the API of the
``io`` module is obviously not something to be undertaken lightly; if
it's worth doing at all, it's probably worth re-evaluating the design
to make sure ``peek`` is really sufficient for everything that might
come up in the future, rather than just solving this one use case and
possibly having to change the whole design again two years from now.

Storing the newline value
-------------------------

Most binary file types inherit their ``readline`` implementation from
``IOBase``, so ``IOBase`` will need to be able to access the *newline*
value passed to the constructor.

Since the ABCs have no public constructor, the obvious way to make the
*newline* value available is to expose it as a member, which
``readline`` will use if present (defaulting to ``b'\n'`` if
not). There are already a number of such members and methods defined
in the ABCs' documentation.

To prevent users from changing this value in mid-stream, it should be
a read-only attribute, and also immutable, but of course there's no
need for code to verify that for third-party classes (consenting
adults and all that).

So, the concrete binary classes in ``io`` will just store the
constructor argument in a private variable, copying it to ``bytes`` if
given a ``bytearray`` or other mutable bytes-like object, and expose
it as a read-only property.
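
In Python terms, the storage scheme might be sketched as below. The
class name ``NewlineBytesIO`` is hypothetical, and this subclass only
shows the attribute handling; it does not change ``readline``:

```python
import io

class NewlineBytesIO(io.BytesIO):
    """Illustrative only: stores *newline* as described above."""

    def __init__(self, initial_bytes=b"", newline=None):
        super().__init__(initial_bytes)
        # Copy to bytes so later mutation of the argument has no effect.
        self._newline = None if newline is None else bytes(newline)

    @property
    def newline(self):
        # No setter is defined, so the attribute is read-only.
        return self._newline

mutable = bytearray(b"\0")
stream = NewlineBytesIO(b"a\0b\0", newline=mutable)
mutable[0] = 0x0A  # does not affect the stored value
```

Attempting ``stream.newline = b'\n'`` raises ``AttributeError``, which
is the "read-only" half of the requirement.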

``TextIOWrapper`` does not inherit its ``readline``, but for
consistency it might as well change to using a public ``newline``
attribute instead of a private variable as it does today. This also
solves part of issue #14017, as mentioned earlier.

Simplified rewrapping
---------------------

Since adding a ``newline`` attribute partially solves the problem of
rewrapping files, and the use cases for this change may sometimes
require rewrapping files, this might be a perfect time to completely
solve the problem: add the other missing attribute, ``write_through``,
or add the ``rewrap`` method suggested in issue #14017 (and also add
it to the ``BufferedFoo`` classes, presumably).

While those are both good ideas, they don't really seem to be in scope
for this change.

Fixing the entire stdlib
------------------------

As long as there are third-party file-like objects, there will be
file-like objects that don't support this new behavior. That's why it
was important to make the change outside of the API in the first
place. So, I don't think it's necessary to fix all file-like objects
in the stdlib. Adding *newline* support to any such type would be an
enhancement that could be done independently, for the stdlib as for
any third-party project.

But, in case others disagree, I made a quick survey of classes that
inherit one of the ``io`` classes or implement ``readline`` to get an
overview of the situation. All the ones I looked at fall into one of
these categories:

* Call ``open`` (or some other function like ``socket.makefile`` that
  would presumably already be handled) and delegate ``readline`` to
  the result (e.g., ``tempfile.NamedTemporaryFile``): should just
  work, except that some explicitly validate the *newline* argument,
  and need to stop.
* Inherit from ``TextIOWrapper`` and use its ``readline`` implementation: 
  should just work.
* Inherit from one of the binary concrete classes or ABC mixins and
  use its ``readline`` (e.g., ``socket.makefile``): just need to take
  a *newline* parameter and pass it along.
* As above, but override ``readline`` to add a fast-path optimization
  that's used in certain cases (e.g., ``bz2.BZ2File``): also need to
  skip the fast path ``if self.newline is not None``. (By the way,
  from a quick test, the fast path is not actually faster here, and
  I'm guessing the same is true for any similar class that provides a
  working ``peek`` method, especially those implemented in Python. Of
  course that probably wasn't true in 3.0, but I'm guessing nobody
  tested after the ``io`` overhaul. So in some cases, it might be
  worth just removing the override entirely.)
* As above, but provide a more complicated override of ``readline``
  (e.g., ``zipfile.ZipExtFile``): every example I've found is weird or
  buggy in some way--e.g., tries to implement 2.x-style universal
  newlines even on binary files. I'm not sure what it would even mean
  to add *newline* support to such a class. Should it try to fit in
  with the existing text-like behavior on binary files by treating
  *newline* the way text files do? Turn off the universal newline
  support?
* Not used as a file, or only used internally within the module (e.g.,
  ``email.feedparser``): no change needed.
* Only somewhat file-like (e.g., ``codecs.StreamReader``,
  ``fileinput``): obviously these aren't going to be able to inherit
  behavior from the ``io`` classes, and it's arguable whether they
  should. That being said, at least for ``fileinput``, a *newline*
  argument could be useful, and wouldn't be that hard to
  implement. (It just calls ``open`` on each file and reads its lines
  normally.)
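
As a rough illustration of the fast-path category above, the following
sketch shows a binary reader that takes *newline* in its constructor
and skips its fast path whenever an alternative newline is set. The
class name ``RecordReader`` and the byte-at-a-time fallback are
assumptions made for this example only; under the proposal, the
fallback would presumably just be the inherited ``readline``.

.. code-block:: python

    import io


    class RecordReader(io.BufferedReader):
        """Hypothetical reader; not part of the stdlib or this proposal."""

        def __init__(self, raw, newline=None,
                     buffer_size=io.DEFAULT_BUFFER_SIZE):
            super().__init__(raw, buffer_size)
            if newline is not None and len(newline) == 0:
                raise ValueError("newline must be None or non-empty bytes")
            self.newline = newline

        def readline(self, size=-1):
            if self.newline is None:
                # Fast path: the existing b'\n'-only implementation.
                return super().readline(size)
            if size is None:
                size = -1
            # Generic path, emulated here one byte at a time because
            # released versions of io have no alternative-newline support.
            line = bytearray()
            while size < 0 or len(line) < size:
                byte = self.read(1)
                if not byte:
                    break
                line += byte
                if line.endswith(self.newline):
                    break
            return bytes(line)


    names = RecordReader(io.BytesIO(b"a\nb\0c\0"), newline=b"\0")
    print(list(names))  # iteration goes through the overridden readline()

Because the delimiter is fixed at construction time, callers keep
using plain ``readline`` and iteration, and nothing in the file API
itself changes.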

The only exception I found was ``gzip.GzipFile``, which implements the
``io`` ABCs, but provides its own custom ``readline`` from scratch
rather than delegating to the superclass. I'm not sure this is
actually necessary, but, if it is, that class won't gain the *newline*
feature except by implementing it manually.

What about just adding ``'\0'``?
--------------------------------

Multiple people suggested that since ``'\0'`` is the main example
everyone keeps coming up with, maybe we just need to add that to the
list of valid newlines, rather than open it up to all possible strings.

First, as shown at the top, there are other examples; ``'\0'`` may be
the first one that comes to mind, but it's not the only one.

Second, it doesn't really simplify things, either in the documentation
or in the implementation. We still need to add alternative newline
support for binary files; we still need to change any text files that
check their *newline* argument before passing it on to the
superclass, or that do ``readline`` manually; we still need to (and
already can) handle multi-character newlines because of ``'\r\n'``;
etc.

Finally, adding just ``'\0'`` seems more likely to add to
confusion. For example, if there are exactly four valid non-empty
values for *newline*, and three of those four are also the endings
handled by universal newlines, a reader could easily be forgiven for
expecting ``'\0'`` to also be handled by universal newlines. If,
instead, any non-empty string is allowed for *newline*, no one will be
confused by the list of three universal-newlines endings.

Why not a different parameter instead of reusing *newline*?
-----------------------------------------------------------

For text files, *newline* already controls how lines are
terminated. If we added another parameter that did the exact same
thing, how would they interact?

If we were redesigning the ``io`` module from scratch, maybe it would
be better to separate out the universal-newlines flag and the input
line terminator string. But there doesn't seem to be a clean way to
make that change without breaking all existing uses of *newline*.
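
The two roles that *newline* already plays for text files can be seen
with the current ``io`` module. In this sketch, ``read_lines`` is a
helper name made up for the example, and the behavior shown is that of
released Python 3, not of this proposal.

.. code-block:: python

    import io


    def read_lines(data, newline):
        """Decode *data* as ASCII text and split it using *newline*."""
        text = io.TextIOWrapper(io.BytesIO(data), encoding="ascii",
                                newline=newline)
        return text.readlines()


    data = b"a\r\nb\rc\n"

    # None: universal newlines, with every ending translated to '\n'.
    print(read_lines(data, None))
    # '': universal newlines, but endings are returned untranslated.
    print(read_lines(data, ""))
    # '\r': only '\r' terminates a line; universal newlines are off.
    print(read_lines(data, "\r"))

    # Any other string is currently rejected with ValueError.
    try:
        read_lines(data, "\0")
    except ValueError as exc:
        print("rejected:", exc)

A second parameter for the terminator would have to define what
happens for each of these existing values, which is the interaction
problem described above.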

Possibly-relevant BDFL comments
===============================

Guido said that he doesn't have time to follow the discussion, but
wants to make sure all of his concerns are met. So, this section lists
the concerns he's raised, in hopes that readers can point out any that
haven't been answered.

* "I don't think it is reasonable to add a new parameter to
  readline(), because streams are widely implemented using duck typing
  -- every implementation would have to be updated to support this."
  Later comments echoed the same idea; asking everyone to add an
  additional method to their file-like object classes like
  ``readuntil``, or to modify an existing method like ``readline``, is
  "unreasonable", and also breaks an established API. It's hard to
  argue with this one, which is exactly why the proposal suggests
  putting the parameter outside the API, in the ``open``/``__init__``
  methods, instead of changing the API. (Credit to Alexander Heger for
  first suggesting this solution on the email thread, and to R. David
  Murray for suggesting the same thing on the bug tracker.)

* "I don't like changing the meaning of the newline argument to open
  (and it doesn't solve enough use cases any way)." This is discussed
  above; any separate new feature would have to interact in some
  hard-to-explain way with the existing feature, and I think that
  would just mean more confusion, not less. (For the parenthetical,
  I'm not sure what use cases it fails to solve.)

* The newline argument is pretty obscure and not very well
  known. ("Well, I had to look up the newline option for open(), even
  though I probably invented it.") Is it a good idea to add
  functionality in a place nobody knows to look? I don't have a great
  answer here, except that a new parameter isn't likely to be any more
  discoverable than an existing one. (Also see the previous question.)

* "I personally think it's preposterous to use ``\0`` as a separator
  for text files (nothing screams binary data like a null byte :-)."
  This is discussed above, but consider the paradigm case of a bunch
  of UTF-8 filenames or, worse, UTF-16-LE translation strings
  separated by ``'\0'`` instead of ``'\n'`` in case some of them have
  embedded ``'\n'`` characters. It's clearly text, and trying to
  process it as binary data makes things more difficult.

* "I don't think it's a big deal if a method named readline() returns
  a record that doesn't end in a ``\n`` character."

* "I value the equivalence of __next__() and readline()." I think this
  is less important to me than it is to Guido, but it is an argument
  for not putting the separator in the ``readline`` call.

* "I still think you should solve this using a wrapper class (that
  does its own buffering if necessary, and implements the rest of the
  stream protocol for the benefit of other consumers of some of the
  data)." This is the sticking point. I would love to solve it that
  way, but I can't without a change to the ``io`` module that I think
  he doesn't want.
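
For reference, one possible reading of the wrapper suggestion in the
last bullet is sketched below, reduced to a generator function. The
name ``iter_records`` and its ``chunk_size`` parameter are
hypothetical, and this is only a minimal version that does its own
buffering; it does not implement the rest of the stream protocol.

.. code-block:: python

    import io


    def iter_records(fileobj, separator, chunk_size=8192):
        """Yield records from *fileobj*, split on *separator*.

        Works for text or binary files, as long as *separator* has the
        same type as the data. Records are yielded without the separator.
        """
        buf = separator[:0]  # empty str or bytes, matching the separator
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            # Everything before the last separator is complete; the tail
            # may be a partial record (or a partial separator).
            *records, buf = buf.split(separator)
            yield from records
        if buf:
            yield buf


    for name in iter_records(io.BytesIO(b"a\nb\0c\0"), b"\0"):
        print(name)

Note that the wrapper reads ahead in chunks, so the position of the
underlying file runs past the records that have actually been
yielded.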

References
==========

.. [#issue1152248] http://bugs.python.org/issue1152248
.. [#thread2014] http://thread.gmane.org/gmane.comp.python.ideas/28310
.. [#thread2005] https://mail.python.org/pipermail/python-list/2005-February/
.. [#gnufind] http://linux.die.net/man/1/find
.. [#bsdfind] http://www.freebsd.org/cgi/man.cgi?find(1)
.. [#rename] https://github.com/ap/rename/blob/master/rename#L179
.. [#awktut] http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
.. [#issue14017] http://bugs.python.org/issue14017
.. [#pep-peek] At this point, still an unsubmitted draft, but it can
               be found in [#thread2014]_.
.. [#gawkman] https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html
.. [#sedman] http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html
.. [#perldoc] http://perldoc.perl.org/perlvar.html
.. [#rubydoc] http://www.ruby-doc.org/core-2.1.2/IO.html#method-i-gets
.. [#nodedoc] http://nodejs.org/api/readline.html#readline_event_line
.. [#cppdoc] http://www.cplusplus.com/reference/string/string/getline/
.. [#posixdoc] http://pubs.opengroup.org/onlinepubs/9699919799/functions/gets.html
.. [#phpdoc] http://us3.php.net/manual/en/function.stream-get-line.php
.. [#javadoc] http://docs.oracle.com/javase/6/docs/api/java/io/BufferedReader.html#readLine()
.. [#dotnetdoc] http://msdn.microsoft.com/en-us/library/system.io.streamreader.readline(v=vs.110).aspx
.. [#haskelldoc] http://hackage.haskell.org/package/base-4.7.0.1/docs/GHC-IO-Handle.html#v:mkFileHandle
.. [#cocoadoc] https://developer.apple.com/library/mac/documentation/Cocoa/Reference/Foundation/Classes/NSFileHandle_Class/Reference/Reference.html#//apple_ref/occ/instm/NSFileHandle/readDataOfLength:
.. [#open] https://docs.python.org/3/library/functions.html#open
.. [#io] https://docs.python.org/3/library/io.html

Copyright
=========

This document has been placed in the public domain.