PEP: XXX
Title: Alternative newlines for file objects
Version: $Revision$
Last-Modified: $Date$
Author: Andrew Barnert
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 19-Jul-2014
Python-Version: 3.5
Post-History:

Abstract
========

This PEP proposes a means to use alternative newlines in both text and binary files, in place of the usual '\n', '\r', and '\r\n'.

Rationale
=========

This idea has previously been discussed in (at least) a 2005 thread on python-list [#thread2005]_ (not properly threaded; search for "Canonical way of dealing with null-separated lines?"), issue #1152248 [#issue1152248]_ on the bug tracker, and a 2014 thread on python-ideas [#thread2014]_.

tl;dr
-----

Python makes it very easy to iterate through lines or other records in a binary or text file if they're delimited by one of the standard newline strings, but very hard to do so if they're delimited by any other string. Adding a way to override the default newline would make this just as easy as in other languages (Perl, awk, etc.) that have similar features.

Use cases
---------

The Unix ``find`` tool, by default, prints a list of filenames, each on its own line. However, because Unix filenames may contain newline characters, this can be ambiguous. Therefore, it has an option, ``-print0``, that uses the ASCII null character (``'\0'``) in place of the newline character (``'\n'``).

While newlines in filenames are rare on Unix, they may be used for files from other platforms--e.g., newlines were commonplace on classic Mac and Palm OS.

Also, even when there's no risk of dealing with newlines in a filename, ``-print0`` is often useful because ``find`` has no way to escape or quote spaces or other special characters that many other tools (like ``xargs`` or ``sh``) may treat as separators. Therefore, both the GNU [#gnufind]_ and BSD [#bsdfind]_ manpages recommend using ``-print0`` in place of the default ``-print`` whenever possible.
Similarly, there are Windows and Mac file formats that use a Unicode null (``U+0000``) to separate UTF-16 strings, and l10n message catalog formats that use either 8-bit or Unicode nulls for a similar purpose.

Some file formats use a single newline as a field separator, and a blank line as a record separator. For example, mail-merge address lists are often stored this way. This is also often used as an example in awk tutorials like [#awktut]_, to show off awk's special handling for this case, although it's not clear how often it comes up in real life. While it is of course relatively easy to write code that translates lines into records by splitting the iterable on blank lines, it's even easier to set ``'\n\n'`` as the line separator and just iterate the records directly.

The gawk manual [#gawkman]_ suggests using formfeed (``\f``) in file formats that need multiline records that may contain blank lines, although it's not clear how often people actually do this.

Tools reading files meant for older systems (e.g., emulator wrapper GUIs) often have to deal with other newlines. For example, early Acorn/BBC systems used ``'\n\r'``, while EBCDIC used ``'\x15'`` (or, in text mode, ``'\x85'``).

On issue #1152248, ysj.ray said that he had files that use ``'\t'`` as a record separator. While this probably isn't a common format, it serves as an example that shows that there may be many uncommon and unlikely-seeming formats that users may have to deal with, so we want a solution general enough to fit all of them (rather than, e.g., just adding ``'\0'`` to the list of universal newlines).

Can't you just wrap the file?
-----------------------------

Sure. In fact, sometimes you *need* to wrap the file, because you're, e.g., getting the output of ``find -print0`` on ``sys.stdin``. The question is, wrap it with what?
The obvious solution is something like this:

::

    from functools import partial

    def resplit(strings, separator):
        # Carry the unterminated tail of each chunk over to the next one.
        partialLine = None
        for s in strings:
            if partialLine:
                partialLine += s
            else:
                partialLine = s
            if not s:
                break
            lines = partialLine.split(separator)
            partialLine = lines.pop()
            yield from lines
        # Whatever remains is a final, unterminated line.
        if partialLine:
            yield partialLine

    with open(path) as f:
        # Read fixed-size chunks until read() returns the empty string.
        chunks = iter(partial(f.read, 4096), '')
        lines = resplit(chunks, '\0')
        lines = (line + '\0' for line in lines)
        for line in lines:
            process(line)

(You can of course wrap up the file reading with the splitting, wrap up the terminator-appending with the splitting, skip the terminator-appending if you're just going to strip it anyway, etc.)

Needless to say, this is not exactly trivial. (In the 2005 thread, it took three tries to get it right, and the final version there still doesn't work for binary files.) It also doesn't give you a file object; it's just a plain old generator that doesn't support the stream protocol. It also means you now have two iterators (the resplit, and the file itself) that are both alive and referencing the same file, but with separate buffers and out-of-sync implicit positions. And it's slow. (From a quick, unscientific test, an ``_io.TextIOWrapper`` wrapped in this iterator is slower than a ``_pyio.TextIOWrapper``...)

The first of these problems could be solved by just putting something like ``resplit`` in the stdlib (maybe in ``itertools``?), and that could conceivably solve the last as well (by coding it in C), but it's not going to help the other problems.

What about writing the wrapper as a subclass of ``TextIOWrapper`` (or as a ``TextIOBase`` implementation that delegates to a ``TextIOWrapper``), with its own ``readline`` override? That would solve all of the problems--but unfortunately, it's at the very least difficult, if not impossible, to do this well. See the bikeshedding section and the separate PEP on adding a ``peek`` method ([#pep-peek]_) for further details.
It's worth mentioning that wrapping an already-open file is already necessary today, and will be exactly as necessary to handle alternative newlines as it is to handle, e.g., changing encodings today.

As a side note, it's not as easy as it should be to move the ``buffer`` from one ``TextIOWrapper`` to another (see [issue #14017]_), but that's a less serious issue, and not a new one raised by this proposal. (In fact, exposing the *newline* argument as an attribute, as proposed below, partly solves that existing problem.)

What about files that don't come from ``open`` or the ``io`` module?
--------------------------------------------------------------------

Most people who ask this are asking about something like this:

    I see another problem with doing this by modifying the ``open()`` call: it does not work for filehandles creates using other methods such as ``pipe()``

This is just a misconception of how files work in Python. The file descriptors you get back from ``os.pipe()`` are not file objects, they're just integers. These file descriptors don't have a ``readline`` method, or any notion of lines at all; the only way to read from them is to call ``os.read``. If you want to use such a file descriptor as a file object, the way you do it is to call ``open(fd)``. In other words, it *does* work for file handles created using methods such as ``os.pipe()``.

However, there are cases where you've got an actual file-like object handed to you by some other module, something that supports much of the file protocol but doesn't actually implement any of the ABCs or otherwise use the ``io`` module. And often, that something won't be sufficiently close to the ABCs to allow wrapping it up. So (assuming something like ``resplit`` isn't good enough for your use case), how does this proposal solve that problem?

The answer is simple: it doesn't, and there's no possible way any proposal reasonably could.
The module has to be rewritten to give you objects that actually implement one of the ``io`` ABCs, or at least duck type it closely enough, if you want the ``io`` module to help you at all. The good news is that such a change will almost trivially add support for alternative newlines, so there won't be any need to wrap up the result once the change is made. Other languages --------------- Many languages provide a way to change the separator to a single character, an arbitrary string, or a regular expression, either globally, for a specific file, or for a single line. * In AWK [#gawkman]_, setting the ``RS`` variable to any non-empty string or (in GNU awk) regular expression makes that string or regexp the line terminator for the current input file. Setting it to an empty string terminates on a blank line (that is, ``'\n\n'``) in some versions, or on any sequence of one or more blank lines (that is, ``'\n\n+'``) in others. * In sed [#sedman]_, the line terminator is always ``'\n'``. * In Perl [#perldoc]_, the ``$/`` variable, if set, has the exact same effect as gawk's ``RS``; if unset, ``'\n'`` is the line terminator. This variable can be set globally or locally, with the usual Perl scoping rules. * Ruby [#rubydoc]_, ``gets`` and similar methods take an optional argument ``sep``. If ``sep`` is a non-empty string, that string is used as a line terminator; if it's an empty string, any sequence of one or more blank lines is the terminator; if it's nil, the entire file is read in. The global variable ``$/`` provides a default value for ``sep``. * C/POSIX [#posixdoc]_ ``gets`` uses the platform-specific default newline sequence for text files, ``'\n'`` for binary files. (C does not require ``'\n'`` to mean ``'\x0a'``, but POSIX does.) No alternatives are provided, but the ``getc`` macro is intended to be fast enough to loop character by character, and it's reasonably well known how to use ``fscanf`` to read up to an arbitrary character. 
* PHP [#phpdoc]_ ``fgets`` works like C, but it also provides a ``stream_get_line`` function that takes an arbitrary string ``$ending`` argument.
* Node.js [#nodedoc]_ always uses ``'\n'`` as the line terminator.
* In C++ ``iostreams`` [#cppdoc]_, the ``getline`` function takes any single character (``widen``-able to the appropriate type) as a delimiter, defaulting to ``'\n'``. (There is no way to handle multiple-character separators, but you can use text-mode translation to deal with the special case of ``'\r\n'``.) However, the ``iostreams`` library was intentionally designed as a number of layered components that expose just enough information that it should be easy to write and plug in new functionality.
* Java [#javadoc]_ ``readLine`` always accepts any of ``'\n'``, ``'\r'``, or ``'\r\n'`` as the line terminator. However, as with C++, the ``java.io`` classes are intentionally designed to let you build and stack filters to add your own functionality.
* .NET [#dotnetdoc]_ ``ReadLine`` always uses ``'\r\n'`` as the line terminator.
* Haskell [#haskelldoc]_ ``mkFileHandle`` takes a ``NewlineMode`` at file construction time. However, you typically read the whole file as a lazy string, and use lazy split functions and the like rather than calling anything like ``readline``.
* Cocoa [#cocoadoc]_ has no line reading or buffering; you're expected to do it manually or use C stdio (or just read the whole file in at once, which there are a zillion ways of doing).

Specification
=============

Except for the ``open`` function [#open]_, all of the changes are within the ``io`` module [#io]_. (In CPython, this is actually the ``_io`` package of C extension modules and the ``_pyio`` Python module; the ``io`` module itself is just a wrapper, which won't need any changes.)

``open``
--------

The ``open`` function currently takes a *newline* argument, but raises if given a non-``None`` value for a binary file. This will change to just pass the value along.
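
For reference, the behavior being replaced can be checked on an interpreter that does not implement this proposal. The following is only a sketch: the temporary file, its contents, and the ``try_open`` helper are illustrative and not part of the specification.

```python
import os
import tempfile

# Create a small NUL-separated file to open (contents are illustrative).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\0two\0")

def try_open(mode, newline):
    """Return 'ok' if open() accepts the arguments, else the exception name."""
    try:
        with open(path, mode, newline=newline):
            return "ok"
    except (ValueError, TypeError) as exc:
        return type(exc).__name__

# Without this proposal, binary mode rejects any non-None newline, and
# text mode rejects values other than '', '\n', '\r', and '\r\n'.
binary_result = try_open("rb", "\n")
text_result = try_open("r", "\f")
standard_result = try_open("r", "\r")

os.remove(path)
```

Under the proposal, the first two calls would succeed instead of raising (the binary-mode call with a byte string rather than a ``str``).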
The documentation currently says:

    *newline* controls how *universal newlines* mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

    When reading input from the stream, if *newline* is ``None``, universal newlines mode is enabled. Lines in the input can end in ``'\n'``, ``'\r'``, or ``'\r\n'``, and these are translated into ``'\n'`` before being returned to the caller. If it is ``''``, universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.

    When writing output to the stream, if *newline* is ``None``, any ``'\n'`` characters written are translated to the system default line separator, ``os.linesep``. If *newline* is ``''`` or ``'\n'``, no translation takes place. If *newline* is any of the other legal values, any ``'\n'`` characters written are translated to the given string.

It will instead say:

    *newline* controls the line separator, and (for text mode) how *universal newlines* mode works. It can be None, or any string (for text mode) or byte string (for binary mode). It works as follows:

    When reading input from the stream, if *newline* is ``None``, the behavior depends on the mode. For text mode, universal newlines mode is enabled. Lines in the input can end in ``'\n'``, ``'\r'``, or ``'\r\n'``, and these are translated into ``'\n'`` before being returned to the caller. For binary mode, lines can only end in ``b'\n'``. If it is an empty string, in text mode, universal newlines mode is enabled, while in binary mode, only ``b'\n'`` is a line ending. In either mode, line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
    When writing output to the stream, for binary files, *newline* is ignored. For text files, it controls output translation. If *newline* is ``None``, any ``'\n'`` characters written are translated to the system default line separator, ``os.linesep``. If *newline* is ``''`` or ``'\n'``, no translation takes place. If *newline* is any of the other legal values, any ``'\n'`` characters written are translated to the given string.

In particular, for text modes, the *newline* argument is passed only to the ``TextIOWrapper`` (as is already true), and for binary modes it's passed only to the ``BufferedReader``, ``BufferedWriter``, or ``BufferedRandom`` (instead of raising).

``IOBase``
----------

``IOBase`` will grow a new attribute:

* ``newline`` (usually the *newline* value passed to ``open``) will be used for recognizing line terminators.

This is not part of the ``IOBase`` API and may not exist in some implementations. For all of the concrete classes in ``io``, ``newline`` will be present, and read-only, and immutable (e.g., ``bytes`` or ``str``).

``IOBase`` is also where ``readline`` is documented, and it also provides a default mixin implementation that's used by all of the binary file types in the module (and many elsewhere). The documentation currently says:

    The line terminator is always ``b'\n'`` for binary files; for text files, the *newline* argument to ``open()`` can be used to select the line terminator(s) recognized.

It will instead say:

    The ``newline`` attribute, if present, will select the line terminator(s) recognized, as explained in ``open()``. If not present, the default is ``b'\n'`` for binary files, or any of ``'\n'``, ``'\r'``, or ``'\r\n'`` for text files.

The changes to the implementation of this function are pretty obvious: search for ``self.newline`` if present, ``b'\n'`` otherwise, instead of always ``b'\n'``, and make sure to add ``len(newline)`` instead of ``1`` when found.
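
The change just described can be sketched in pure Python. This is not the ``_pyio`` or C implementation; ``line_end``, ``readline_sketch``, and ``NulBytesIO`` are hypothetical names used only for illustration, and the byte-at-a-time loop trades speed for simplicity.

```python
import io

def line_end(buf, newline=b"\n"):
    """Return the index just past the first terminator in buf, or len(buf)."""
    i = buf.find(newline)
    # Advance by len(newline), not 1, so multi-byte terminators stay whole.
    return i + len(newline) if i >= 0 else len(buf)

def readline_sketch(stream, size=-1):
    """Byte-at-a-time readline honoring an optional ``newline`` attribute."""
    # Use the stream's newline if present and non-empty, b'\n' otherwise.
    newline = getattr(stream, "newline", None) or b"\n"
    res = bytearray()
    while size < 0 or len(res) < size:
        b = stream.read(1)
        if not b:
            break
        res += b
        if res.endswith(newline):
            break
    return bytes(res)

class NulBytesIO(io.BytesIO):
    # Hypothetical: a class attribute stands in for the proposed property.
    newline = b"\0"

stream = NulBytesIO(b"a\nb\0c\0tail")
records = [readline_sketch(stream) for _ in range(3)]
plain = readline_sketch(io.BytesIO(b"x\ny"))
```

Here the embedded ``b'\n'`` in the first record is ordinary data, and a stream without a ``newline`` attribute keeps the existing ``b'\n'`` behavior.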
(Both the Python and C implementations have clever tricks that implicitly rely on the ``1``. The Python version uses ``(readahead.find(b"\n") + 1) or len(readahead)``, assuming that ``-1 + 1`` is falsey; the C version uses a ``++`` in the line that checks for ``'\n'`` so it doesn't have to increment if found. Although these shortcuts do make the code a little briefer, they don't seem to have any performance benefits, and make it a little harder to understand, so there's no great loss in giving them up.)

Binary file classes
-------------------

``RawIOBase`` and ``BufferedIOBase`` need no changes.

``FileIO``, ``BufferedReader``, and ``BytesIO`` will now take a *newline* argument, convert it to ``bytes`` if necessary, store it in struct member ``self->newline`` (C) or attribute ``self._newline`` (Python), and expose it as a read-only property ``self.newline``.

``BufferedWriter`` will take a *newline* argument and ignore it.

``BufferedRWPair`` will take a *newline* argument and pass it to the reader and writer objects it constructs.

``BufferedRandom`` will take a *newline* argument and pass it to its reader and writer supers.

Text file classes
-----------------

``TextIOBase`` needs no changes.

``TextIOWrapper`` currently does almost everything we want, except that the constructor raises a ``ValueError`` if the *newline* argument is a string, but ``not in ('', '\n', '\r', '\r\n')``. This check will be removed. For consistency, it will also expose ``self->readnl`` (C) or ``self._readnl`` (Python) as read-only property ``self.newline``.

``StringIO`` inherits all relevant behavior from ``TextIOWrapper``, and needs no changes.

Outside the ``io`` module
-------------------------

There are many other classes, both in the stdlib and elsewhere, that implement the stream API, and most of them would of course benefit from adding the same *newline* parameter. But none of them are required to do so. Some classes may just work with the changes; others will not.
But for those that don't, it will be obvious that they either don't have any way to provide a *newline* argument, or raise when given one.

At any rate, changes to the rest of the stdlib are outside the scope of this proposal; if needed, they can be done as separate enhancements later (just as they will be in third-party libraries).

The shed
========

Where to specify the alternative newline
----------------------------------------

This is the biggest area of contention. There are a few places the newline could be specified:

* A magic global or local variable (as in awk, Perl, and Ruby).
* In each call to ``readline`` and friends (as in Ruby, PHP, and C++).
* In an attribute of the file object (as... nowhere?).
* At file object constructor time (as in Haskell and C++ with Boost).

The first is obviously a non-starter for Python.

Adding an optional parameter to the ``readline`` method seems attractive at first, but has two serious problems. First, you can't pass the same parameter to ``__iter__``. Second, this would mean that the hundreds of existing file objects that either implement one of the ``IOBase`` ABCs or duck type the implicit stream protocol are no longer valid file objects--and, worse, there's no good way for a consumer to check whether a given file object can take a *newline* parameter to its ``readline`` method.

Adding new methods--e.g., ``readrecord``, ``iterrecords``, and ``readrecords`` to parallel ``readline``, ``__iter__``, and ``readlines``--solves the first problem, and ameliorates the second (at least you can check with ``hasattr`` or EAFP), but it's still a serious break with the API to add new methods that existing file objects don't support. (Also, while this is a much more minor problem, for ``iterrecords`` to work, unlike ``__iter__``, it has to return a new iterator object that references the file.)

Adding an attribute to the file object avoids some of these problems.
However, changing the line terminator on the fly seems to be at least as potentially confusing to the user, and as limiting to the implementation, as changing the encoding; Python doesn't allow the latter, so it shouldn't allow the former.

Adding a parameter to the constructors solves all of these problems. Construction is not part of the stream protocol; nobody expects that opening a file on disk, wrapping a transport in a file, opening a file within an archive, etc. will have the same interface, and in general, they don't. The ``__iter__`` method will work exactly as expected, with no changes.

The difference between adding a *newline* argument to ``readline`` and adding it to the constructor may not seem obvious, but consider where you'd find the problem if you tried to use a file-like object that didn't support it. With the constructor change, before even writing your code, while looking up the construction syntax for creating the object, you'd discover that it has no *newline* argument. With the ``readline`` change, you'd write the code and then, later, get an error far away in your code (or in some library that you use, or in some customer's code that uses yours) because you have an object that looks like a file object, and claims to be a ``BufferedReader``, but has the wrong signature on its ``readline`` method.

Text files or binary files?
---------------------------

During the previous discussions, some people have been convinced that this feature is obviously needed only for binary files, and makes no sense for text files--after all, files with ``'\0'`` characters are obviously not text.

However, that doesn't really work. The output of ``find`` is a list of filenames, and it makes perfect sense to open it with ``encoding=sys.getfilesystemencoding()`` as a text file. Adding the ``-print0`` argument shouldn't change that; it's still a list of filenames in the same encoding, they're just separated by an encoded null instead of an encoded newline.
(It's true that ``find`` is separating encoded filenames with the ``'\0'`` byte, not whatever that happens to decode to, but it's no accident that the ``'\0'`` byte always means null in any charset that can be used by a filesystem under Unix.) Also, consider the case of UTF-16-LE strings with ``U+0000`` terminators. If there's an ASCII character followed by a newline, the second half of the ASCII character and the first half of the newline will be ``'\0\0'``, so the line will be split one character early, leaving half a character in each line. There's really no sensible way to read such a file except by decoding before splitting lines--that is, by reading it as a text file. Conversely, a few people were convinced that this feature only makes sense for text files, because binary files don't do newline conversion--or, more fundamentally, they don't have a concept of lines, so how can they have a concept of newlines? The fact that binary files have a ``readline`` method in the first place already implies that the concept makes sense. Also, there are plenty of "binary" file formats that are basically ASCII text. Consider an HTTP response: it's an ASCII status line and headers with ``'\r\n'`` newlines, followed by a blank line, then an arbitrary body (which is often a text file in a different encoding, and with different newlines). While you usually can get away with parsing it as a binary file with ``'\n'`` line endings and stripping off the ``'\r'`` (which many people do today in Python--and, for that matter, in PHP), it's obviously *more* reasonable to use the proper line ending, not *less*. Output newline translation -------------------------- Opening text files with a specific *newline* value not only changes the line separator for input, it also causes any ``'\n'`` characters to be translated to the given string for output. Should the same be true for binary files? 
On the one hand, it seems simpler to make the behavior identical for binary and text files than to try to explain how they're different. On the other hand, it doesn't seem like a good idea to add write translation for binary files, which have never had them, and which seems conceptually wrong. This proposal chooses the latter, because it's a smaller and hopefully less surprising change. Input newline translation ------------------------- Some people have suggested that getting strings or byte strings back from ``readline`` that don't end in ``'\n'`` could be confusing. After all, the fact that the file was opened with ``newline='\0'`` could be very distant in the code from where the ``readline`` call happens. The first thing to notice is that the same problem already exists. If you open a file with ``newline='\r'``, your lines will end with ``'\r'``, not ``'\n'``. Code that can't deal with that will have the exact same problem dealing with ``'\0'``. The fact that some code gets away with it by sloppily calling ``rstrip()`` on each line (which happens to work on ``'\r'``, along with truncating any other trailing whitespace, which is probably a bug...) doesn't mean we want to encourage that. Multiple characters? -------------------- A few people have raised the issue that multiple-character separators are both harder to deal with in the implementation, and harder to think about for the user. However, ``'\r\n'`` pretty obviously needs to be handled. Also, ``'\n\n'`` is a common enough use case for awk, Perl, and Ruby all to have special-casing to deal with it. Stream stacks ------------- A typical file object is a stack of streams. For example, opening a file in ``"r"`` mode constructs a ``FileIO`` (a ``RawIOBase`` instance), wraps that in a ``BufferedReader`` (a ``BufferedIOBase`` instance), then wraps that in a ``TextIOWrapper`` (a ``TextIOBase`` instance). 
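
The stack described above can be inspected directly on an existing interpreter. A minimal sketch, assuming an ordinary on-disk file (the temporary file is illustrative only):

```python
import io
import os
import tempfile

# Create an empty regular file to open (illustrative only).
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "r") as f:
    # Text layer, buffered binary layer, raw layer.
    layers = (type(f), type(f.buffer), type(f.buffer.raw))
    abcs_ok = (isinstance(f, io.TextIOBase)
               and isinstance(f.buffer, io.BufferedIOBase)
               and isinstance(f.buffer.raw, io.RawIOBase))

os.remove(path)
```

Each layer has its own ``readline``, which is why the question of which layers receive the *newline* value arises at all.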
So, when opening a file in ``"r"`` mode with a non-default *newline* value, should the newline be passed to all three objects (or, rather, ``newline.encode(encoding)`` passed to the first two), so that ``f.buffer.readline()`` or ``f.buffer.raw.readline()`` will use the same separator as ``f.readline()``? While that seems attractive at first, it may not be a good idea. Consider the example of null-separated UTF-16 again; passing ``b'\0\0'`` to the ``FileIO`` and ``BufferedReader`` is not going to work as expected. Also, it's extra complexity that will probably rarely if ever be used. Making delegation/inheritance easier instead -------------------------------------------- As Guido suggested twice, and others suggested as well, the ideal way to solve this problem would be either: * Rewrap the file in a subclass of ``TextIOWrapper`` or ``BufferedReader`` that adds the extra functionality (overridden ``readline`` with or without an extra argument, separate ``readuntil``, whatever you prefer). * Wrap the file in a class that adds the extra functionality and delegates everything else to ``TextIOWrapper`` or ``BufferedReader``. Unfortunately, the design of the ``io`` module makes that difficult. Briefly, the main problem is that ``TextIOWrapper.readline`` needs access to the internals of its buffer, and there doesn't seem to be any remotely simple or efficient way to implement ``readline`` short of reimplementing all of ``TextIOWrapper``. And of course in CPython, you'd have to do at least a large part of that in C for performance; if the pure-Python implementation of ``io.TextIOWrapper`` was too slow to be acceptable, you're probably not going to do significantly better. See [#pep-peek]_ for a possible solution to this problem, and further discussion. Unfortunately, Guido seems to be strongly against this idea. (For example: "I never meant to suggest anything that would require pushing back data into the buffer.") So, I haven't tried to push it. 
Also, even ignoring Guido, fundamentally changing the API of the ``io`` module is obviously not something to be undertaken lightly; if it's worth doing at all, it's probably worth re-evaluating the design to make sure ``peek`` is really sufficient for everything that might come up in the future, rather than just solving this one use case and possibly having to change the whole design again two years from now. Storing the newline value ------------------------- Most binary file types inherit their ``readline`` implementation from ``IOBase``, so ``IOBase`` will need to be able to access the *newline* value passed to the constructor. Since the ABCs have no public constructor, the obvious way to make the *newline* value available is to expose it as a member, which ``readline`` will use if present (defaulting to ``b'\n'`` if not). There are already a number of such members and methods defined in the ABCs' documentation. To prevent users from changing this value in mid-stream, it should be a read-only attribute, and also immutable, but of course there's no need for code to verify that for third-party classes (consenting adults and all that). So, the concrete binary classes in ``io`` will just store the constructor argument in a private variable, copying it to ``bytes`` if given a ``bytearray`` or other mutable bytes-like object, and expose it as a read-only property. ``TextIOWrapper`` does not inherit its ``readline``, but for consistency it might as well change to using a public ``newline`` attribute instead of a private variable as it does today. This also solves part of issue #14017, as mentioned earlier. 
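
The storage scheme described in this section can be sketched as follows. ``NewlineHolder`` is a hypothetical stand-in for a concrete binary class, not a proposed name; it shows only the copy-to-``bytes`` step and the read-only property.

```python
class NewlineHolder:
    """Hypothetical stand-in for a concrete binary class in ``io``."""

    def __init__(self, newline=None):
        # Copy mutable bytes-like input (e.g. bytearray) to immutable bytes.
        self._newline = None if newline is None else bytes(newline)

    @property
    def newline(self):
        # Read-only: no setter is defined.
        return self._newline

mutable = bytearray(b"\r\n")
holder = NewlineHolder(mutable)
mutable[:] = b"\0"  # later mutation does not affect the stored value

try:
    holder.newline = b"\n"
    assignable = True
except AttributeError:
    assignable = False
```

Third-party classes are free to expose ``newline`` as a plain attribute instead; as noted above, nothing verifies immutability for them.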
Simplified rewrapping
---------------------

Since adding a ``newline`` attribute partially solves the problem of rewrapping files, and the use cases for this change may sometimes require rewrapping files, this might be a perfect time to completely solve the problem: add the other missing attribute, ``write_through``, or add the ``rewrap`` method suggested in issue #14017 (and also add it to the ``BufferedFoo`` classes, presumably).

While those are both good ideas, they don't really seem to be in scope for this change.

Fixing the entire stdlib
------------------------

As long as there are third-party file-like objects, there will be file-like objects that don't support this new behavior. That's why it was important to make the change outside of the API in the first place.

So, I don't think it's necessary to fix all file-like objects in the stdlib. Adding *newline* support to any such type would be an enhancement that could be done independently, for the stdlib as for any third-party project.

But, in case others disagree, I made a quick survey of classes that inherit one of the ``io`` classes or implement ``readline`` to get an overview of the situation. All the ones I looked at fall into one of these categories:

* Call ``open`` (or some other function like ``socket.makefile`` that would presumably already be handled) and delegate ``readline`` to the result (e.g., ``tempfile.NamedTemporaryFile``): should just work, except that some explicitly validate the *newline* argument, and need to stop.
* Inherit from ``TextIOWrapper`` and use its ``readline`` implementation: should just work.
* Inherit from one of the binary concrete classes or ABC mixins and use its ``readline`` (e.g., ``socket.makefile``): just need to take a *newline* parameter and pass it along.
* As above, but override ``readline`` to add a fast-path optimization that's used in certain cases (e.g., ``bz2.BZ2File``): also need to skip the fast path ``if self.newline is not None``.
  (By the way, from a quick test, the fast path is not actually faster here, and I'm guessing the same is true for any similar class that provides a working ``peek`` method, especially those implemented in Python. Of course that probably wasn't true in 3.0, but I'm guessing nobody tested after the ``io`` overhaul. So in some cases, it might be worth just removing the override entirely.)
* As above, but provide a more complicated override of ``readline`` (e.g., ``zipfile.ZipExtFile``): every example I've found is weird or buggy in some way--e.g., tries to implement 2.x-style universal newlines even on binary files. I'm not sure what it would even mean to add *newline* support to such a class. Should it try to fit in with the existing text-like behavior on binary files by treating *newline* the way text files do? Turn off the universal newline support?
* Not used as a file, or only used internally within the module (e.g., ``email.feedparser``): no change needed.
* Only somewhat file-like (e.g., ``codecs.StreamReader``, ``fileinput``): obviously these aren't going to be able to inherit behavior from the ``io`` classes, and it's arguable whether they should. That being said, at least for ``fileinput``, a *newline* argument could be useful, and wouldn't be that hard to implement. (It just calls ``open`` on each file and reads its lines normally.)

The only exception I found was ``gzip.GzipFile``, which implements the ``io`` ABCs, but provides its own custom ``readline`` from scratch rather than delegating to the super. I'm not sure this is actually necessary, but, if it is, that class won't gain the *newline* feature except by implementing it manually.

What about just adding '\0'?
----------------------------

Multiple people suggested that since '\0' is the main example everyone keeps coming up with, maybe we just need to add that to the list of valid newlines, rather than open it up to all possible strings.
First, as shown at the top, there are other examples; ``'\0'`` may be the
first one that comes to mind, but it's not the only one.

Second, it doesn't really simplify things, either in the documentation or in
the implementation. We still need to add alternative newline support for
binary files; we still need to change any text file classes that check their
*newline* argument before passing it on to the superclass, or that do
``readline`` manually; we still need to (and already can) handle
multi-character newlines because of ``'\r\n'``; etc.

Finally, adding just ``'\0'`` seems more likely to add to confusion. For
example, if there are exactly four valid non-empty values for *newline*, and
three of those four are also the endings handled by universal newlines, a
reader could easily be forgiven for expecting ``'\0'`` to also be handled by
universal newlines. If, instead, any non-empty string is allowed for
*newline*, no one will be confused by the list of three universal-newlines
endings.

Why not a different parameter instead of reusing *newline*?
-----------------------------------------------------------

For text files, *newline* already controls how lines are terminated. If we
added another parameter that did the exact same thing, how would they
interact?

If we were redesigning the ``io`` module from scratch, maybe it would be
better to separate out the universal-newlines flag and the input line
terminator string. But there doesn't seem to be a clean way to make that
change without breaking all existing uses of *newline*.

Possibly-relevant BDFL comments
===============================

Guido said that he doesn't have time to follow the discussion, but wants to
make sure all of his concerns are met. So, this section lists the concerns
he's raised, in hopes that readers can point out any that haven't been
answered.

* "I don't think it is reasonable to add a new parameter to readline(),
  because streams are widely implemented using duck typing -- every
  implementation would have to be updated to support this."

  Later comments echoed the same idea; asking everyone to add an additional
  method to their file-like object classes like ``readuntil``, or to modify
  an existing method like ``readline``, is "unreasonable", and also breaks an
  established API.

  It's hard to argue with this one, which is exactly why the proposal
  suggests putting the parameter outside the API, in the
  ``open``/``__init__`` methods, instead of changing the API. (Credit to
  Alexander Heger for first suggesting this solution on the email thread, and
  to R. David Murray for suggesting the same thing on the bug tracker.)

* "I don't like changing the meaning of the newline argument to open (and it
  doesn't solve enough use cases any way)."

  This is discussed above; any separate new feature would have to interact in
  some hard-to-explain way with the existing feature, and I think that would
  just mean more confusion, not less. (For the parenthetical, I'm not sure
  what use cases it fails to solve.)

* The newline argument is pretty obscure and not very well known. ("Well, I
  had to look up the newline option for open(), even though I probably
  invented it.") Is it a good idea to add functionality in a place nobody
  knows to look?

  I don't have a great answer here, except that a new parameter isn't likely
  to be any more discoverable than an existing one. (Also see the previous
  question.)

* "I personally think it's preposterous to use ``\0`` as a separator for text
  files (nothing screams binary data like a null byte :-)."

  This is discussed above, but consider the paradigm case of a bunch of UTF-8
  filenames or, worse, UTF-16-LE translation strings separated by ``'\0'``
  instead of ``'\n'`` in case some of them have embedded ``'\n'`` characters.
  It's clearly text, and trying to process it as binary data makes things
  more difficult.

* "I don't think it's a big deal if a method named ``readline()`` returns a
  record that doesn't end in a ``\n`` character."

* "I value the equivalence of ``__next__()`` and ``readline()``."

  I think this is less important to me than it is to Guido, but it is an
  argument for not putting the separator in the ``readline`` call.

* "I still think you should solve this using a wrapper class (that does its
  own buffering if necessary, and implements the rest of the stream protocol
  for the benefit of other consumers of some of the data)."

  This is the sticking point. I would love to solve it that way, but I can't
  without a change to the ``io`` module that I think he doesn't want.

References
==========

.. [#issue1152248] http://bugs.python.org/issue1152248
.. [#thread2014] http://thread.gmane.org/gmane.comp.python.ideas/28310
.. [#thread2005] https://mail.python.org/pipermail/python-list/2005-February/
.. [#gnufind] http://linux.die.net/man/1/find
.. [#bsdfind] http://www.freebsd.org/cgi/man.cgi?find(1)
.. [#rename] https://github.com/ap/rename/blob/master/rename#L179
.. [#awktut] http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
.. [#issue14017] http://bugs.python.org/issue14017
.. [#pep-peek] At this point, still an unsubmitted draft, but it can be found
   in [#thread2014]_.
.. [#gawkman] https://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html
.. [#sedman] http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html
.. [#perldoc] http://perldoc.perl.org/perlvar.html
.. [#rubydoc] http://www.ruby-doc.org/core-2.1.2/IO.html#method-i-gets
.. [#nodedoc] http://nodejs.org/api/readline.html#readline_event_line
.. [#cppdoc] http://www.cplusplus.com/reference/string/string/getline/
.. [#posixdoc] http://pubs.opengroup.org/onlinepubs/9699919799/functions/gets.html
.. [#phpdoc] http://us3.php.net/manual/en/function.stream-get-line.php
.. [#javadoc] http://docs.oracle.com/javase/6/docs/api/java/io/BufferedReader.html#readLine()
.. [#dotnetdoc] http://msdn.microsoft.com/en-us/library/system.io.streamreader.readline(v=vs.110).aspx
.. [#haskelldoc] http://hackage.haskell.org/package/base-4.7.0.1/docs/GHC-IO-Handle.html#v:mkFileHandle
.. [#cocoadoc] https://developer.apple.com/library/mac/documentation/Cocoa/Reference/Foundation/Classes/NSFileHandle_Class/Reference/Reference.html#//apple_ref/occ/instm/NSFileHandle/readDataOfLength:
.. [#open] https://docs.python.org/3/library/functions.html#open
.. [#io] https://docs.python.org/3/library/io.html

Copyright
=========

This document has been placed in the public domain.