classification
Title: can't specify newline string for readline for binary files
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: akira, martin.panter, pitrou, r.david.murray, susurrus, v+python
Priority: normal Keywords:

Created on 2013-01-30 18:32 by susurrus, last changed 2014-09-05 02:51 by martin.panter.

Messages (9)
msg180984 - (view) Author: Bryant (susurrus) Date: 2013-01-30 18:32
When opening binary files in Python 3, the newline parameter cannot be set. While this kind of makes sense, readline() can still be used on binary files. This is great for my usage, but it is doing universal newline mode, I believe, so that any \r, \n, or \r\n triggers an EOL.

The data I'm working with is mixed ASCII/binary, with line termination specified by \r\n. I can't read a line (even though that concept occurs in my file) because some of the binary data includes \r or \n even though they aren't newlines in this context.

The issue here is that if the newline string can't be specified, readline() is useless on binary data, which often uses custom EOL strings. So would it be reasonable to add the newline parameter support to binary files? If not, then shouldn't readline() throw an exception when used on binary files?

I don't know if it's helpful here, but I've written a binary_readline() function supporting arbitrary EOL strings:

def binary_readline(file, newline=b'\r\n'):
    line = bytearray()
    newlineIndex = 0
    while True:
        x = file.read(1)
        if x:
            line += x
        else:
            if len(line) == 0:
                return None
            else:
                return line
        # If this character starts to match the newline string, start that comparison til it matches or doesn't.
        while line[-1] == newline[newlineIndex]:
            x = file.read(1)
            if x:
                line += x
            else:
                return line
            newlineIndex += 1
            if newlineIndex == len(newline):
                return line
               
        # We failed checking for the newline string, so reset the checking index
        newlineIndex = 0
msg180988 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-01-30 19:18
If you are reading in binary mode, then all readline does is get you the next \n terminated chunk of data, which is a convenience in some circumstances.  You have to do all the newline handling yourself.  Otherwise it isn't a binary read.

I think the "right way" to do what you want is to write a custom subclass of one of the IO classes.  Should such a subclass be added to Python?  That's an interesting question.  A first step toward an answer might be to post it as a recipe on the ActiveState site and see how many people find it useful.
msg180994 - (view) Author: Bryant (susurrus) Date: 2013-01-30 21:42
I'm not terribly worried about the "right" way for me to deal with my code, but that Python, in this instance, is inconsistent. While it doesn't want you to apply the concept of a "line" to a binary file in that it prevents you from specifying an EOL string, it does allow you to read that file as lines.

So my question is why shouldn't I be able to specify a newline of b"\r\n" and then use readline() on my binary file? I don't see why that concept shouldn't be applied here when it's definitely applicable in a lot of cases (any binary log format).

To resolve this I really think there're two options to maintain the consistency of Python's approach:
 1) Have readline() error out for binary-mode files.
 2) Allow specifying a byte-string as the newline string for binary files that readline() would then use.
msg181099 - (view) Author: Glenn Linderman (v+python) * Date: 2013-02-01 19:20
I think Bryant's request is reasonable, for consistency in functionality. If line oriented operations are allowed on binary files, then a binary newline value should be permitted at the time of open.

I think, for handling binary files, that it would also be interesting to have a version of readline that takes a binary newline file as a parameter to each readline call, because in binary files, the concept of newline can vary from section to section of the file... here, null-terminated records, there CR LF terminated encoded text records, elsewhere fixed-length records, and another place might have records delimited by some binary token of one or more bytes.  Readline with a newline parameter could be useful in three of those cases, read in the fixed-length case.  But this paragraph would be a new feature.

However, simpler binary files, which may have only one type of "terminated" records, could effectively use the operations Bryant is suggesting, which seems quite reasonable to me, along with a mix of read calls for non-delimited data, fixed-length data, or data requiring complex logic to decode.
msg181100 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-01 19:31
Anything we do here is a new feature.

I have no objection to adding features in this area myself, but I will note that I was shot down for proposing (in another issue) that the newline attribute for text files be allowed to be an arbitrary string.
msg181101 - (view) Author: Bryant (susurrus) Date: 2013-02-01 19:39
While my original description of this issue discussed arbitrary strings, I'd like to limit the scope of this issue down to just supporting the newline parameter to builtin.open() for binary files, just as it's supported for regular files. This adds no real new functionality, just makes the treatment of the concept of a "line" consistent between binary files and regular text files, since that concept *does* exist in many cases in binary files.

Given that everyone here seems to think this is at least reasonable at this point, what would the next step be given this is somewhere between a library fix and a feature addition? I haven't contributed to Python before, but the developer FAQ mentions either python-ideas or a PEP, or should this move right towards a patch?
msg181104 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-01 20:03
Well, it's a feature by our policy, since it currently works as documented.

Probably the first thing would be to get the opinion of someone who works on the IO module, so I've nosied Antoine.  Note that this was obviously a conscious design decision, as it is documented clearly.

Another possible first step is the one I suggested: posting a recipe on the recipe site and seeing if there is any uptake.  Python-ideas would be another option.  For this level of change, I don't believe a PEP is required :)
msg181131 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-01 23:48
I think readline() is quite clear in its current intent (it reads a line, which conventionally is delimited by '\n').
We mau want to add another method, say read_until_delimiter(), but it's not the same as readline() ;-) I think it would be best for it to be discussed, say on python-ideas (*), before any decision is made.

(*) http://mail.python.org/mailman/listinfo/python-ideas
msg223996 - (view) Author: Akira Li (akira) * Date: 2014-07-25 20:22
Related issue #1152248: Add support for reading records with 
arbitrary separators to the standard IO stack

It suggests to extend the newline support for both text and
binary files.
History
Date User Action Args
2014-09-05 02:51:16martin.pantersetnosy: + martin.panter
2014-07-25 20:22:35akirasetnosy: + akira
messages: + msg223996
2013-02-01 23:48:02pitrousetmessages: + msg181131
2013-02-01 20:03:29r.david.murraysetnosy: + pitrou
messages: + msg181104
2013-02-01 19:39:56susurrussetmessages: + msg181101
2013-02-01 19:31:03r.david.murraysetmessages: + msg181100
versions: + Python 3.4, - Python 3.3
2013-02-01 19:20:31v+pythonsetnosy: + v+python
messages: + msg181099
2013-01-30 21:42:42susurrussetmessages: + msg180994
2013-01-30 19:18:58r.david.murraysettype: behavior -> enhancement

messages: + msg180988
nosy: + r.david.murray
2013-01-30 18:32:05susurruscreate