classification
Title: reading individual bytes of multiple binary files using the Python module fileinput
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Tommy.Carstensen, josh.r
Priority: normal Keywords:

Created on 2014-03-20 09:39 by Tommy.Carstensen, last changed 2014-03-24 23:18 by josh.r. This issue is now closed.

Messages (6)
msg214195 - (view) Author: Tommy Carstensen (Tommy.Carstensen) Date: 2014-03-20 09:39
This is my first post on bugs.python.org. I hope I abide by the rules. It was suggested to me on stackoverflow.com that I request an enhancement to the module fileinput here:
http://stackoverflow.com/questions/22510123/reading-individual-bytes-of-multiple-binary-files-using-the-python-module-filein

I can read the first byte of a binary file like this:

    with open(my_binary_file,'rb') as f:
        f.read(1)

But when I run this code:

    import fileinput
    with fileinput.FileInput(my_binary_file, mode='rb') as f:
        f.read(1)

then I get this error:

    AttributeError: 'FileInput' object has no attribute 'read'

I would like to propose an enhancement to fileinput, which makes it possible to read binary files byte by byte.

I posted this solution to my problem:

    def process_binary_files(list_of_binary_files):
        # Yield the first byte of each file in turn.
        for file in list_of_binary_files:
            with open(file, 'rb') as f:
                yield f.read(1)

    list_of_binary_files = ['f1', 'f2']
    generate_byte = process_binary_files(list_of_binary_files)
    byte = next(generate_byte)
msg214739 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-03-24 21:44
fileinput's semantics are heavily tied to lines, not bytes. And processing binary files byte by byte is rather inefficient; can you explain why this feature would be of general utility such that it would be worth including it in the standard library?
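
To illustrate the line orientation, here is a minimal sketch. It assumes two small sample files (the names `f1` and `f2` and their contents are made up for the demonstration) and collects everything that iterating `fileinput.input(..., mode="rb")` yields. Each item is a whole `bytes` line, never a single byte, and the line boundaries follow the newline characters in the data.

```python
import fileinput
import os
import tempfile


def read_lines_binary(paths):
    """Return every item fileinput yields for the given files in binary mode."""
    with fileinput.input(files=paths, mode="rb") as stream:
        return list(stream)


# Hypothetical sample files, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, payload in (("f1", b"ab\ncd\n"), ("f2", b"ef")):
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    lines = read_lines_binary(paths)

# Each element is one bytes "line"; the last one has no trailing newline.
print(lines)
```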

It's not hard to just get a byte at a time using existing parts:

    def bytefileinput():
        return (bytes((b,)) for line in fileinput.input() for b in line)

There are ways to do similar things without using fileinput at all. But it really depends on your use case.

Giving fileinput a read() method isn't a bad idea, assuming some reasonable behavior is defined for the various line-oriented methods, but making it iterate binary mode input byte by byte would be a breaking change of limited utility in my view.
msg214741 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-03-24 21:48
That example should have included mode="rb" when using fileinput.input(); oops. Pretend I didn't forget it.
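
With that correction applied, the one-liner might look like the sketch below. The `files` parameter is an addition not present in the original snippet (passing `None` falls back to `sys.argv[1:]` or stdin, as `fileinput.input()` does by default), and the generator is written out as loops for readability.

```python
import fileinput


def bytefileinput(files=None):
    """Yield each byte of the input files as a length-1 bytes object."""
    # mode="rb" makes every line a bytes object. Iterating over bytes yields
    # ints, so bytes((b,)) wraps each int back into a single-byte bytes object.
    with fileinput.input(files=files, mode="rb") as stream:
        for line in stream:
            for b in line:
                yield bytes((b,))
```

Note that this still reads a full line into memory before splitting it into bytes.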
msg214752 - (view) Author: Tommy Carstensen (Tommy.Carstensen) Date: 2014-03-24 22:32
I read the fileinput code and realized how heavily tied it is to line input.

Will reading individual bytes as suggested not be very memory intensive, if each line is billions of characters?

    def bytefileinput():
        return (bytes((b,)) for line in fileinput.input() for b in line)

I posted my workaround on stackoverflow (see link earlier in thread), which does not make use of the fileinput module at all. After having read through the fileinput code I agree that the module should only support reading lines and this enhancement request should be closed.
msg214758 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-03-24 23:18
On memory: Yeah, it could be if the file didn't include any newline characters. Same problem could apply if a text input file relied on word wrap in an editor and included very few or no newlines itself.

There are non-fileinput ways of doing this, like I said; if you want consistent performance, you'd probably use one of them. For example, using the two arg form of iter:

    from functools import partial

    def bytefileinput(files):
        for file in files:
            with open(filename, "rb") as f:
                yield from iter(partial(f.read, 1), b'')

Still kind of slow, but predictable on memory usage and not too complex.
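
One possible way to reduce the per-byte overhead while keeping memory bounded is to read fixed-size chunks and slice single bytes out of each chunk. This is a sketch under assumptions, not something proposed in the thread: the function name `iter_bytes_chunked` and the default `chunk_size` of 64 KiB are arbitrary choices for illustration.

```python
def iter_bytes_chunked(files, chunk_size=64 * 1024):
    """Yield length-1 bytes objects, reading at most chunk_size bytes per call."""
    for file in files:
        with open(file, "rb") as f:
            # Two-argument iter() keeps calling f.read(chunk_size) until it
            # returns b"", which signals end of file.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                for i in range(len(chunk)):
                    # Slicing bytes gives bytes; indexing would give an int.
                    yield chunk[i:i + 1]
```

Memory use is bounded by `chunk_size` regardless of whether the file contains newlines.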
msg214759 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-03-24 23:18
And of course, missed another typo. open's first arg should be file, not filename.
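
For reference, a sketch of the generator with that typo corrected. It relies on the two-argument form of `iter()`, which calls `f.read(1)` repeatedly until the sentinel `b""` (end of file) is returned.

```python
from functools import partial


def bytefileinput(files):
    """Yield every byte of every file as a length-1 bytes object."""
    for file in files:
        with open(file, "rb") as f:
            # partial(f.read, 1) is a zero-argument callable reading one byte;
            # iteration stops when it returns b"" at end of file.
            yield from iter(partial(f.read, 1), b"")
```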
History
Date                 User              Action  Args
2014-03-24 23:18:59  josh.r            set     messages: + msg214759
2014-03-24 23:18:06  josh.r            set     messages: + msg214758
2014-03-24 22:32:04  Tommy.Carstensen  set     status: open -> closed
                                               resolution: rejected
                                               messages: + msg214752
2014-03-24 21:48:56  josh.r            set     messages: + msg214741
2014-03-24 21:44:39  josh.r            set     nosy: + josh.r
                                               messages: + msg214739
2014-03-24 15:06:03  berker.peksag     set     stage: needs patch
                                               versions: + Python 3.5, - Python 3.3, Python 3.4
2014-03-20 09:39:18  Tommy.Carstensen  create