classification
Title: Efficient reverse line iterator
Type: performance Stage: needs patch
Components: Library (Lib) Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, benjamin.peterson, christian.heimes, eric.araujo, flox, giampaolo.rodola, jcon, jemerton, mark_t_russell, pitrou, rhettinger
Priority: normal Keywords: patch

Created on 2007-03-10 13:47 by mark_t_russell, last changed 2014-04-15 20:32 by jemerton.

Files
File name Uploaded Description Edit
reverse_file_iterator.diff mark_t_russell, 2007-03-10 13:47 Reverse file iterator implementation review
reverse-file-iterator-20071209.diff mark_t_russell, 2007-12-09 20:36 review
reverse-file-iterator-20071210.diff mark_t_russell, 2007-12-10 20:56 review
readprevline-20140415.diff jemerton, 2014-04-15 20:32 Implementation of BufferedReader.readprevline() review
Messages (12)
msg52152 - (view) Author: Mark Russell (mark_t_russell) Date: 2007-03-10 13:47
This is an implementation of __reversed__ for the TextIOWrapper type from the new IO interface (see http://docs.google.com/Doc?id=dfksfvqd_1cn5g5m).  It is used as:

    import io
    for line in reversed(io.open(filename)):
          ...

It is efficient (only reads a block at a time) but can handle arbitrary length lines.  It is useful for scanning backwards through big log files, but my main reason for submitting it is as a demonstration of the usefulness of the new IO layers - it works by putting a new buffering layer round the RawIOBase object from the open() call.

It's just a proof of concept, but if there's interest in this I'm happy to write unit tests and documentation.

The patch also makes io.BufferedReader support buffering, and adds a very minimal implementation of io.TextIOBase and io.TextIOWrapper (needed to make io.open() work).

msg52153 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-04-11 17:40
FWIW, I've wanted something like this for a long time (for scanning log files in reverse).
msg57266 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-11-08 17:39
Some people like the feature, Guido isn't against the feature ... It
looks as you have a good chance to get it into Python 3.0. :)

Can you come up with a new patch and unit tests? The io module has
changed a lot since your initial patch.
msg57269 - (view) Author: Mark Russell (mark_t_russell) Date: 2007-11-08 18:04
Sure - I'll do an updated patch at the weekend.
msg58326 - (view) Author: Mark Russell (mark_t_russell) Date: 2007-12-09 20:36
Here's an updated version of the patch.  Changes:

    - Updated to work against current py3k branch (r59441)
    - Added support for universal newlines
    - Added unit tests
    - Added docs

The patch includes documentation for reversed() and __reversed__() (in the 
library and reference manuals respectively) which are independent of the 
reverse lines iterator - I can split those out to separate patch if needed.

I also updated the expected output from test_profile and test_cProfile, 
although I think a better fix would be to filter out the stdlib-related stuff 
from the expected output, as currently these tests break whenever io.py is 
changed.
msg58361 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-10 19:38
I'd like to see the doc patches separated out and applied to 2.6 --
they'll automatically merge into 3.0 then. Make that a separate bug please.

I like the idea, haven't had time to carefully review the code, but
noticed one oddity:

>>> for line in reversed(open("/etc/passwd")): print(line, end="")

immediately raises

ValueError: I/O operation on closed file
msg58370 - (view) Author: Mark Russell (mark_t_russell) Date: 2007-12-10 20:56
As Guido requested I've split off the generic reversed() and __reversed__()
doc additions to this patch against 2.6: http://bugs.python.org/issue1582

The I/O error from reversed(open("/etc/passwd")) was caused by the inner
TextIOWrapper calling close() (via the inherited IOBase.__del__() method).
I've fixed it by having TextIOReverseIterator keep a reference to the file
object, and added a test case for the bug.

I think it's at least questionable that TextIOWrapper.close() is calling 
buffer.close() on a buffer that it did not create.  I assumed that keeping
a reference to the buffer object would be enough to keep the buffer open,
and I suspect this is likely to trip up others in future.  I think
TextIOWrapper.close() should probably just set a flag (for the use of its 
own closed() method) and rely on reference counting to call close() 
on the buffer object.  If that sounds on the right lines I'm happy to think
about it a bit more and submit a patch.
msg110532 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-17 00:23
The OP has done everything asked of him.  There are a lot of positive comments about this request.  Snag is the patch is in python, I understand that io is now written in C.  Could we at this late stage get this into 3.2, or even a minor release of 3.2, or will it have to be deferred until 3.3?
msg110533 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-17 01:18
> The OP has done everything asked of him.

Not quite.  He split out the documentation part of his patch and it was accepted and committed.

Guido raised an issue with the code.  OP raised more questions.  The code was never updated.

Now the patch is out of date because io module has been reimplemented in C.  The patch is still good as a prototype/proof of concept but needs to be updated to apply to _pyio.

The idea is good, but the implementation is not ready.
msg111041 - (view) Author: Mark Russell (mark_t_russell) Date: 2010-07-21 12:24
I'll do a C version of the patch (hopefully in the next week or so).
msg115651 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-05 13:58
Suggestions:

- do it on BufferedReader, rather than TextIOWrapper: if you want full-speed scanning of log files, you probably want to open them in binary mode

- rather than implementing a full-blown iterator, you can start with simple primitives: e.g. a method called readprevline() (or even scan_back() if you want to make the stop character(s) configurable)
msg216381 - (view) Author: James Emerton (jemerton) Date: 2014-04-15 20:32
Attached is an implementation of BufferedReader.readprevline(), as suggested by Antoine.

At this point, it seems to be working but I would like to improve the tests when a single result spans multiple chunks. I would particularly like to ensure correct behaviour when a newline ends up as the last byte of a new chunk.
History
Date User Action Args
2014-04-15 20:32:03jemertonsetfiles: + readprevline-20140415.diff
nosy: + jemerton
messages: + msg216381

2014-02-03 19:15:07BreamoreBoysetnosy: - BreamoreBoy
2011-11-12 15:34:19eric.araujosetnosy: + eric.araujo
2011-11-08 19:02:45floxsetnosy: + flox

versions: + Python 3.3, - Python 3.2
2011-11-08 18:33:44giampaolo.rodolasetnosy: + giampaolo.rodola
2011-05-06 02:52:14jconsetnosy: + jcon
2010-09-05 15:33:55gvanrossumsetnosy: - gvanrossum
2010-09-05 13:58:08pitrousetmessages: + msg115651
2010-07-21 12:24:37mark_t_russellsetmessages: + msg111041
2010-07-17 01:18:25belopolskysetversions: + Python 3.2, - Python 3.1
nosy: + belopolsky

messages: + msg110533

stage: patch review -> needs patch
2010-07-17 00:23:07BreamoreBoysetnosy: + BreamoreBoy
messages: + msg110532
2009-04-26 01:12:05ajaksu2setnosy: + pitrou, benjamin.peterson
versions: + Python 3.1, - Python 3.0

type: performance
stage: patch review
2008-01-06 22:29:46adminsetkeywords: - py3k
versions: Python 3.0
2007-12-10 20:56:44mark_t_russellsetfiles: + reverse-file-iterator-20071210.diff
messages: + msg58370
2007-12-10 19:38:34gvanrossumsetnosy: + gvanrossum
messages: + msg58361
2007-12-09 20:36:12mark_t_russellsetfiles: + reverse-file-iterator-20071209.diff
messages: + msg58326
2007-11-08 18:04:48mark_t_russellsetmessages: + msg57269
2007-11-08 17:39:44christian.heimessetnosy: + christian.heimes
messages: + msg57266
2007-08-30 00:22:29gvanrossumsettitle: Efficient reverse line iterator -> Efficient reverse line iterator
versions: + Python 3.0
2007-03-10 13:47:37mark_t_russellcreate