Issue1152248
Created on 2005-02-26 07:24 by ncoghlan, last changed 2012-08-20 05:46 by ncoghlan.
| Messages (31) | |||
|---|---|---|---|
| msg61179 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2005-02-26 07:24 | |
There is no canonical way to iterate through a file in chunks *other* than whole lines without reading the whole file into memory. Allowing the separator to be specified as an argument to file.readlines and file.xreadlines would greatly simplify the task. See here for an example interface of the useful options: http://mail.python.org/pipermail/python-list/2005-February/268482.html |
|||
| msg61180 - (view) | Author: Georg Brandl (georg.brandl) |
Date: 2005-02-26 07:38 | |
I don't know whether (x)readlines is the right place, since you are _not_ operating on lines. What about (x)readchunks? |
|||
| msg61181 - (view) | Author: Douglas Alan (nessus42) | Date: 2005-02-28 18:57 | |
In reply to birkenfeld, I'm not sure why you don't want to call lines separated with an alternate line-separation string "lines", but if you want to call them something else, I should think they should be called "records" rather than "chunks". |>oug |
|||
| msg61182 - (view) | Author: Raymond Hettinger (rhettinger) |
Date: 2005-06-27 04:25 | |
The OP's request is not a non-starter. There is a proven precedent in AWK, which allows programmer-specifiable record separators. |
|||
| msg61183 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2005-06-27 09:29 | |
As Douglas Alan's sample implementation (and his second attempt [1]) show, getting this right (and reasonably efficient) is actually a non-trivial exercise. Leveraging the existing readlines infrastructure is an idea worth considering. [1] http://mail.python.org/pipermail/python-list/2005-February/268547.html |
|||
| msg61184 - (view) | Author: Skip Montanaro (skip.montanaro) |
Date: 2005-06-27 11:22 | |
Seems the most likely place you'd want to use this is to select a non-native line ending in a situation where you didn't want to use universal newlines (select \r as a line ending on Unix, for example, and allow \n to just be another character). In that case they'd clearly still be lines, so embellishing the normal line reading machinery without adding a new method would be most appropriate. |
|||
| msg63060 - (view) | Author: Facundo Batista (facundobatista) |
Date: 2008-02-27 02:41 | |
Raymond disapproved it, Skip discouraged it, and Nick didn't push it any more, all more than two years ago. Nick, please, if you feel this is worthwhile, raise the discussion in python-dev. |
|||
| msg63067 - (view) | Author: Raymond Hettinger (rhettinger) |
Date: 2008-02-27 08:48 | |
For the record, I thought it was a reasonable request. AWK has a similar feature. The AWK book shows a number of example uses. Google's codesearch shows that the feature does get used in the field: http://www.google.com/codesearch?q=lang%3Aawk+RS&hl=en I think this request should probably be kept open. |
|||
| msg63068 - (view) | Author: Facundo Batista (facundobatista) |
Date: 2008-02-27 11:08 | |
Sorry, I misunderstood you. I assign this to myself to give it a try. |
|||
| msg63134 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2008-02-29 11:58 | |
The mail.python.org link I posted previously is broken. Here's an updated link to the relevant c.l.p. thread: http://mail.python.org/pipermail/python-list/2005-February/310020.html From my point of view, I still think it's an excellent idea and would be happy to review a patch, but I'm unlikely to get around to implementing it myself. Also keep in mind that we now have the option of doing this only for the new io module in Python 3.0 - it may be easier to do that and implement something in pure Python rather than having to deal with the 2.x file implementation. (P.S. I found the double negative in Raymond's original comment a little tricky to parse even as a native English speaker. I would also take Skip's comment as merely discouraging adding a completely new method rather than the original idea) |
|||
| msg64084 - (view) | Author: Facundo Batista (facundobatista) |
Date: 2008-03-19 18:52 | |
I took a look at it... It's not as uncomplicated as I originally thought. The way would be to adapt the Py_UniversalNewlineFread() function to *not* process the normal separators (like \n or \r), but the passed one. A critical point would be to handle more-than-1-byte characters... I concur with Nick that this would be better suited for Py3k. So, I'm stepping down from this, and flagging it for that version. |
|||
| msg82188 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2009-02-15 23:58 | |
Any further work on this should wait until the io-in-c branch has landed (or at least be based on that branch). |
|||
| msg87801 - (view) | Author: R. David Murray (r.david.murray) |
Date: 2009-05-15 09:18 | |
> cat temp
this is$#a weird$#file$#
> ./python
Python 3.1b1+ (py3k:72632:72633M, May 15 2009, 05:11:27)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('temp', newline='$#')
[50354 refs]
>>> f.readlines()
['this is$#', 'a weird$#', 'file$#', '\n']
All I did was comment out the 'newline' argument validity check in textio.c.
|
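For comparison, an unpatched Python 3 interpreter rejects such a value at open time, which is the validity check that was commented out for the experiment above. The following is a minimal sketch of that behaviour; the temporary file and its contents are made up for illustration, and the exact wording of the error message is not relied upon.

```python
import os
import tempfile

# Write a small file that uses "$#" as its record separator (illustrative data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("this is$#a weird$#file$#\n")
    path = tmp.name

try:
    # Unpatched interpreters only accept None, "", "\n", "\r" and "\r\n" here.
    open(path, newline="$#")
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
else:
    rejected = False
finally:
    os.remove(path)
```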
|||
| msg87802 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2009-05-15 10:17 | |
While RDM's quick test is encouraging, I think one of the key things is going to be developing tests for the various cases:
- binary mode, single byte line ending
- binary mode, multi-byte line ending
- text mode, single byte single char line ending*
- text mode, multi-byte single char line ending
- text mode, multiple char line ending
The text mode tests would need to cover a variety of encodings (e.g. ASCII, latin-1, UTF-8, UTF-16, UTF-32 and maybe something like koi8-r and/or some of the CJK codecs).
*if applicable to codec under test
|
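One reason the encoding matters for these cases is that the same separator occupies a different number of bytes depending on the codec. The sketch below is only an illustration of that point, not a proposed test suite; the helper name `encoded_separator_lengths` and the choice of separators are assumptions made for the example.

```python
# Hypothetical helper, for illustration only.
def encoded_separator_lengths(separator, encodings):
    """Map each encoding name to the byte length of the encoded separator."""
    return {name: len(separator.encode(name)) for name in encodings}

# A two-character ASCII separator, as in the "$#" experiment.
two_char = encoded_separator_lengths(
    "$#", ["ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le", "koi8-r"]
)

# A single non-ASCII character (U+00A7) that is one byte in latin-1
# but two bytes in UTF-8 and UTF-16.
one_char = encoded_separator_lengths("\u00a7", ["latin-1", "utf-8", "utf-16-le"])

for label, table in (("'$#'", two_char), ("U+00A7", one_char)):
    for name, size in table.items():
        print(f"{label:8} {name:10} {size} bytes")
```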
|||
| msg87803 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2009-05-15 11:13 | |
-1 on this idea. readlines() exists precisely because line endings are special when it comes to text IO (because of the various platform differences). If you want to split on any character, you can just use read() followed by split(). No need to graft additional complexity on the file IO classes. |
|||
| msg87805 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2009-05-15 11:24 | |
And it's certainly not easy to do correctly :) |
|||
| msg87806 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2009-05-15 11:25 | |
Uh, trying again to remove the keyword :-( |
|||
| msg87807 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2009-05-15 11:34 | |
Ok, let me qualify my position a bit:
- -1 for abusing the newline parameter
- -1 for abusing readlines()
- +0 on an additional method ("readchunks" was suggested) which does the
splitting, either on a single character or on a string
Please bear in mind the latter should involve, for each of the C and
Python implementations:
- a generic unoptimized version for BufferedIOBase
- a generic unoptimized version for TextIOBase
- an optimized version for BufferedReader/BufferedRandom
- an optimized version for TextIOWrapper
However, it is certainly an interesting task for someone wanting to play
with C code, optimizations, etc.
|
|||
| msg87808 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2009-05-15 11:46 | |
I agree with Antoine - given that the newline parameter now deals with Skip's alternate line separator use case, a new method "readrecords" that takes a mandatory record separator makes more sense than using readlines to read things that are not lines. (Of course, taking the alternate line ending use case away also reduces the total number of use cases for the new method.)
Note that the problem with the read()+split() approach is that you either have to read the whole file into memory (which this RFE is trying to avoid) or you have to do your own buffering and so forth to split records as you go. Since the latter is both difficult to get right and very similar to what the IO module already has to do for readlines(), it makes sense to include the extra complexity there.
|
|||
| msg87817 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2009-05-15 13:07 | |
> Note that the problem with the read()+split() approach is that you
> either have to read the whole file into memory (which this RFE is trying
> to avoid) or you have to do your own buffering and so forth to split
> records as you go. Since the latter is both difficult to get right and
> very similar to what the IO module already has to do for readlines(), it
> makes sense to include the extra complexity there.
I wonder how often this use case happens though. Usually you first split on lines, and only then you split on another character or string (think CSV files, HTTP headers, etc.). When you don't split on lines, conversely, you probably have a binary format, and binary formats have more efficient ways of chunking (for example, a couple of bytes at the beginning indicating the length of the chunk).
|
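The length-prefixed chunking mentioned for binary formats can be sketched as follows. This is an illustration under stated assumptions, not a format used by any particular tool: the 4-byte big-endian header and the names `write_record` and `iter_length_prefixed` are invented for the example, and the stream is assumed to be buffered so that `read(n)` returns fewer than `n` bytes only at end of file.

```python
import io
import struct

# Assumed layout: 4-byte big-endian unsigned length, then the payload.
HEADER = struct.Struct(">I")

def write_record(stream, payload):
    """Write one record as a length header followed by the payload bytes."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)

def iter_length_prefixed(stream):
    """Yield payloads from a stream of length-prefixed records."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise EOFError("truncated length header")
        (length,) = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record payload")
        yield payload

buffer = io.BytesIO()
for item in (b"alpha", b"", b"line one\nline two"):
    write_record(buffer, item)
buffer.seek(0)
records = list(iter_length_prefixed(buffer))
print(records)
```

No separator search is needed here, and a payload may contain any byte value, including newlines.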
|||
| msg87823 - (view) | Author: Douglas Alan (nessus42) | Date: 2009-05-15 17:46 | |
Antoine Pitrou <report@bugs.python.org> wrote:
> Nick Coghlan <ncoghlan@gmail.com> added the comment:
> > Note that the problem with the read()+split() approach is that you
> > either have to read the whole file into memory (which this RFE is trying
> > to avoid) or you have to do your own buffering and so forth to split
> > records as you go. Since the latter is both difficult to get right and
> > very similar to what the IO module already has to do for readlines(), it
> > makes sense to include the extra complexity there.
> I wonder how often this use case happens though.
Every day for me. The reason that I originally brought up this request some years back on comp.lang.python was that I wanted to be able to use Python easily like I use the xargs program. E.g.,
find -type f -regex 'myFancyRegex' -print0 | stuff-to-do-on-each-file.py
With "-print0" the line separator is changed to null, so that you can deal with filenames that have newlines in them. ("find" and "xargs" traditionally have used newline to separate files, but that fails in the face of filenames that have newlines in them, so the -print0 argument to find and the "-0" argument to xargs were thankfully eventually added as a fix for this issue. Nulls are not allowed in filenames. At least not on Unix.)
> When you don't split on lines, conversely, you probably have a binary
> format,
That's not true for the daily use case I just mentioned.
|>ouglas
P.S. I wrote my own version of readlines, of course, as the archives of comp.lang.python will show. I just don't feel that everyone should be required to do the same, when this is the sort of thing that sysadmins and other Unix-savvy folks are wont to do on a daily basis.
P.P.S. Another use case is that I often end up with files that have been transferred back and forth between Unix and Windows and god-knows-what-else, and the newlines end up being some weird mixture of carriage returns and line feeds (and sometimes some other stray characters such as "=20" or somesuch) that many programs seem to have a hard time recognizing as newlines.
|
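The `find -print0` use case can be approximated today with a small generator over a binary stream. The sketch below is one possible way to do it, not an existing standard library facility; `iter_records` is a hypothetical name, and the simulated input stands in for what a real script would read from `sys.stdin.buffer`.

```python
import io

def iter_records(stream, separator=b"\0", chunk_size=8192):
    """Yield separator-terminated records from a binary stream, without the separator."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Keep the unterminated tail so a record split across reads is reassembled.
        pending += chunk
        *complete, pending = pending.split(separator)
        yield from complete
    if pending:
        # Final record that was not terminated by the separator.
        yield pending

# Simulated output of `find -print0`; one name contains a newline.
fake_find_output = io.BytesIO(b"a.txt\0dir/b\nc.txt\0last.txt\0")
names = list(iter_records(fake_find_output, chunk_size=4))
print(names)
```

The deliberately tiny `chunk_size` in the demonstration forces records to straddle read boundaries; only one chunk plus the current partial record is held in memory at a time.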
|||
| msg109038 - (view) | Author: Ralph Corderoy (ralph.corderoy) | Date: 2010-07-01 10:05 | |
Google has led me here because I'm trying to see how to process find(1)'s -print0 output with Python. Perl's -0 option and $/ variable make this trivial.
find -name '*.orig' -print0 | perl -n0e unlink
awk(1) has its RS, record separator, variable too. There's a clear need, and it should also be possible to modify or re-open sys.stdin to change the existing separator.
|
|||
| msg109098 - (view) | Author: Éric Araujo (eric.araujo) |
Date: 2010-07-02 10:41 | |
Ralph, core developers have not rejected this idea. It needs a patch now (even rough) to get the discussion further. |
|||
| msg109117 - (view) | Author: Douglas Alan (Douglas.Alan) | Date: 2010-07-02 17:31 | |
Until this feature gets built into Python, you can use a Python-coded generator such as this one to accomplish the same effect:
def fileLineIter(inputFile,
                 inputNewline="\n",
                 outputNewline=None,
                 readSize=8192):
    r"""Like the normal file iter but you can set what string indicates newline.

    The newline string can be arbitrarily long; it need not be restricted to a
    single character. You can also set the read size and control whether or not
    the newline string is left on the end of the iterated lines. Setting
    newline to '\0' is particularly good for use with an input file created with
    something like "os.popen('find -print0')".
    """
    if outputNewline is None:
        outputNewline = inputNewline
    partialLine = ''
    while True:
        charsJustRead = inputFile.read(readSize)
        if not charsJustRead:
            break
        # Keep the unterminated tail so a separator split across reads is still found.
        partialLine += charsJustRead
        lines = partialLine.split(inputNewline)
        partialLine = lines.pop()
        for line in lines:
            yield line + outputNewline
    # Whatever is left had no terminating separator.
    if partialLine:
        yield partialLine
|
|||
| msg111152 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2010-07-22 06:44 | |
This fileLineIter function looks like a good recipe to me. Can we close the issue then? |
|||
| msg111168 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2010-07-22 11:42 | |
A recipe in the comments on a tracker item isn't enough reason to close the RFE, no. An entry on the cookbook with a pointer from the docs might be sufficient, although I'm still not averse to the idea of an actual readrecords method (with appropriate tests). |
|||
| msg111177 - (view) | Author: ysj.ray (ysj.ray) | Date: 2010-07-22 14:32 | |
I think it's a good idea to add a keyword argument to specify the separator of readlines(). I believe most people can accept the universal meaning of "line", which has a similar meaning to "record", that is, a chunk of data, maybe from using line separators other than '\n' in Perl, or awk, or the find command. Maybe doing this doesn't pollute the meaning of "readlines". Splitting the file contents with a special character is really a common usage. Besides, I feel using a line separator other than '\n' doesn't mean we're dealing with a binary format; in fact, I often deal with text formats with the record separator '\t'. |
|||
| msg111189 - (view) | Author: Douglas Alan (Douglas.Alan) | Date: 2010-07-22 16:33 | |
Personally, I think that this functionality should be built into Python's readlines. That's where a typical person would expect it to be, and this is something that is supported by most other scripting languages I've used. E.g., awk has the RS variable which lets you set the "input record separator", which defaults to newline. And as I previously pointed out, xargs and find provide the option to use null as their line separator. |
|||
| msg111202 - (view) | Author: Antoine Pitrou (pitrou) |
Date: 2010-07-22 17:54 | |
> Personally, I think that this functionality should be built into
> Python's readlines. That's where a typical person would expect it to
> be, and this is something that is supported by most other scripting
> language I've used.
Adding it to readline() and/or readlines() would modify the standard IO Abstract Base Classes, and would therefore probably need discussion on python-dev.
|
|||
| msg111220 - (view) | Author: Nick Coghlan (ncoghlan) |
Date: 2010-07-22 22:15 | |
On Fri, Jul 23, 2010 at 3:54 AM, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Personally, I think that this functionality should be built into
>> Python's readlines. That's where a typical person would expect it to
>> be, and this is something that is supported by most other scripting
>> language I've used.
>
> Adding it to readline() and/or readlines() would modify the standard IO
> Abstract Base Classes, and would therefore probably need discussion on
> python-dev.
That's also the reason why I'm suggesting a separate readrecords() method - the appropriate ABC should be able to implement it as a concrete method based on something like the recipe above.
|
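To make the idea of a concrete readrecords() method more tangible, here is a rough sketch written as a mixin that relies only on `read(size)`. It is an assumption about what such a method could look like, not an API that exists in the io module; the class names and the `chunk_size` parameter are invented for the example.

```python
import io

class ReadRecordsMixin:
    """Hypothetical mixin: adds readrecords() to any class that provides read(size)."""

    def readrecords(self, separator, chunk_size=8192):
        if not separator:
            raise ValueError("separator must not be empty")
        pending = separator[:0]  # empty str or bytes, matching the separator type
        while True:
            chunk = self.read(chunk_size)
            if not chunk:
                break
            pending += chunk
            *complete, pending = pending.split(separator)
            yield from complete
        if pending:
            # Trailing data with no terminating separator.
            yield pending

class RecordStringIO(ReadRecordsMixin, io.StringIO):
    """Text stream with the extra method, for demonstration only."""

class RecordBytesIO(ReadRecordsMixin, io.BytesIO):
    """Binary stream with the same method."""

text_records = list(
    RecordStringIO("this is$#a weird$#file$#").readrecords("$#", chunk_size=5)
)
byte_records = list(RecordBytesIO(b"a\0b\0c").readrecords(b"\0"))
print(text_records, byte_records)
```

Because the method depends only on `read()`, the same unoptimized code serves both text and binary streams; optimized versions for the buffered classes would be a separate exercise.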
|||
| msg111453 - (view) | Author: Ralph Corderoy (ralph.corderoy) | Date: 2010-07-24 11:13 | |
fileLineIter() is not a solution that allows this bug to be closed, no.
readline() needs modifying and if that means python-dev discussion then
that's what it needs. Things to consider include changing the record
separator as the file is read.
$ printf 'a b c\nd e f ' |
> awk '{print "<" $0 ">"} NR == 1 {RS = " "}'
<a b c>
<d>
<e>
<f>
$
|
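Changing the separator while reading, as in the awk session above, requires the reader to keep its own buffer and to consult the current separator each time a record is requested. A minimal sketch of that behaviour follows; `RecordReader` and its attributes are hypothetical names, and only text streams are handled.

```python
import io

class RecordReader:
    """Hypothetical iterator whose separator can be reassigned between records."""

    def __init__(self, stream, separator="\n", chunk_size=8192):
        self.stream = stream
        self.separator = separator
        self.chunk_size = chunk_size
        self._pending = ""
        self._eof = False

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            # Search with whatever separator is current at this moment.
            head, found, tail = self._pending.partition(self.separator)
            if found:
                self._pending = tail
                return head
            if self._eof:
                if self._pending:
                    head, self._pending = self._pending, ""
                    return head
                raise StopIteration
            chunk = self.stream.read(self.chunk_size)
            if chunk:
                self._pending += chunk
            else:
                self._eof = True

reader = RecordReader(io.StringIO("a b c\nd e f "))
output = []
for number, record in enumerate(reader, start=1):
    output.append("<" + record + ">")
    if number == 1:
        reader.separator = " "  # like awk's  NR == 1 {RS = " "}
print("\n".join(output))
```

Data already buffered before the change is re-split with the new separator, which is what makes the second input line come out as three records.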
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2012-08-20 05:46:03 | ncoghlan | set | title: Enhance file.readlines by making line separator selectable -> Add support for reading records with arbitrary separators to the standard IO stack versions: + Python 3.4, - Python 3.2 |
| 2011-06-01 01:20:37 | jcon | set | nosy: + jcon |
| 2010-07-24 11:13:27 | ralph.corderoy | set | messages: + msg111453 |
| 2010-07-22 22:15:10 | ncoghlan | set | messages: + msg111220 |
| 2010-07-22 17:54:13 | pitrou | set | messages: + msg111202 |
| 2010-07-22 16:33:43 | Douglas.Alan | set | messages: + msg111189 |
| 2010-07-22 14:32:43 | ysj.ray | set | nosy: + ysj.ray messages: + msg111177 |
| 2010-07-22 11:42:53 | ncoghlan | set | status: pending -> open resolution: works for me -> messages: + msg111168 |
| 2010-07-22 06:44:24 | amaury.forgeotdarc | set | status: open -> pending nosy: + amaury.forgeotdarc messages: + msg111152 resolution: works for me |
| 2010-07-02 17:31:17 | Douglas.Alan | set | nosy: + Douglas.Alan messages: + msg109117 |
| 2010-07-02 10:41:06 | eric.araujo | set | nosy: georg.brandl, rhettinger, facundobatista, ncoghlan, pitrou, benjamin.peterson, nessus42, eric.araujo, ralph.corderoy, r.david.murray messages: + msg109098 components: + Library (Lib), - Interpreter Core |
| 2010-07-01 10:05:04 | ralph.corderoy | set | nosy: + ralph.corderoy messages: + msg109038 |
| 2010-04-13 19:59:57 | eric.araujo | set | nosy: + eric.araujo |
| 2009-05-15 17:46:23 | nessus42 | set | messages: + msg87823 |
| 2009-05-15 13:07:55 | pitrou | set | messages: + msg87817 |
| 2009-05-15 11:47:10 | ncoghlan | set | messages: - msg87809 |
| 2009-05-15 11:46:53 | ncoghlan | set | messages: + msg87809 |
| 2009-05-15 11:46:28 | ncoghlan | set | messages: + msg87808 |
| 2009-05-15 11:34:04 | pitrou | set | messages: + msg87807 |
| 2009-05-15 11:25:13 | pitrou | set | keywords: - easy messages: + msg87806 |
| 2009-05-15 11:24:26 | pitrou | set | messages: + msg87805 |
| 2009-05-15 11:13:00 | pitrou | set | messages: + msg87803 |
| 2009-05-15 10:17:51 | ncoghlan | set | messages: + msg87802 |
| 2009-05-15 09:18:37 | r.david.murray | set | keywords: + easy nosy: + r.david.murray messages: + msg87801 |
| 2009-05-15 02:53:53 | ajaksu2 | set | nosy: + benjamin.peterson, pitrou components: + IO versions: + Python 3.2, - Python 3.1 |
| 2009-02-16 06:15:00 | skip.montanaro | set | nosy: - montanaro.historic |
| 2009-02-15 23:58:45 | ncoghlan | set | messages: + msg82188 stage: test needed -> needs patch |
| 2009-02-15 23:49:49 | ajaksu2 | set | stage: test needed versions: + Python 3.1, - Python 3.0 |
| 2008-03-19 18:52:17 | facundobatista | set | assignee: facundobatista -> messages: + msg64084 versions: + Python 3.0 |
| 2008-02-29 11:58:52 | ncoghlan | set | messages: + msg63134 |
| 2008-02-27 11:08:10 | facundobatista | set | assignee: facundobatista messages: + msg63068 |
| 2008-02-27 08:48:48 | rhettinger | set | status: closed -> open resolution: rejected -> (no value) messages: + msg63067 |
| 2008-02-27 02:41:20 | facundobatista | set | status: open -> closed nosy: + facundobatista resolution: rejected messages: + msg63060 |
| 2005-02-26 07:24:20 | ncoghlan | create | |
