Message 83441 - Python tracker



Author	dlesco
Recipients	dlesco, facundobatista, lemburg
Date	2009-03-10.17:35:59
Message-id	<1236706563.47.0.656290158457.issue5445@psf.upfronthosting.co.za>

Content

OK, I think I see where I went wrong in my perceptions of the file 
protocol.  I thought that readlines() returned an iterator, not a 
list, but I see in the library reference manual on File Objects that 
it returns a list.  I think I got confused because there is no 
equivalent of __iter__ for writing to streams.  For input, I'm always 
using 'for line in file_object' (in other words, 
file_object.__iter__), so I had assumed that writelines was the mirror 
image of that, because I never use the readlines method.  Then, in my 
mind, readlines became the mirror image of writelines, which I had 
assumed took an iterator, so I assumed that readlines returned an 
iterator.  I wonder if this perception problem is common or not.
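
To make the distinction concrete, here is a minimal sketch, assuming Python 3 and an in-memory io.StringIO standing in for a real file (both are assumptions for illustration, not part of the original report). It shows that readlines() builds a whole list up front, while iterating the file object yields one line at a time and has no length.

```python
import io

# In-memory stand-in for a real file (an assumption for illustration).
buf = io.StringIO("a\tb\nc\td\n")

# readlines() reads everything and returns a list.
lines = buf.readlines()
readlines_is_list = isinstance(lines, list)

# Iterating the file object is lazy: one line per step, no len().
buf.seek(0)
line_iter = iter(buf)
first_line = next(line_iter)
iterator_has_len = hasattr(line_iter, "__len__")
```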

So, the StreamWriter interface matches the file protocol; readlines() 
and writelines() deal with lists.  There shouldn't be any change to 
it, because it follows the protocol.

Then, the example I wrote would be instead:

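# Split each input line into tab-separated fields, dropping the newline.
rows = (line.rstrip('\n').split('\t') for line in in_file)
# Keep only fields 0, 3 and 7 of each row.
projected = (keep_fields(row, 0, 3, 7) for row in rows)
# Keep rows whose third projected field is '1'.
filtered = (row for row in projected if row[2] == '1')
# Re-join the fields and restore the newline.
formatted = (u'\t'.join(row) + '\n' for row in filtered)
# Write each line as it is produced, instead of calling writelines().
write = out_file.write
for line in formatted:
    write(line)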
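
For anyone who wants to run the pipeline above, here is a self-contained sketch, assuming Python 3. The keep_fields body, the sample data, and the io.StringIO objects standing in for in_file and out_file are all hypothetical; the post itself does not show them.

```python
import io

def keep_fields(row, *indexes):
    # Hypothetical helper: the post does not show its definition.
    # Assumed here to return the fields at the given positions.
    return [row[i] for i in indexes]

# Made-up sample input with eight tab-separated fields per line.
in_file = io.StringIO(
    "a\tx\tx\tb\tx\tx\tx\t1\n"
    "c\tx\tx\td\tx\tx\tx\t0\n"
    "e\tx\tx\tf\tx\tx\tx\t1\n"
)
out_file = io.StringIO()

# Each stage is a generator, so nothing is read until the loop pulls lines.
rows = (line.rstrip("\n").split("\t") for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2] == "1")
formatted = ("\t".join(row) + "\n" for row in filtered)

for line in formatted:
    out_file.write(line)

result = out_file.getvalue()
```

Because every stage is lazy, only one row is held in memory at a time, which is the reason for looping over write() rather than building a list first.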

I think it's correct that the file object write C code only does 
1000-line chunks for sequences that have a defined length: if it has a 
defined length, then that implies that the data exists now, and can be 
concatenated and written now.  Something without a defined length may 
be a generator with items arriving later.
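
The reasoning above can be sketched in Python 3 as follows. This is only an illustration of the idea, not CPython's actual implementation: write_lines, CountingWriter, and the assumption that a sized input is a list or tuple are all hypothetical. Sized sequences are joined and written in chunks of 1000; anything without a length is written item by item as it arrives.

```python
import io

CHUNK = 1000  # chunk size mentioned in the post

def write_lines(out, lines):
    """Hypothetical helper illustrating the idea; not CPython's code."""
    try:
        total = len(lines)
    except TypeError:
        # No defined length (e.g. a generator): items may arrive later,
        # so write each one as soon as it is produced.
        for line in lines:
            out.write(line)
        return
    # Defined length (assumed list or tuple): the data exists now,
    # so it can be concatenated and written in chunks.
    for start in range(0, total, CHUNK):
        out.write("".join(lines[start:start + CHUNK]))

class CountingWriter(io.StringIO):
    """StringIO that counts write() calls, to make the chunking visible."""
    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, text):
        self.calls += 1
        return super().write(text)

sized_out = CountingWriter()
write_lines(sized_out, ["x\n"] * 2500)             # list: chunked writes

lazy_out = CountingWriter()
write_lines(lazy_out, ("x\n" for _ in range(5)))   # generator: one write per item
```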
