classification
Title: Add named tuple reader to CSV module
Type: feature request
Stage: patch review
Components: Library (Lib)
Versions: Python 3.1, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: barry
Nosy List: barry, jdwhitley, pitrou, rhettinger, rrenaud, skip.montanaro (6)
Priority:
Keywords: patch

Created on 2008-01-13 22:27 by rhettinger, last changed 2009-03-09 00:08 by jdwhitley.

Files
File name Uploaded Description
ntreader.diff rhettinger, 2008-01-13 22:27 Proof-of-concept patch
ntreader3.diff jdwhitley, 2009-02-09 09:24 namedtuple reader and writer.
ntreader4.diff jdwhitley, 2009-02-10 11:08 Includes revision for rename keyword argument
named_tuple_write_header2.patch rrenaud, 2009-02-26 07:59
ntreader4_py3_1.diff jdwhitley, 2009-03-08 04:34 Patch against python 3.1a1
ntreader6_py3.diff jdwhitley, 2009-03-09 00:06 updated documentation
ntreader6_py27.diff jdwhitley, 2009-03-09 00:07 updated documentation
Messages (30)
msg59866 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2008-01-13 22:27
Here's a proof-of-concept patch.  If approved, will change from
generator form to match the other readers and will add a test suite.

The idea corresponds to what is currently done by the dict reader but
returns a space and time efficient named tuple instead of a dict.  Field
order is preserved and named attribute access is supported.

A writer is not needed because named tuples can be fed into the
existing writer just like regular tuples.
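
For illustration, a minimal sketch of the idea in generator form. It is not the contents of ntreader.diff: the name namedtuple_reader and the typename parameter are assumptions made here. The first row supplies the field names, as with the dict reader, and the last lines show named tuples going through the ordinary csv.writer.

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(iterable, typename="Row", **kwds):
    """Yield one named tuple per data row (hypothetical helper)."""
    reader = csv.reader(iterable, **kwds)
    # Assumption: the first row holds valid Python identifiers.
    row_type = namedtuple(typename, next(reader))
    for values in reader:
        yield row_type._make(values)


rows = list(namedtuple_reader(["name,qty", "apple,3", "pear,5"]))

# Named tuples are tuples, so the existing writer accepts them unchanged.
out = io.StringIO()
csv.writer(out).writerows(rows)
```

Fields come back as strings, in file order, and are reachable as rows[0].name or rows[0][0].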
msg61523 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2008-01-22 19:25
Barry, any thoughts on this?
msg61532 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2008-01-22 20:12
I'd personally be kind of surprised if Barry had any thoughts on this.
Is there any reason this couldn't be pushed down into the C code and
replace the normal tuple output completely?  In the absence of any
fieldnames you could just dream some up, like "field001", "field002",
etc.

Skip
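
A small sketch of the "dream some up" suggestion, done in Python rather than in the C code. The helper name default_fieldnames is hypothetical; only the field001-style naming comes from the comment above.

```python
from collections import namedtuple


def default_fieldnames(count):
    """Return placeholder names field001, field002, ... (hypothetical)."""
    return ["field%03d" % i for i in range(1, count + 1)]


# With no header available, size the names from the first row.
first_row = ["48.128265", "11.610848"]
Row = namedtuple("Row", default_fieldnames(len(first_row)))
row = Row._make(first_row)
```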
msg81453 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-09 09:24
An implementation of a namedtuple reader and writer.

Created a writer for the case where the user would like to specify
desired field names and default values for missing field names.

e.g.
mywriter = NamedTupleWriter(f, fieldnames=['f1', 'f2', 'f3'], 
                            restval='missing')

Nt = namedtuple('LessFields', 'f1 f3')
nt = Nt(f1='one', f3=2)

mywriter.writerow(nt) # writes one,missing,2

Any thoughts on the case where a defined fieldname has a leading
underscore? Should there be a flag to silently ignore it?

e.g. 
if self._ignore_underscores:
   fieldname = fieldname.lstrip('_')

Leading underscores may be present in a csv file that has not been
inspected beforehand. Additionally, spaces and other non-alphanumeric
characters pose a problem that does not affect the DictReader class.

Cheers,
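
A rough sketch of how such a writer could fill in missing fields. This is not the code in ntreader3.diff; the class body below is an assumption, written only to reproduce the "one,missing,2" example with a tuple that has fields f1 and f3.

```python
import csv
import io
from collections import namedtuple


class NamedTupleWriter:
    """Hypothetical sketch: write named tuples in a fixed column order."""

    def __init__(self, f, fieldnames, restval="", **kwds):
        self.fieldnames = list(fieldnames)
        self.restval = restval
        self.writer = csv.writer(f, **kwds)

    def writerow(self, row):
        # Go through _asdict() so a field called "count" or "index"
        # cannot be confused with the tuple methods of the same name.
        values = row._asdict()
        return self.writer.writerow(
            [values.get(name, self.restval) for name in self.fieldnames]
        )


out = io.StringIO()
mywriter = NamedTupleWriter(out, fieldnames=["f1", "f2", "f3"],
                            restval="missing")
Nt = namedtuple("LessFields", "f1 f3")
mywriter.writerow(Nt(f1="one", f3=2))
```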
msg81464 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-02-09 16:53
Consider providing a hook to a function that converts non-conforming
field names (ones with a leading underscore, leading digit, non-letter,
keyword, or duplicate name).

class NamedTupleReader:
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", fieldnamer=None, *args, **kwds):
                 . . .

I'm going to either post a recipe to do the renaming or provide a static
method for the same purpose.   It might work like this:

  >>> renamer(['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc'])
  ['abc', 'x_def', 'x_1', 'x_hidden', 'x_abc', 'p', 'x1_abc']
msg81518 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-02-10 01:25
In r69480, named tuples gained the ability to automatically rename
invalid fieldnames.
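
For reference, this is how the rename option behaves in collections.namedtuple as released in 2.7 and 3.1. Invalid or duplicate names are replaced by an underscore followed by the field's position, which differs from the x_ scheme sketched in the previous message.

```python
from collections import namedtuple

names = ["abc", "def", "1", "_hidden", "abc", "p", "abc"]
Row = namedtuple("Row", names, rename=True)
print(Row._fields)
# ('abc', '_1', '_2', '_3', '_4', 'p', '_6')
```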
msg81537 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-10 11:08
Updated NamedTupleReader to give a rename=False keyword argument.
rename is passed directly to the namedtuple factory function to enable
automatic handling of invalid fieldnames.

Two new tests for the rename keyword.

Cheers,
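
A sketch of what passing rename through might look like. The function below is hypothetical and much simpler than the NamedTupleReader in the patch; it only shows the keyword being forwarded to the namedtuple factory.

```python
import csv
from collections import namedtuple


def namedtuple_reader(iterable, rename=False, **kwds):
    """Hypothetical reader that forwards rename to namedtuple()."""
    reader = csv.reader(iterable, **kwds)
    row_type = namedtuple("Row", next(reader), rename=rename)
    for values in reader:
        yield row_type._make(values)


# "class" is a keyword and "1st" is not an identifier.
data = ["class,1st,name", "econ,yes,Ann"]
rows = list(namedtuple_reader(data, rename=True))
```

With rename=False (the default), the same header raises ValueError from the namedtuple factory.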
msg82744 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 07:38
I am totally new to Python dev.  I reinvented a NamedTupleReader
tonight, only to find out that it was created a year ago.  My primary
motivation is that DictReader reads headers nicely, but DictWriter
totally sucks at handling them.

Consider doing some filtering on a csv file, like so.

sample_data = [
    'title,latitude,longitude',
    'OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848',
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Walgreens,28.339727,-81.596367',
    'Gurnigel Pass,46.731944,7.447778'
    ]

def filter_with_dict_reader_writer():
  accepted_rows = []
  for row in csv.DictReader(sample_data):
    if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
      accepted_rows.append(row)

  field_names = csv.reader(sample_data).next()
  output_writer = csv.DictWriter(open('accepted_by_dict.csv', 'w'),
                                 field_names)
  output_writer.writerow(dict(zip(field_names, field_names)))
  output_writer.writerows(accepted_rows)

You have to work so hard to maintain the headers when you write the file
with DictWriter.  I understand this is a limitation of dicts throwing
away the order information.  But namedtuples don't have that problem.

NamedTupleReader and NamedTupleWriter should be inverses.  This means
that NamedTupleWriter needs to write headers.  This should produce
identical output as the dict writer example, but it's much cleaner.

def filter_with_named_tuple_reader_writer():
   accepted_rows = []
   for row in csv.NamedTupleReader(sample_data):
     if float(row.latitude) > 0.0 and float(row.longitude) > 0.0:
       accepted_rows.append(row)

   output_writer = csv.NamedTupleWriter(
       open('accepted_by_named_tuple.csv', 'w'))
   output_writer.writerows(accepted_rows)

I patched on top of the existing NamedTupleWriter patch adding support
for writing headers.  I don't know if that's bad style/etiquette, etc.
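
For comparison, the same filter can be sketched with only the existing csv.reader, csv.writer and collections.namedtuple, written for Python 3. The name Place and the use of io.StringIO in place of a real output file are assumptions for the example.

```python
import csv
import io
from collections import namedtuple

sample_data = [
    "title,latitude,longitude",
    "OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848",
    "Kitchen Kaboodle,45.544241,-122.715728",
    "Walgreens,28.339727,-81.596367",
    "Gurnigel Pass,46.731944,7.447778",
]

reader = csv.reader(sample_data)
Place = namedtuple("Place", next(reader))  # header row supplies the fields
accepted_rows = [
    place
    for place in map(Place._make, reader)
    if float(place.latitude) > 0.0 and float(place.longitude) > 0.0
]

out = io.StringIO()  # stands in for the output file
writer = csv.writer(out)
writer.writerow(Place._fields)  # the header, in the original column order
writer.writerows(accepted_rows)
```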
msg82745 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 07:59
My previous patch could write the header twice.  But I am not sure
about how the writer should handle the fieldnames parameter on one hand,
and the namedtuple._fields on the other.
msg82746 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-02-26 08:01
The two latest patches (ntreader4.diff and
named_tuple_write_header.patch) seem like they are going in the right
direction and are getting close.

Barry or Skip, is this something you want in your module?
msg82764 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-26 15:44
Raymond> Barry or Skip, is this something you want in your module?

Sorry, I haven't really looked at this ticket other than to notice its
presence.  I wrote the DictReader/DictWriter functions way back when, so I'm
pretty comfortable using them.  I haven't felt the need for any other reader
or writer which manipulates file headers.

Skip
msg82765 - (view) Author: Barry A. Warsaw (barry) * Date: 2009-02-26 15:47
I think it would be useful to have.
msg82770 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-26 19:02
Hrm... I replied twice by email.  Only one comment appears to have
survived the long trip.  Here's my second reply:


    Rob> NamedTupleReader and NamedTupleWriter should be inverses.  This
    Rob> means that NamedTupleWriter needs to write headers.  This should
    Rob> produce identical output as the dict writer example, but it's much
    Rob> cleaner.

You're assuming that one instance of these classes will read or write an
entire file.  What if you want to append lines to an existing CSV file or
pick up reading a file with a new reader which has already been partially
processed?
msg82771 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-26 19:04
Let me be more explicit.  I don't know how it implements it, but I think
you really need to give the user the option of specifying the field
names and not reading/writing headers.  It can't be implicit as I
interpreted Rob's earlier comment:

    > NamedTupleReader and NamedTupleWriter should be inverses.
    > This means that NamedTupleWriter needs to write headers.

Skip
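
The two cases can be sketched with the existing API. io.StringIO stands in for a real file here, and Point is a made-up record type; in both cases the field names are supplied by the caller and no header is read or written.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple("Point", "x y")

# Stand-in for a file that already has a header and one data row.
existing = io.StringIO("x,y\r\n1,2\r\n")
existing.seek(0, io.SEEK_END)  # position at the end, as mode "a" would

# Appending: the field names are known and no header must be written.
csv.writer(existing).writerow(Point(3, 4))

# Resuming a read: skip what was already processed, supply the names.
existing.seek(0)
reader = csv.reader(existing)
next(reader)  # header, already handled elsewhere
next(reader)  # first data row, already processed
remaining = [Point._make(row) for row in reader]
```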
msg82778 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-26 21:02
Skip> Let me be more explicit.  I don't know how it implements it, but I
Skip> think you really need to give the user the option of specifying the
Skip> field names and not reading/writing headers.  It can't be implicit as I
Skip> interpreted Rob's earlier comment:

    rrenaud> NamedTupleReader and NamedTupleWriter should be inverses.
    rrenaud> This means that NamedTupleWriter needs to write headers.

I agree with Skip, we mustn't have a 'wroteheader' flag internal to the 
NamedTupleWriter.

Currently to write a 'header' row with a csv.writer you could (for 
example) pass a tuple of header names to writerow. NamedTupleWriter
is no different, you would have a namedtuple of header names instead of
a tuple of header names.

I would not like to see another flag added to the initialisation process
to enable the writing of a header row as the 'first' (or any) row 
written to a file.  We could add a function 'writeheader' that would
write the contents of 'fieldnames' as a row, but I don't like the idea.

Cheers,
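
To make the alternative concrete, here is a sketch of an explicit writeheader with no internal 'wroteheader' state. The class below is hypothetical and is not taken from any of the attached patches.

```python
import csv
import io
from collections import namedtuple


class NamedTupleWriter:
    """Hypothetical writer whose header is written only on request."""

    def __init__(self, f, fieldnames, **kwds):
        self.fieldnames = tuple(fieldnames)
        self.writer = csv.writer(f, **kwds)

    def writeheader(self):
        # No state is kept: calling this twice writes the header twice.
        return self.writer.writerow(self.fieldnames)

    def writerow(self, row):
        return self.writer.writerow(row)


Point = namedtuple("Point", "x y")
out = io.StringIO()
writer = NamedTupleWriter(out, Point._fields)
writer.writeheader()
writer.writerow(Point(1, 2))
```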
msg82780 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 22:18
I want to make sure I understand.  Am I correct in believing that Skip
thinks writing headers should be optional, while Jervis believes we
should leave the burden to the NamedTupleWriter client?  

I agree that we should not unconditionally write headers, but I think
that we should write headers by default, much like we read them by default.

I believe the implicit header writing is very elegant, and the only
reason that the DictWriter object doesn't write headers is the impedance
mismatch between dicts and CSV.  Namedtuples have the field order
information, so the impedance mismatch is gone and we should no longer
be hindered.  Implicitly reading but not implicitly writing headers just
seems wrong.

It also seems wrong to require the construction of "header" namedtuple
objects.  It's much less natural than dicts holding identity mappings.

>>> Point._make(Point._fields)
Point(x='x', y='y')

To me, that just looks weird and non-obvious.  That Point instance
doesn't really fit in my mind as something that should be a Point.
msg82798 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-27 00:55
    Rob> I agree that we should not unconditionally write headers, but I
    Rob> think that we should write headers by default, much like we read
    Rob> them by default.

I don't think you should write them by default.  I've worked with lots of
CSV files which have no headers.  I can imagine people wanting to write CSV
files with multiple headers.  It should be optional and explicit.

Skip
msg82799 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-27 01:00
More concretely, I don't think this is so onerous:

    names = ["col1", "col2", "color"]
    writer = csv.DictWriter(open("f.csv", "wb"), fieldnames=names, ...)
    writer.writerow(dict(zip(names, names)))
    ...

or

    f = open("f.csv", "rb")
    names = csv.reader(f).next()
    reader = csv.DictReader(f, fieldnames=names, ...)
    ...

Skip
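
For readers on Python 3, where reader objects have no .next() method and csv files are opened in text mode with newline="" rather than "wb"/"rb", the same recipe can be written as below. io.StringIO stands in for the file, and the data row is made up for the example.

```python
import csv
import io

names = ["col1", "col2", "color"]

# Writing: emit the header as an ordinary row, then the data.
out = io.StringIO()  # stands in for open("f.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=names)
writer.writerow(dict(zip(names, names)))
writer.writerow({"col1": 1, "col2": 2, "color": "red"})

# Reading: consume the header yourself, then hand the names to DictReader.
f = io.StringIO(out.getvalue())
header = next(csv.reader(f))  # Python 3 spelling of .next()
rows = list(csv.DictReader(f, fieldnames=header))
```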
msg82812 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-27 02:16
I did a search on Google Code for the DictReader constructor.  In the
first 3 pages of results, the fieldnames parameter was used in 14 of
27 cases (discounting unittest code built into Python) and was not
used in 13 of 27 cases.  I suppose that means headered csv files are
sufficiently rare that they shouldn't be created implicitly by
default.  I still don't like the lack of symmetry of supporting
implicit header reads, but not implicit header writes.
msg82814 - (view) Author: Raymond Hettinger (rhettinger) * Date: 2009-02-27 02:43
> I don't think you should write them by default.  
> I've worked with lots of CSV files which have no headers. 

My experience has been the same as Skip's.
msg82819 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-02-27 04:45
    Rob> I still don't like the lack of symmetry of supporting implicit
    Rob> header reads, but not implicit header writes.

A header is nothing more than a row in the CSV file with special
interpretation applied by the user.  There is nothing implicit about it.
If you know the first line is a header, use the recipe I posted.  If not,
supply your own fieldnames and treat the first row as data.

Skip
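
Both cases, shown with the existing DictReader. The input lines are made up for the example.

```python
import csv

lines = ["col1,col2", "1,2", "3,4"]

# fieldnames omitted: the first row is consumed as the header.
with_header = list(csv.DictReader(lines))

# fieldnames supplied: every row, including the first, is data.
all_data = list(csv.DictReader(lines, fieldnames=["a", "b"]))
```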
msg83298 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 04:34
Added a patch against the py3k branch.

In csv.rst, removed the reference to reader.next() as a public method.
msg83299 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-03-08 04:40
Jervis> in csv.rst removed reference to reader.next() as a public method.

Because?  I've not seen any discussion in this issue or in any other forums
(most certainly not on the csv@python.org mailing list) which would suggest
that csv.reader's next() method should no longer be a public method.

Skip
msg83310 - (view) Author: Antoine Pitrou (pitrou) Date: 2009-03-08 13:33
I don't understand why NamedTupleReader requires the fieldnames array
rather than the namedtuple class itself. If you could pass it the
namedtuple class, users could choose whatever namedtuple subclass with
whatever additional methods or behaviour suits them. It would make
NamedTupleReader more flexible and more useful.
msg83318 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-03-08 19:13
I don't know how NamedTuple objects work, but in many situations you
want the content of the CSV file to drive the output.  I would think
you would use a technique similar to my DictReader example to tell
the NamedTupleReader the fieldnames.  For that you need a fieldnames
argument.
msg83321 - (view) Author: Skip Montanaro (skip.montanaro) Date: 2009-03-08 19:21
I retract my previous comment.  I don't use the DictReader the way it
operates (fieldnames==None => first row is a header) and forgot about
that behavior.
msg83332 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 22:53
Jervis> in csv.rst removed reference to reader.next() as a public method.

Skip> Because?  I've not seen any discussion in this issue or in any
Skip> other forums (most certainly not on the csv@python.org mailing list)
Skip> which would suggest that csv.reader's next() method should no longer
Skip> be a public method.

I agree, this should be applied separately.
msg83333 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 23:13
Antoine> I don't understand why NamedTupleReader requires the 
Antoine> fieldnames array
Antoine> rather than the namedtuple class itself. If you could pass it
Antoine> the namedtuple class, users could choose whatever namedtuple 
Antoine> subclass with whatever additional methods or behaviour suits
Antoine> them. It would make NamedTupleReader more flexible and more 
Antoine> useful.

The NamedTupleReader does take the namedtuple class as the fieldnames
argument. It can be a namedtuple, a 'fieldnames' array or None. 
If a namedtuple is used as the fieldnames argument, returned rows are
created using ._make from this namedtuple. Unless I have read your
requirements incorrectly, this is the behaviour you describe.

Given the confusion, I accept that the documentation needs to be improved. 

The NamedTupleReader and Writer were created to follow as closely as
possible the behaviour (and signature) of the DictReader and DictWriter,
with the exception of using namedtuples instead of dicts.
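
As an illustration of the three accepted forms, a simplified sketch follows. It is not the patch's implementation: the function name, the class test and the Point subclass are assumptions made for the example.

```python
import csv
from collections import namedtuple


def namedtuple_reader(iterable, fieldnames=None, **kwds):
    """Hypothetical reader; fieldnames may be a class, a sequence, or None."""
    reader = csv.reader(iterable, **kwds)
    if isinstance(fieldnames, type) and hasattr(fieldnames, "_make"):
        row_type = fieldnames  # caller's namedtuple class or subclass
    elif fieldnames is None:
        row_type = namedtuple("Row", next(reader))  # header row
    else:
        row_type = namedtuple("Row", fieldnames)
    for values in reader:
        yield row_type._make(values)


class Point(namedtuple("Point", "x y")):
    """A subclass that adds behaviour, as Antoine suggests."""

    def total(self):
        return float(self.x) + float(self.y)


points = list(namedtuple_reader(["1,2", "3,4"], fieldnames=Point))
```

Because rows are built with ._make on the class that was passed in, points[0] is a Point and points[0].total() is available.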
msg83334 - (view) Author: Antoine Pitrou (pitrou) Date: 2009-03-08 23:21
Ok, I got misled by the documentation ("The contents of *fieldnames* are
passed directly to be used as the namedtuple fieldnames"), and your
implementation is a bit difficult to follow.
msg83340 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-09 00:06
Updated version of docs for 2.7 and 3k.
History
Date User Action Args
2009-03-09 00:08:01 jdwhitley set files: - ntreader5_py3_1.diff
2009-03-09 00:07:46 jdwhitley set files: + ntreader6_py27.diff
2009-03-09 00:07:04 jdwhitley set files: + ntreader6_py3.diff; messages: + msg83340
2009-03-08 23:21:31 pitrou set messages: + msg83334
2009-03-08 23:13:33 jdwhitley set messages: + msg83333
2009-03-08 22:53:06 jdwhitley set files: + ntreader5_py3_1.diff; messages: + msg83332
2009-03-08 19:21:00 skip.montanaro set messages: + msg83321
2009-03-08 19:19:42 skip.montanaro set messages: - msg83319
2009-03-08 19:18:15 skip.montanaro set messages: + msg83319
2009-03-08 19:13:24 skip.montanaro set messages: + msg83318
2009-03-08 13:33:50 pitrou set nosy: + pitrou; messages: + msg83310
2009-03-08 04:40:59 skip.montanaro set messages: + msg83299
2009-03-08 04:34:07 jdwhitley set files: + ntreader4_py3_1.diff; messages: + msg83298
2009-02-27 04:45:49 skip.montanaro set messages: + msg82819
2009-02-27 02:43:22 rhettinger set messages: + msg82814
2009-02-27 02:16:55 rrenaud set messages: + msg82812
2009-02-27 01:00:23 skip.montanaro set messages: + msg82799
2009-02-27 00:55:40 skip.montanaro set messages: + msg82798
2009-02-26 22:18:40 rrenaud set messages: + msg82780
2009-02-26 21:02:58 jdwhitley set messages: + msg82778
2009-02-26 19:04:27 skip.montanaro set messages: + msg82771
2009-02-26 19:02:03 skip.montanaro set messages: + msg82770
2009-02-26 15:47:25 barry set messages: + msg82765
2009-02-26 15:44:54 skip.montanaro set messages: + msg82764
2009-02-26 08:01:10 rhettinger set type: feature request; stage: patch review; messages: + msg82746; versions: + Python 3.1, Python 2.7, - Python 2.6
2009-02-26 07:59:24 rrenaud set files: - named_tuple_write_header.patch
2009-02-26 07:59:15 rrenaud set files: + named_tuple_write_header2.patch; messages: + msg82745
2009-02-26 07:38:36 rrenaud set files: + named_tuple_write_header.patch; nosy: + rrenaud; messages: + msg82744
2009-02-10 11:08:09 jdwhitley set files: + ntreader4.diff; messages: + msg81537
2009-02-10 01:25:16 rhettinger set messages: + msg81518
2009-02-09 16:54:00 rhettinger set messages: + msg81464
2009-02-09 09:25:00 jdwhitley set files: + ntreader3.diff; nosy: + jdwhitley; messages: + msg81453; keywords: + patch
2008-01-22 20:12:41 skip.montanaro set nosy: + skip.montanaro; messages: + msg61532
2008-01-22 19:25:56 rhettinger set messages: + msg61523
2008-01-13 22:27:14 rhettinger create