classification
Title: Add named tuple reader to CSV module
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: asvetlov, barry, eric.araujo, jdwhitley, pitrou, rhettinger, rrenaud
Priority: low Keywords: patch

Created on 2008-01-13 22:27 by rhettinger, last changed 2014-02-03 19:05 by BreamoreBoy.

Files
File name Uploaded Description
ntreader.diff rhettinger, 2008-01-13 22:27 Proof-of-concept patch
ntreader3.diff jdwhitley, 2009-02-09 09:24 namedtuple reader and writer.
ntreader4.diff jdwhitley, 2009-02-10 11:08 Includes revision for rename keyword argument
named_tuple_write_header2.patch rrenaud, 2009-02-26 07:59
ntreader4_py3_1.diff jdwhitley, 2009-03-08 04:34 Patch against python 3.1a1
ntreader6_py3.diff jdwhitley, 2009-03-09 00:06 updated documentation
ntreader6_py27.diff jdwhitley, 2009-03-09 00:07 updated documentation
Messages (36)
msg59866 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-01-13 22:27
Here's a proof-of-concept patch.  If approved, will change from
generator form to match the other readers and will add a test suite.

The idea corresponds to what is currently done by the dict reader but
returns a space and time efficient named tuple instead of a dict.  Field
order is preserved and named attribute access is supported.

A writer is not needed because named tuples can be fed into the
existing writer just like regular tuples.
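
For readers following along, here is a minimal sketch of the idea in generator form. It is not the attached patch: the helper name namedtuple_reader and the type name Row are assumptions made for illustration, and the header row is assumed to hold valid Python identifiers.

```python
import csv
import io
from collections import namedtuple


def namedtuple_reader(lines, fieldnames=None, **kwds):
    """Yield one named tuple per CSV row (hypothetical helper, generator form)."""
    reader = csv.reader(lines, **kwds)
    if fieldnames is None:
        # Like DictReader, treat the first row as the header.
        try:
            fieldnames = next(reader)
        except StopIteration:
            return
    Row = namedtuple('Row', fieldnames)
    for row in reader:
        if row:  # csv.reader yields [] for blank lines; skip them
            yield Row._make(row)


data = io.StringIO('name,qty\nwidget,3\ngadget,5\n')
rows = list(namedtuple_reader(data))

# Named tuples go straight into the existing writer, just like plain tuples.
out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(rows)
```

Field order is preserved because each row is still a tuple, and values stay strings because csv.reader does no type conversion.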
msg61523 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-01-22 19:25
Barry, any thoughts on this?
msg61532 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-01-22 20:12
I'd personally be kind of surprised if Barry had any thoughts on this.
Is there any reason this couldn't be pushed down into the C code and
replace the normal tuple output completely?  In the absence of any
fieldnames you could just dream some up, like "field001", "field002",
etc.

Skip
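
A rough pure-Python sketch of the "dream some up" idea, for illustration only. The suggestion above is about the C code; the helper default_fieldnames and the type name Row here are hypothetical.

```python
import csv
from collections import namedtuple


def default_fieldnames(width):
    """Return placeholder names field001, field002, ... (hypothetical helper)."""
    return ['field%03d' % i for i in range(1, width + 1)]


lines = ['1.5,2.5,north', '3.0,4.0,south']
rows = list(csv.reader(lines))

# Build the row type from the width of the first row, since no header exists.
Row = namedtuple('Row', default_fieldnames(len(rows[0])))
records = [Row._make(r) for r in rows]
```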
msg81453 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-09 09:24
An implementation of a namedtuple reader and writer.

Created a writer for the case where user would like to specify
desired field names and default values on missing field names.

e.g.
mywriter = NamedTupleWriter(f, fieldnames=['f1', 'f2', 'f3'], 
                            restval='missing')

Nt = namedtuple('LessFields', 'f1 f3')
nt = Nt(f1='one', f3=2)

mywriter.writerow(nt) # writes one,missing,2

Any thoughts on the case where a defined fieldname has a leading
underscore? Should there be a flag to silently strip it?

e.g. 
if self._ignore_underscores:
   fieldname = fieldname.lstrip('_')

Leading underscores may be present in an unseen CSV file;
additionally, spaces and other non-alphanumeric characters pose
a problem that does not affect the DictReader class.

Cheers,
msg81464 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-02-09 16:53
Consider providing a hook to a function that converts non-conforming
field names (ones with a leading underscore, leading digit, non-letter,
keyword, or duplicate name).

class NamedTupleReader:
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", fieldnamer=None, *args, **kwds):
                 . . .

I'm going to either post a recipe to do the renaming or provide a static
method for the same purpose.   It might work like this:

  >>> renamer(['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc'])
  ['abc', 'x_def', 'x_1', 'x_hidden', 'x_abc', 'p', 'x1_abc']
msg81518 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-02-10 01:25
In r69480, named tuples gained the ability to automatically rename
invalid fieldnames.
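
For illustration, this is how the rename option behaves in collections.namedtuple as released: each invalid name is replaced by an underscore followed by its position, rather than by the x_ prefix floated in the earlier message. The type name Row is arbitrary.

```python
from collections import namedtuple

names = ['abc', 'def', '1', '_hidden', 'abc', 'p', 'abc']

# Keywords, non-identifiers, leading underscores and duplicates are renamed.
Row = namedtuple('Row', names, rename=True)
print(Row._fields)
# ('abc', '_1', '_2', '_3', '_4', 'p', '_6')
```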
msg81537 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-10 11:08
Updated NamedTupleReader to give a rename=False keyword argument.
rename is passed directly to the namedtuple factory function to enable
automatic handling of invalid fieldnames.

Two new tests for the rename keyword.

Cheers,
msg82744 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 07:38
I am totally new to Python dev.  I reinvented a NamedTupleReader
tonight, only to find out that it was created a year ago.  My primary
motivation is that DictReader reads headers nicely, but DictWriter
totally sucks at handling them.

Consider doing some filtering on a csv file, like so.

sample_data = [
    'title,latitude,longitude',
    'OHO Ofner & Hammecke Reinigungsgesellschaft mbH,48.128265,11.610848',
    'Kitchen Kaboodle,45.544241,-122.715728',
    'Walgreens,28.339727,-81.596367',
    'Gurnigel Pass,46.731944,7.447778'
    ]

def filter_with_dict_reader_writer():
  accepted_rows = []
  for row in csv.DictReader(sample_data):
    if float(row['latitude']) > 0.0 and float(row['longitude']) > 0.0:
      accepted_rows.append(row)

  field_names = csv.reader(sample_data).next()
  output_writer = csv.DictWriter(open('accepted_by_dict.csv', 'w'),
                                 field_names)
  output_writer.writerow(dict(zip(field_names, field_names)))
  output_writer.writerows(accepted_rows)

You have to work so hard to maintain the headers when you write the file
with DictWriter.  I understand this is a limitation of dicts throwing
away the order information.  But namedtuples don't have that problem.

NamedTupleReader and NamedTupleWriter should be inverses.  This means
that NamedTupleWriter needs to write headers.  This should produce
identical output as the dict writer example, but it's much cleaner.

def filter_with_named_tuple_reader_writer():
   accepted_rows = []
   for row in csv.NamedTupleReader(sample_data):
     if float(row.latitude) > 0.0 and float(row.longitude) > 0.0:
       accepted_rows.append(row)

   output_writer = csv.NamedTupleWriter(
       open('accepted_by_named_tuple.csv', 'w'))
   output_writer.writerows(accepted_rows)

I patched on top of the existing NamedTupleWriter patch adding support
for writing headers.  I don't know if that's bad style/etiquette, etc.
msg82745 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 07:59
My previous patch could write the header twice.  But I am not sure
about how the writer should handle the fieldnames parameter on one hand,
and the namedtuple._fields on the other.
msg82746 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-02-26 08:01
The two latest patches (ntreader4.diff and
named_tuple_write_header.patch) seem like they are going in the right
direction and are getting close.

Barry or Skip, is this something you want in your module?
msg82764 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-26 15:44
Raymond> Barry or Skip, is this something you want in your module?

Sorry, I haven't really looked at this ticket other than to notice its
presence.  I wrote the DictReader/DictWriter functions way back when, so I'm
pretty comfortable using them.  I haven't felt the need for any other reader
or writer which manipulates file headers.

Skip
msg82765 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2009-02-26 15:47
I think it would be useful to have.
msg82770 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-26 19:02
Hrm... I replied twice by email.  Only one comment appears to have
survived the long trip.  Here's my second reply:


    Rob> NamedTupleReader and NamedTupleWriter should be inverses.  This
    Rob> means that NamedTupleWriter needs to write headers.  This should
    Rob> produce identical output as the dict writer example, but it's much
    Rob> cleaner.

You're assuming that one instance of these classes will read or write an
entire file.  What if you want to append lines to an existing CSV file or
pick up reading a file with a new reader which has already been partially
processed?
msg82771 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-26 19:04
Let me be more explicit.  I don't know how it implements it, but I think
you really need to give the user the option of specifying the field
names and not reading/writing headers.  It can't be implicit as I
interpreted Rob's earlier comment:

    > NamedTupleReader and NamedTupleWriter should be inverses.
    > This means that NamedTupleWriter needs to write headers.

Skip
msg82778 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-02-26 21:02
Skip> Let me be more explicit.  I don't know how it implements it, but I
Skip> think you really need to give the user the option of specifying the
Skip> field names and not reading/writing headers.  It can't be implicit as I
Skip> interpreted Rob's earlier comment:

    rrenaud> NamedTupleReader and NamedTupleWriter should be inverses.
    rrenaud> This means that NamedTupleWriter needs to write headers.

I agree with Skip, we mustn't have a 'wroteheader' flag internal to the 
NamedTupleWriter.

Currently to write a 'header' row with a csv.writer you could (for 
example) pass a tuple of header names to writerow. NamedTupleWriter
is no different, you would have a namedtuple of header names instead of
a tuple of header names.

I would not like to see another flag added to the initialisation process
to enable the writing of a header row as the 'first' (or any) row 
written to a file.  We could add a function 'writeheader' that would
write the contents of 'fieldnames' as a row, but I don't like the idea.

Cheers,
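
A small sketch of the explicit approach with the stock csv.writer, assuming a made-up Point type. The header is written as an ordinary row, and only when the caller asks for it; passing Point._fields is one way to get the names without a separate flag or method.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple('Point', 'x y')
points = [Point(1, 2), Point(3, 4)]

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(Point._fields)   # explicit, optional header row
writer.writerows(points)         # named tuples are written like plain tuples
```

Omitting the writerow(Point._fields) call gives a headerless file, which also covers the case of appending to an existing file.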
msg82780 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-26 22:18
I want to make sure I understand.  Am I correct in believing that Skip
thinks writing headers should be optional, while Jervis believes we
should leave the burden to the NamedTupleWriter client?  

I agree that we should not unconditionally write headers, but I think
that we should write headers by default, much like we read them by default.

I believe the implicit header writing is very elegant, and the only
reason that the DictWriter object doesn't write headers is the impedance
mismatch between dicts and CSV.  namedtuples has the field order
information, the impedance mismatch is gone, we should no longer be
hindered.  Implicitly reading but not explicitly writing headers just
seems wrong.

It also seems wrong to require the construction of "header" namedtuple
objects.  It's much less natural than dicts holding identity mappings.

>>> Point._make(Point._fields)
Point(x='x', y='y')

To me, that just looks weird and non-obvious.  That Point instance
doesn't really fit in my mind as something that should be a Point.
msg82798 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-27 00:55
    Rob> I agree that we should not unconditionally write headers, but I
    Rob> think that we should write headers by default, much like we read
    Rob> them by default.

I don't think you should write them by default.  I've worked with lots of
CSV files which have no headers.  I can imagine people wanting to write CSV
files with multiple headers.  It should be optional and explicit.

Skip
msg82799 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-27 01:00
More concretely, I don't think this is so onerous:

    names = ["col1", "col2", "color"]
    writer = csv.DictWriter(open("f.csv", "wb"), fieldnames=names, ...)
    writer.writerow(dict(zip(names, names)))
    ...

or

    f = open("f.csv", "rb")
    names = csv.reader(f).next()
    reader = csv.DictReader(f, fieldnames=names, ...)
    ...

Skip
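
For anyone trying the recipe on Python 3, a sketch of the same two steps follows. The in-memory io.StringIO buffers stand in for real files opened in text mode with newline='', and next(reader) replaces the .next() method; the sample row is made up.

```python
import csv
import io

names = ['col1', 'col2', 'color']

# Writing: the header is an ordinary row that the caller chooses to emit.
out = io.StringIO(newline='')
writer = csv.DictWriter(out, fieldnames=names, lineterminator='\n')
writer.writerow(dict(zip(names, names)))
writer.writerow({'col1': 1, 'col2': 2, 'color': 'red'})

# Reading: consume the header yourself, then hand the names to DictReader.
f = io.StringIO(out.getvalue(), newline='')
header = next(csv.reader(f))
reader = csv.DictReader(f, fieldnames=header)
rows = list(reader)
```

Later releases (2.7 and 3.2) added DictWriter.writeheader(), which performs the header write in one explicit call.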
msg82812 - (view) Author: Rob Renaud (rrenaud) Date: 2009-02-27 02:16
I did a search on Google code for the DictReader constructor.  I
analyzed the first 3 pages, the fieldnames parameter was used in 14 of
27 cases (discounting unittest code built into Python) and was not
used in 13 of 27 cases.  I suppose that means headered csv files are
sufficiently rare that they shouldn't be created implicitly by
default.  I still don't like the lack of symmetry of supporting
implicit header reads, but not implicit header writes.

msg82814 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-02-27 02:43
> I don't think you should write them by default.  
> I've worked with lots of CSV files which have no headers. 

My experience has been the same as Skip's.
msg82819 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-02-27 04:45
    Rob> I still don't like the lack of symmetry of supporting implicit
    Rob> header reads, but not implicit header writes.

A header is nothing more than a row in the CSV file with special
interpretation applied by the user.  There is nothing implicit about it.
If you know the first line is a header, use the recipe I posted.  If not,
supply your own fieldnames and treat the first row as data.

Skip
msg83298 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 04:34
Added a patch against py3k branch.

in csv.rst removed reference to reader.next() as a public method.
msg83299 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-03-08 04:40
Jervis> in csv.rst removed reference to reader.next() as a public method.

Because?  I've not seen any discussion in this issue or in any other forums
(most certainly not on the csv@python.org mailing list) which would suggest
that csv.reader's next() method should no longer be a public method.

Skip
msg83310 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-08 13:33
I don't understand why NamedTupleReader requires the fieldnames array
rather than the namedtuple class itself. If you could pass it the
namedtuple class, users could choose whatever namedtuple subclass with
whatever additional methods or behaviour suits them. It would make
NamedTupleReader more flexible and more useful.
msg83318 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-03-08 19:13
I don't know how NamedTuple objects work, but in many situations you
want the content of the CSV file to drive the output.  I would think
you would use a technique similar to my DictReader example to tell
the NamedTupleReader the fieldnames.  For that you need a fieldnames
argument.
msg83321 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2009-03-08 19:21
I retract my previous comment.  I don't use the DictReader the way it
operates (fieldnames==None => first row is a header) and forgot about
that behavior.
msg83332 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 22:53
Jervis> in csv.rst removed reference to reader.next() as a public method.

Skip> Because?  I've not seen any discussion in this issue or in any
Skip> other forums (most certainly not on the csv@python.org mailing list)
Skip> which would suggest that csv.reader's next() method should no longer
Skip> be a public method.

I agree, this should be applied separately.
msg83333 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-08 23:13
Antoine> I don't understand why NamedTupleReader requires the 
Antoine> fieldnames array
Antoine> rather than the namedtuple class itself. If you could pass it
Antoine> the namedtuple class, users could choose whatever namedtuple 
Antoine> subclass with whatever additional methods or behaviour suits
Antoine> them. It would make NamedTupleReader more flexible and more 
Antoine> useful.

The NamedTupleReader does take the namedtuple class as the fieldnames
argument. It can be a namedtuple, a 'fieldnames' array or None. 
If a namedtuple is used as the fieldnames argument, returned rows are
created using ._make from this namedtuple. Unless I have read your
requirements incorrectly, this is the behaviour you describe.

Given the confusion, I accept that the documentation needs to be improved. 

The NamedTupleReader and Writer were created to follow as closely as
possible the behaviour (and signature) of the DictReader and DictWriter,
with the exception of using namedtuples instead of dicts.
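
To make the three accepted forms of the fieldnames argument concrete, here is a hedged sketch. The helpers is_namedtuple_class and resolve_row_type, and the Point subclass, are hypothetical; the attached patches may structure this differently.

```python
import csv
from collections import namedtuple


def is_namedtuple_class(obj):
    """Heuristic check for a namedtuple class (or a subclass of one)."""
    return (isinstance(obj, type) and issubclass(obj, tuple)
            and hasattr(obj, '_fields'))


def resolve_row_type(fieldnames, header):
    """Pick the row class from a namedtuple class, a list of names, or None."""
    if fieldnames is None:
        return namedtuple('Row', header)      # names come from the file
    if is_namedtuple_class(fieldnames):
        return fieldnames                     # caller-supplied class
    return namedtuple('Row', fieldnames)      # explicit list of names


class Point(namedtuple('Point', 'x y')):
    """A namedtuple subclass carrying an extra method."""

    def norm1(self):
        return abs(float(self.x)) + abs(float(self.y))


reader = csv.reader(['3,-4', '1,2'])
RowType = resolve_row_type(Point, header=None)
points = [RowType._make(row) for row in reader]
```

Because _make is a classmethod, rows built this way are instances of the caller's subclass, so points[0].norm1() is available.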
msg83334 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-08 23:21
Ok, I got misled by the documentation ("The contents of *fieldnames* are
passed directly to be used as the namedtuple fieldnames"), and your
implementation is a bit difficult to follow.
msg83340 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-09 00:06
Updated version of docs for 2.7 and 3k.
msg102936 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-04-12 10:57
See also this python-ideas thread: http://mail.python.org/pipermail/python-ideas/2010-April/006991.html
msg102959 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2010-04-12 17:08
Type conversion is a whole 'nuther kettle of fish.  This particular thread is long and complex enough that it shouldn't be made more complex.
msg110598 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-17 19:18
I suggest that this is closed unless anyone shows an active interest in it.
msg111523 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-25 08:38
Closing as no response to msg110598.
msg111552 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-07-25 19:12
Re-opening because we ought to do something along these lines at some point.  The DictReader and DictWriter are inadequate for preserving order and they are unnecessarily memory intensive (one dict per record).  

FWIW, the non-conforming field name problem has already been solved by recent improvements to collections.namedtuple using rename=True.
msg115348 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-09-02 00:33
Unassigning, this needs fresh thought and a fresh patch from someone who can devote a little deep thinking on how to solve this problem cleanly.  In the meantime, it is no problem to simply cast the CSV tuples into named tuples.
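
A short sketch of that cast, using namedtuple's _make classmethod on rows from the plain csv.reader. The Place type and the headerless sample rows (borrowed from the earlier filtering example) are for illustration only.

```python
import csv
import io
from collections import namedtuple

Place = namedtuple('Place', 'title latitude longitude')

f = io.StringIO('Kitchen Kaboodle,45.544241,-122.715728\n'
                'Gurnigel Pass,46.731944,7.447778\n')

# Each list produced by csv.reader becomes a Place; values remain strings.
places = [Place._make(row) for row in csv.reader(f)]
north_east = [p.title for p in places
              if float(p.latitude) > 0.0 and float(p.longitude) > 0.0]
```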
History
Date User Action Args
2014-02-03 19:05:18 BreamoreBoy set nosy: - BreamoreBoy
2012-12-14 03:53:54 asvetlov set nosy: + asvetlov
2012-09-06 12:06:51 ainur0160 set nosy: - ainur0160
2012-09-06 11:48:56 ainur0160 set nosy: + ainur0160
2010-09-02 00:33:39 rhettinger set priority: normal -> low
    versions: - Python 3.2
    messages: + msg115348
    assignee: rhettinger ->
    stage: patch review -> needs patch
2010-07-25 19:12:12 rhettinger set status: closed -> open
    assignee: barry -> rhettinger
    messages: + msg111552
2010-07-25 08:38:11 BreamoreBoy set status: pending -> closed
    messages: + msg111523
2010-07-17 19:18:15 BreamoreBoy set status: open -> pending
    versions: + Python 3.2, Python 3.3, - Python 3.1, Python 2.7
    nosy: + BreamoreBoy
    messages: + msg110598
2010-05-20 20:38:50 skip.montanaro set nosy: - skip.montanaro
2010-04-12 17:08:27 skip.montanaro set messages: + msg102959
2010-04-12 10:57:47 eric.araujo set nosy: + eric.araujo
    messages: + msg102936
2009-03-09 00:08:01 jdwhitley set files: - ntreader5_py3_1.diff
2009-03-09 00:07:46 jdwhitley set files: + ntreader6_py27.diff
2009-03-09 00:07:04 jdwhitley set files: + ntreader6_py3.diff
    messages: + msg83340
2009-03-08 23:21:31 pitrou set messages: + msg83334
2009-03-08 23:13:33 jdwhitley set messages: + msg83333
2009-03-08 22:53:06 jdwhitley set files: + ntreader5_py3_1.diff
    messages: + msg83332
2009-03-08 19:21:00 skip.montanaro set messages: + msg83321
2009-03-08 19:19:42 skip.montanaro set messages: - msg83319
2009-03-08 19:18:15 skip.montanaro set messages: + msg83319
2009-03-08 19:13:24 skip.montanaro set messages: + msg83318
2009-03-08 13:33:50 pitrou set nosy: + pitrou
    messages: + msg83310
2009-03-08 04:40:59 skip.montanaro set messages: + msg83299
2009-03-08 04:34:07 jdwhitley set files: + ntreader4_py3_1.diff
    messages: + msg83298
2009-02-27 04:45:49 skip.montanaro set messages: + msg82819
2009-02-27 02:43:22 rhettinger set messages: + msg82814
2009-02-27 02:16:55 rrenaud set messages: + msg82812
2009-02-27 01:00:23 skip.montanaro set messages: + msg82799
2009-02-27 00:55:40 skip.montanaro set messages: + msg82798
2009-02-26 22:18:40 rrenaud set messages: + msg82780
2009-02-26 21:02:58 jdwhitley set messages: + msg82778
2009-02-26 19:04:27 skip.montanaro set messages: + msg82771
2009-02-26 19:02:03 skip.montanaro set messages: + msg82770
2009-02-26 15:47:25 barry set messages: + msg82765
2009-02-26 15:44:54 skip.montanaro set messages: + msg82764
2009-02-26 08:01:10 rhettinger set type: enhancement
    stage: patch review
    messages: + msg82746
    versions: + Python 3.1, Python 2.7, - Python 2.6
2009-02-26 07:59:24 rrenaud set files: - named_tuple_write_header.patch
2009-02-26 07:59:15 rrenaud set files: + named_tuple_write_header2.patch
    messages: + msg82745
2009-02-26 07:38:36 rrenaud set files: + named_tuple_write_header.patch
    nosy: + rrenaud
    messages: + msg82744
2009-02-10 11:08:09 jdwhitley set files: + ntreader4.diff
    messages: + msg81537
2009-02-10 01:25:16 rhettinger set messages: + msg81518
2009-02-09 16:54:00 rhettinger set messages: + msg81464
2009-02-09 09:25:00 jdwhitley set files: + ntreader3.diff
    nosy: + jdwhitley
    messages: + msg81453
    keywords: + patch
2008-01-22 20:12:41 skip.montanaro set nosy: + skip.montanaro
    messages: + msg61532
2008-01-22 19:25:56 rhettinger set messages: + msg61523
2008-01-13 22:27:14 rhettinger create