Issue 1225769: Proposal to implement comment rows in csv module

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42112

classification

Title:	Proposal to implement comment rows in csv module
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	andrewmcnamara	Nosy List:	Mario.Fasold, amaury.forgeotdarc, andrewmcnamara, eric.araujo, iain_haslam, kynan, orsenthil, pitrou, rhettinger, skip.montanaro
Priority:	normal	Keywords:	patch

Created on 2005-06-22 19:48 by iain_haslam, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
csv.diff	iain_haslam, 2005-06-22 19:48	Proposed diffs to implement commented records.
csv_comment_doc_test.diff	iain_haslam, 2005-06-30 23:40	Test and documentation diff

Messages (20)
msg48499 - (view)	Author: Iain Haslam (iain_haslam)	Date: 2005-06-22 19:48
Sometimes csv files contain comment rows, for temporarily commenting out data or occasionally for documentation. The current csv module has no built-in ability to skip rows; in order to skip all lines beginning with '#', the programmer writes something like: csv_reader = csv.reader(fp) for row in csv_reader: if row[0][0] != '#': #assuming no blank lines print row I propose adding a "commentchar" parameter to the csv parser, so that the above code could be written (more elegantly, in my opinion): csv_reader = csv.reader(fp, commentchar='#') for row in csv_reader: print row This requires only relatively minor changes to the module, and by defaulting to using no comment character, existing code will behave as before. If you are interested, the patch (diffs against current cvs) required for the second example to run are attached. Note that that implementation adds SKIPPED_RECORD as a pseudonym for START_RECORD, because setting status to START_RECORD after skipping a record would cause a blank record to be returned. Altering that behaviour would cause more changes and the patch would be harder to review. I've also held back on updating tests and documentation to reflect this change, pending any support for it. It shoud be irrelevant, but this has been developed on Debian testing against the cvs head of Python.
msg48500 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2005-06-25 19:39
Logged In: YES user_id=44345 I'm not inclined to clutter the C code with further complications. Why can't you implement this on an as-needed basis with a file object wrapper, a subclass of the csv.reader class, or just continue to use the example in your submission?
msg48501 - (view)	Author: Iain Haslam (iain_haslam)	Date: 2005-06-26 17:26
Logged In: YES user_id=1301296 > I'm not inclined to clutter the C code with further > complications. Sorry - I haven't been keeping up with the existing complications! Don't forget that one man's clutter is another man's functionality. It doesn't actually require much of a change to the code, although I was slightly suprised to discover that this module was in C in the first place... Basically, I noticed that the csv module has a bias towards Excel-generated csv files, but most of the time I've come across csv files, they were hand-edited, and I've often seen comment fields as described in the submission. My submission was intended in the "batteries included" spirit (I do understand that you stop short of the kitchen sink), and also seemed in-keeping with the 'skipinitialspace' option within the existing csv module. > Why can't you implement this on an as-needed basis > with a file object wrapper [other options] True, I could do any of those things, but it would be simpler / clearer not to have to. Of course, if you took your argument further, you could cut chunks out of several modules; the argument comes down to whether the benefits outweigh the additional complexity. I was suprised to discover the option wasn't already there, but maybe that's just me. In any case, if your vote goes from your apparent -0 to -1, that's your choice, and you're better placed to make it than I am. Cheers, Iain.
msg48502 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2005-06-26 18:02
Logged In: YES user_id=44345 Something else just occurred to me. What about writing csv files with comments? Also, a tweak to the docs would be in order if this is accepted.
msg48503 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2005-06-27 17:25
Logged In: YES user_id=44345 Iain - There was some positive response to your patch from the csv mailing list (most notably from one of the authors of the C extension module). Can you provide diffs for the module documentation and necessary test cases to go along with your patch? Also, addressing the issue that CSV files with comments (probably?) won't round-trip would be a good thing to note in the docs. Skip
msg48504 - (view)	Author: Iain Haslam (iain_haslam)	Date: 2005-06-30 23:40
Logged In: YES user_id=1301296 Here are the documentation and test diffs. I'm glad to hear of the positive feedback. I couldn't find the csv mailing list (I assume it's not public), so didn't see the discussion of round-tripping, but I agree with the implied conclusion that flagging rows as comments would complicate the interface too much. On a related point, I noticed that the csv documentation is filed under "Internet data handling". This seems a little odd - I'd suggest moving it to "Miscellaneous Services" alongside fileinput and ConfigParser. Thanks, Iain.
msg48505 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2007-02-11 18:59
Sorry, I'm coming back to this after a long hiatus... I'm still not inclined to make this change to the C source. I think a) comments in CSV files are pretty rare and that b) implementing this using a file object wrapper would be more flexible. #!/usr/bin/env python import csv import StringIO class CommentedFile: def __init__(self, f, commentstring="#"): self.f = f self.commentstring = commentstring def next(self): line = self.f.next() while line.startswith(self.commentstring): line = self.f.next() return line def __iter__(self): return self f = StringIO.StringIO('''\ "event","performers","start","end","time" # Rachel Sage "","Rachael Sage","2008-01-03","2008-01-03","8:00pm" # Others "","Tung-N-GRoeVE","2008-01-16","2008-01-16","9:30pm-2:00am" "","Bossa Nova Beatniks","2007-11-11","2007-11-11","11:11pm" "","Special Consensus","2006-10-06","2006-10-06","" ''') for row in csv.reader(CommentedFile(f)): print row The output of the above is as expected: ['event', 'performers', 'start', 'end', 'time'] ['', 'Rachael Sage', '2008-01-03', '2008-01-03', '8:00pm'] ['', 'Tung-N-GRoeVE', '2008-01-16', '2008-01-16', '9:30pm-2:00am'] ['', 'Bossa Nova Beatniks', '2007-11-11', '2007-11-11', '11:11pm'] ['', 'Special Consensus', '2006-10-06', '2006-10-06', ''] This has the added benefit that comment lines aren't restricted to single character comment prefixes. On the downside, comment characters appearing at the beginning of a continuation line would trip this up. In practice, I don't think it would be a significant limitation. In almost all cases I suspect CSV files with embedded comments would be manually created and maintained and aren't likely to contain fields with embedded comments. Skip
msg65434 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-04-13 03:33
Assigning to Andrew (as the primary C lib author). Andrew, please comment ;-).
msg65505 - (view)	Author: Andrew McNamara (andrewmcnamara) *	Date: 2008-04-15 07:16
I think it's a reasonable enough request - I've certainly had to process CSV files with comments. Iain - appologies for not looking at your request before now - 3 years is a pretty poor response time. Some thoughts: * this should target 2.6, not 2.5 (as 2.5 is in maintenance only)? * the check that disallows "space" - I wonder what other things it should disallow - the delimiter? * should it support multicharacter comments, like the C++ //-style comment? I think no. * any implications for the py3k port? Skip - are you happy making the changes, or should I dust off my working copy?
msg65506 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2008-04-15 08:01
-1 on the change. Comments in CSV files are not a common use case. I'm less worried about cluttering the C code and more concerned about cluttering the API, making the module harder to learn and remember. Also, it is already trivial to implement filtering using str.startswith (). This proposed patch is not worth the rare case where it saves a line or two code that was already crystal clear.
msg98415 - (view)	Author: Mario Fasold (Mario.Fasold)	Date: 2010-01-27 11:46
Comment lines are a very common case in scientific and statistical data. +1 for the change.
msg98421 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-01-27 13:17
Comment lines in csv data may be common in some areas, but they are not part of any standard, and they are not the only possible extension to csv files (for example: ignore empty lines, or a terminal \ for line continuation...) Currently all members of Dialect deal with the format of the records, not with the extraction of records from the file. (lineterminator is used only when writing). The "CommentedFile" filter above is a good recipe for all cases, and is easy to use. I recommend closing this issue.
msg98422 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-01-27 13:25
Hello Shouldn't the comment char definition belong in a dialect class? The reader would still have to be modified to skip these lines, but having this char in the dialect would not require a change to csv.reader signature. Kind regards
msg98426 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-01-27 13:49
> Shouldn't the comment char definition belong in a dialect class? The proposed patch does this correctly; then csv.reader automatically accepts the dialect parameters to override the selected dialect. I'm still -1 on the feature - not standard enough, and easy to implement outside the csv module.
msg98436 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-01-27 16:07
Amaury> Comment lines in csv data may be common in some areas, but they Amaury> are not part of any standard, and they are not the only possible Amaury> extension to csv files (for example: ignore empty lines, or a Amaury> terminal \ for line continuation...) Or different peoples' notion of how to comment strings. Precidely because there is no standard way to comment out content in CSV files people are free to dream up anything they want, including, but not limited to: * C- or C++-style comments * MoinMoin-style comments, where "##" introduces a comment but a lone "#" introduces a meta command to the parser Trying to accommodate the myriad varieties of way s people might decide to comment out content would put an undue burden on the csv module's parser. Skip
msg98438 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-01-27 16:11
Agreed with Skip, Raymond and Amaury. Since the csv module returns you an iterator, it's easy enough to wrap it in another iterator.
msg98447 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-01-27 18:43
Antoine> Since the csv module returns you an iterator, it's easy enough Antoine> to wrap it in another iterator. I prefer to do this sort of stuff as a pre-processing step, so I generally wrap the file object input and use that iterator as the input "file" for the csv reader. I'm sure it's largely a matter of taste and what you need to do. Certainly wrapping the csv reader's output would be useful to provide sensible missing values. Skip
msg98807 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2010-02-04 00:50
Here is another -1 for this proposed feature. Having a comments in the csv fields and providing a way to deal will complicate matters more than required. Different suggestions of how to accomplish it has been suggested here. As others, I too recommend closing it. (It is assigned to andewmcnamara, so I guess, he would close it).
msg98808 - (view)	Author: Andrew McNamara (andrewmcnamara) *	Date: 2010-02-04 02:10
Okay, while I am sympathetic to the points raised by the people asking for this enhancement, I'm persuaded to reject it by the arguments that the potential benefit is outweighed by the increase in complexity (code and documentation). While the attached patch from Iain requires relatively minor and innocuous changes to the state machine, it only satisfies a limited subset of users, and the effect of complexity is geometric. In other words, to understand the state machine, you need to understand the relation of each state to every other state, etc. As to why the core of the module is implemented in C, the sort of stateful parsing done by the module is one area where python does not excel (Ha!). Stripping comments, on the other hand, CAN be done with reasonable efficiency from python (and considerably more flexibility). Also, it has been my experience that CSV files that use comments typically have other non-standard features as well, requiring additional pre- and post-processing in any case.
msg98812 - (view)	Author: Andrew McNamara (andrewmcnamara) *	Date: 2010-02-04 04:05
Note that there is one case that cannot easily be addressed via pre-processing: where the comment character coincidently appears at the start of a line within a multi-line quoted field. For example: # This is a comment 1, 2, "This is field^M #3" What this should produce is debatable, but it would be hard to make it produce: ["1", "2", "This is field^M#3"] without implementing it within the parser.

History
Date	User	Action	Args
2022-04-11 14:56:11	admin	set	github: 42112
2010-08-29 13:40:43	kynan	set	nosy: + kynan
2010-02-04 04:05:37	andrewmcnamara	set	messages: + msg98812
2010-02-04 02:10:23	andrewmcnamara	set	status: open -> closed resolution: rejected messages: + msg98808
2010-02-04 00:50:33	orsenthil	set	nosy: + orsenthil messages: + msg98807
2010-01-27 18:43:25	skip.montanaro	set	messages: + msg98447
2010-01-27 16:11:58	pitrou	set	nosy: + pitrou messages: + msg98438
2010-01-27 16:07:22	skip.montanaro	set	messages: + msg98436
2010-01-27 13:49:33	amaury.forgeotdarc	set	messages: + msg98426
2010-01-27 13:25:53	eric.araujo	set	nosy: + eric.araujo messages: + msg98422
2010-01-27 13:17:04	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg98421
2010-01-27 11:46:11	Mario.Fasold	set	nosy: + Mario.Fasold messages: + msg98415
2008-04-15 08:01:50	rhettinger	set	nosy: + rhettinger messages: + msg65506
2008-04-15 07:16:58	andrewmcnamara	set	messages: + msg65505
2008-04-13 03:33:03	skip.montanaro	set	assignee: skip.montanaro -> andrewmcnamara messages: + msg65434 nosy: + andrewmcnamara
2005-06-22 19:48:01	iain_haslam	create