classification
Title: csv.Sniffer does not detect lineterminator
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.8

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Gertjan van den Burg, nascheme, skip.montanaro, terry.reedy, vmax
Priority: normal
Keywords:

Created on 2017-07-01 21:23 by vmax, last changed 2018-11-08 21:16 by skip.montanaro.

Pull Requests
URL Status Linked Edit
PR 2529 open vmax, 2017-07-01 22:01
Messages (7)
msg297497 - (view) Author: Max Vorobev (vmax) * Date: 2017-07-01 21:23
Line terminator defaults to '\r\n' while detecting dialect in csv.Sniffer
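
A minimal sketch of the reported behavior, assuming a small made-up sample whose lines end in '\n' only. The sample text is an illustration and does not come from the original report.

```python
import csv

# Made-up sample: every line ends with a bare '\n'.
sample = "a,b,c\n1,2,3\n4,5,6\n"

dialect = csv.Sniffer().sniff(sample)

# The delimiter is detected from the sample ...
print(repr(dialect.delimiter))
# ... but the line terminator is reported as '\r\n',
# even though the sample contains no '\r' at all.
print(repr(dialect.lineterminator))
```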
msg311804 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-07 21:40
The csv expert listed in https://devguide.python.org/experts/ is marked as inactive, and I have never used the module.  So you might need to ask for help on the core-mentorship list.

The csv doc for Sniffer.sniff says "Analyze the given sample and return a Dialect subclass reflecting the parameters found."  It is not clear to me whether 'the parameters found' is meant to be all possible parameters or just those found.  So, to be conservative, I will initially treat this as a feature addition for the next version, rather than a bug to also be fixed in current versions.  It does seem like a reasonable request.
msg311805 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-07 21:45
Looking at the code and docstring, lineterminator was intentionally (knowingly) not sniffed, making this a feature addition.
msg311806 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-07 21:55
While Sniffer *returns* a dialect with lineterminator = '\r\n', it *uses* '\n' for splitting.  This is slightly odd, as it leaves lines terminated by '\r' while detecting within-line parameters, but it does not affect such detection.

Are there csv files in the wild that use \r as the line terminator?  If so, they will not currently get split.
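
To illustrate the point about splitting, here is a rough sketch that mimics splitting a sample on '\n' only. It is not the Sniffer source; the variable names and sample strings are assumptions made for the illustration.

```python
# Made-up samples: one with '\r\n' endings, one with bare '\r' endings.
crlf_sample = "a,b\r\n1,2\r\n"
cr_sample = "a,b\r1,2\r"

# Split on '\n' only and drop empty chunks.
crlf_lines = [line for line in crlf_sample.split("\n") if line]
cr_lines = [line for line in cr_sample.split("\n") if line]

print(crlf_lines)  # each line keeps its trailing '\r'
print(cr_lines)    # the '\r'-only sample stays one unsplit chunk
```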
msg327094 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-04 22:33
There is another issue related to this.  If you use codecs to get a reader, it uses str.splitlines() internally, which treats a bunch of different characters as line terminators.  See issue #18291 and:

https://docs.python.org/3.8/library/stdtypes.html#str.splitlines
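
For illustration, a short sketch of how str.splitlines() differs from splitting on '\n'. The sample string is made up; the extra characters used here (vertical tab, file separator, Unicode line separator) are among those the linked documentation lists as line boundaries.

```python
# Made-up row containing characters that str.splitlines() treats as line breaks.
text = "a,b\x0bc\x1cd\u2028e\n"

by_splitlines = text.splitlines()
by_newline = text.split("\n")

print(by_splitlines)  # the row is broken into several pieces
print(by_newline)     # the row stays in one piece (plus a trailing empty string)
```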

I was thinking about different ways to fix this.  First, the csv module suggests you pass newline='' to the file object.  I suspect most people don't know to do that.  So, I thought maybe the csv module should inspect the file object that gets passed in and then warn if newline='' has not been used or if the file is a codecs reader object.
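
A small sketch of what newline='' changes in practice, assuming a temporary file with '\r\n' line endings and a quoted field that contains an embedded '\r\n'. The file contents are made up for the example.

```python
import csv
import os
import tempfile

# Made-up CSV bytes: '\r\n' line endings and a quoted field with an embedded '\r\n'.
raw = b'id,note\r\n1,"first\r\nsecond"\r\n'

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default newline handling: '\r\n' is translated to '\n' before csv sees it.
    with open(path, encoding="ascii") as f:
        translated = list(csv.reader(f))

    # newline='': line endings reach the csv module untranslated.
    with open(path, newline="", encoding="ascii") as f:
        preserved = list(csv.reader(f))
finally:
    os.remove(path)

print(translated)
print(preserved)
```

Without newline='', the embedded line break inside the quoted field is silently rewritten, which is the kind of problem the suggested warning would point out.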

However, that seems fairly complicated.  Would it be better if we changed the 'csv' module to do its own line splitting?  I think that would be better although I'm not sure about backwards compatibility.  Currently, the reader expects to call iter() on the input file.  Would it be okay if it used the 'read' method of it in preference to using iter()?  It could still fall back to iter() if there was no read method.
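
One possible reading of this idea, as a rough sketch only. The helper iter_csv_lines and its regular expression are hypothetical and are not part of the csv module; the sketch prefers read() and splits only on '\r\n', '\r' and '\n', falling back to iter() otherwise.

```python
import csv
import io
import re

# Hypothetical pattern: a line is text up to and including '\r\n', '\r' or '\n',
# or a final run of text with no terminator.
_LINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def iter_csv_lines(source):
    """Hypothetical helper: prefer source.read(), fall back to iter(source)."""
    read = getattr(source, "read", None)
    if read is None:
        return iter(source)
    return iter(_LINE.findall(read()))


# Made-up input mixing terminators, with a vertical tab inside a field.
source = io.StringIO("a\x0bb,c\r\n1,2\r3,4\n", newline="")
rows = list(csv.reader(iter_csv_lines(source)))
print(rows)
```

Because the vertical tab is not treated as a line break here, the first field survives intact, unlike with str.splitlines().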
msg329482 - (view) Author: Gertjan van den Burg (Gertjan van den Burg) Date: 2018-11-08 17:27
Note that the current CSV parser in _csv.c doesn't require the line terminator; it eats up \r and \n where necessary. See:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Modules/_csv.c#L752

This is why the line terminator isn't detected and doesn't need to be detected.

Also, files that use the \r line terminator exist and are parsed correctly at the moment. See for example: https://raw.githubusercontent.com/hadley/data-fuel-economy/master/1998-2008/2008.csv
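
A quick sketch of this behavior, assuming made-up input lines that end in '\r\n', '\r', '\n', or nothing at all. csv.reader accepts any iterable of strings, so a plain list is used here instead of a file.

```python
import csv

# Made-up lines, each with a different terminator (the last has none).
lines = ["a,b\r\n", "1,2\r", "3,4\n", "5,6"]

rows = list(csv.reader(lines))
print(rows)  # the terminators are consumed by the parser
```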
msg329487 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2018-11-08 21:16
A couple comments.

1. Terry Reedy wrote:

> The csv expert listed in https://devguide.python.org/experts/ is marked as inactive

That would be me. I am indeed inactive w.r.t. fixing broken stuff, and
don't want to feel obligated to jump in with both feet when a CSV
ticket is raised. Still, I keep half an eye on things. If people are
actually interested in my opinion on such stuff, drop me a line.

2. Regarding the csv.Sniffer class... I've personally never found it
useful, and would be happy to see it deprecated. I occasionally define
a delimiter other than comma, and never specify the quotechar. (I've
never seen anything other than quotation marks used anyway.) As others
have indicated, the line terminator is kind of unnecessary with Python
3 (unless you need something really weird). If you actually need to
specify a delimiter, I think giving a set of candidate delimiters
would be sufficient. The first one encountered wins.
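
For comparison, a sketch of two ways to work from a set of candidate delimiters. The first uses the existing delimiters argument of Sniffer.sniff(). The second, first_delimiter, is a hypothetical helper that implements one reading of "the first one encountered wins"; it is not part of the csv module, and the sample data is made up.

```python
import csv

# Made-up sample: ';' separates fields, ',' appears inside a field.
sample = "name;city, state;zip\nAnn;Austin, TX;78701\nBob;Boise, ID;83702\n"

# Existing API: restrict sniffing to a set of candidate delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";\t")
print(repr(dialect.delimiter))


def first_delimiter(text, candidates):
    """Hypothetical helper: return the first candidate character seen in text."""
    for ch in text:
        if ch in candidates:
            return ch
    return None


print(repr(first_delimiter(sample, ",;")))
```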

Maybe I'm just getting old and cranky, but deprecation is the fork in
the road I'd take, given the choice. Second choice would be to
simplify the delimiter sniffing logic and get rid of anything to do
with line terminators.

Skip
History
Date                 User                  Action  Args
2018-11-08 21:16:12  skip.montanaro        set     messages: + msg329487
2018-11-08 17:27:53  Gertjan van den Burg  set     nosy: + Gertjan van den Burg; messages: + msg329482
2018-10-04 22:33:40  nascheme              set     nosy: + nascheme; messages: + msg327094
2018-02-07 21:55:37  terry.reedy           set     messages: + msg311806
2018-02-07 21:45:02  terry.reedy           set     messages: + msg311805
2018-02-07 21:40:43  terry.reedy           set     versions: + Python 3.8, - Python 3.6; nosy: + terry.reedy, skip.montanaro; messages: + msg311804; type: behavior -> enhancement
2017-07-08 02:26:13  terry.reedy           set     stage: test needed
2017-07-01 22:01:12  vmax                  set     pull_requests: + pull_request2595
2017-07-01 21:23:44  vmax                  create