classification
Title: csv sniffer does not recognize quotes at the end of line
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Martin.Budaj, andrewmcnamara, ned.deily, r.david.murray
Priority: normal Keywords: patch

Created on 2010-11-23 17:55 by Martin.Budaj, last changed 2019-03-15 22:45 by BreamoreBoy.

Files
File name Uploaded Description Edit
fix.tar.gz Martin.Budaj, 2010-11-24 21:06
test.py ned.deily, 2010-11-24 22:00 file 1 of 3 from fix.tar.gz
p1.patch ned.deily, 2010-11-24 22:01 file 2 of 3 from fix.tar.gz
p2.patch ned.deily, 2010-11-24 22:04 file 3 of 3 from tix.tar.gz
test_csv.patch Martin.Budaj, 2010-12-04 22:41
csv_delimiter_tests.patch r.david.murray, 2010-12-11 03:47 review
csv_delimiter_tests.patch skip.montanaro, 2010-12-19 16:00 review
Messages (16)
msg122227 - (view) Author: Martin Budaj (Martin.Budaj) Date: 2010-11-23 17:55
The method Sniffer._guess_quote_and_delimiter() in the module csv.py contains a bug in a regexp which checks for quotes around the last item of the line (example: a,b,"c,d"\n).

the pattern
'(?P<delim>>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'
used in the regex should be changed to
'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'

file csv.py: line 212 in python 2.6, line 216 in 2.7, line 207 in 3.1
msg122230 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-11-23 18:37
Thanks for the report. It would be helpful if you could supply a patch including a unit test for this against 3.2 and/or 2.7.  Note only security issues are accepted for 2.6.
msg122312 - (view) Author: Martin Budaj (Martin.Budaj) Date: 2010-11-24 21:06
Units test and two patches for 2.7 are included.

p1.patch fixes testEnd case reported yesterday

After running unittest it seems that also other case is broken (testAl -- if there is just one data item on the line, enclosed in quotes). Patch p2 fixes it, but should be checked for side-effects.
msg122316 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-11-24 22:04
(Thanks. I've unpacked and uploaded the three files from your tar file.)
msg123100 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-02 18:48
Are you sure testA1 is correct?  It seems to me that in that case the sniffer can indeed not determine the delimiter, but I don't really understand the guessing algorithm.  The existing behavior on unquoted strings is...interesting :)

Also if you are willing to take the time it would be helpful to have the tests as a patch against test_csv.
msg123406 - (view) Author: Martin Budaj (Martin.Budaj) Date: 2010-12-04 22:41
I'm not sure about what the intended behavior for testAl should be, however I think that the file should be recognized as having one column of data and no delimiter (there is a test for this case in csv.py) and not raise an exception.

I attach patch for test_csv.py tests. It includes two more tests which currently fail (both contain \r\n as a newline)
msg123772 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-11 03:46
What do you mean by "there is a test for this case in csv.py"?  If I run sniffer against "abcde\ndefgh\n" I get a delimiter of 'e'.  If I run it against 'a\nb\n', I get the could not determine delimiter error.

Attached is a reformulated test case that lets us see the errors individually and easily add more.  With this set of 10 tests, two pass without your patches (as you knew; those are your tests), and we have 4 failures and 4 errors.  With p1 applied we have 4 failures and 3 errors.  With p2 applied we have 5 failures.

As I said, I'm not at all sure what the results *should* be for various of these cases.  Certainly the long standing existing behavior seems to be to return one of the characters in an unquoted multicharacter sequence. 

That said, it seems like your patch p1 is good, but as you discovered not sufficient.
msg123773 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-11 03:47
Forgot to attach the patch.
msg123787 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2010-12-11 12:32
From the comment in the test_csv.py:

+        # XXX: I don't know what the correct behavior should be for these.
+        # Currently the first one raises an error that the delimiter can't
+        # be determined while the second one returns '\r'.  The second
+        # is obviously.
+        ('"a,b,c,d"\ne',  ''),
+        ('"a,b,c,d"\r\ne', ''),

Obviously what?  My guess would be "wrong".  In the absence of any other
information \r\n has to be treated as a line separator.  It shouldn't be
considered as two separate characters, even in such a devoid-of-clues test
case.

Is the empty string a valid delimiter?  I've never tried it and it's been a
long time since I looked at any of the code.  I do use single-column CSV
files from time-to-time though.  I don't think a ',' would be a completely
unreasonable fallback scenario either.

Skip
msg123791 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-11 14:10
Yeah, obviously wrong.  I forgot to finish editing the comment.

I think a fallback of ',' makes more sense than ''.  What would a delimiter of nothing mean?  I don't think the unquoted case can be changed for backward compatibility reasons, so those tests I added should presumably be changed to confirm the existing behavior (with an appropriate comment).
msg123949 - (view) Author: Martin Budaj (Martin.Budaj) Date: 2010-12-14 14:54
> What do you mean by "there is a test for this case in csv.py"?  

I meant test in regex on line 217 in python 2.7 and the following code (line 258ff):

# there is *no* delimiter, it's a single column of quoted data
delim = ''
skipinitialspace = 0

However, it is intended to detect just lines starting and ending with quotes.
msg124354 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2010-12-19 14:19
I get two failures with David's latest patch.  Abstracting from
a lightly modified patch:

    sample was: 'a,b,"c,d"\r\ne,f,g', got: '', expected: ','
    sample was: '"a,b,c,d"\r\ne', got: '\r', expected: ''

In both cases I think the expected delimiter should be comma.
The code or regular expressions need to be fixed for the first
failure.  As I indicated earlier \r\n should never be split.
Also, for a single column CSV file I think the expected/default
delimiter should be a comma.

That said, does anyone think I should change the expected delimiter
for the first two tests at this point for 2.7?  The inputs and
expected delimiters are

    ('abcde', 'e'),
    ('a', 'e'),

Both seem wrong to me, but I understand that the behavior of 2.7
probably shouldn't change to accommodate fixing broken test cases.

I'm also a bit concerned about this construct:

    locals()['some name'] = f

I thought locals() wasn't guaranteed to actually refer to a 
persistent dictionary.  That is, it's fine if locals() builds
its dictionary on-the-fly then discards it.  Do we know for
a fact that when called during a class definition that locals()
returns the class's dictionary?  (This might only matter if
other dialects of Python try to reuse test_csv.py.)
msg124359 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-19 15:16
I agree that the unquoted single column cases look weird. But changing it could affect other cases were there is more data.  I'll leave that problem to you :)  But I do not think that should be backported.

The locals trick I stole from Barry. I think there ought to be a way to get at the class dict as the class is being built. But you are right that the docs imply that locals is not guaranteed to be it.  Perhaps a topic for python-dev.
msg124361 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2010-12-19 15:55
Also, this comment in test_csv.py puzzles me:

        # given that all three lines in sample3 are equal,
        # I think that any character could have been 'guessed' as the
        # delimiter, depending on dictionary order

As a human looking at sample3 it's obvious the '?' should
be the delimiter.  I can understand it picking '/' instead,
but it shouldn't punt.  (I stumbled upon this while trying
to make "," be the default delimiter instead of None when
none of the regular expressions matched.)
msg124362 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2010-12-19 16:00
Here's my candidate patch.  Instead of returning an empty string
as the delimiter it returns a comma.
msg236542 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2015-02-24 20:46
@David could you check Skip's patch out please.  It seems identical to your own other than changing the delimiter from the empty string to a comma.
History
Date User Action Args
2019-03-15 22:45:33BreamoreBoysetnosy: - BreamoreBoy
2015-02-24 20:46:15BreamoreBoysetnosy: + BreamoreBoy
messages: + msg236542
2011-03-19 19:03:32skip.montanarosetassignee: skip.montanaro ->

nosy: - skip.montanaro
2010-12-19 16:00:41skip.montanarosetfiles: + csv_delimiter_tests.patch
nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
messages: + msg124362
2010-12-19 15:55:40skip.montanarosetnosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
messages: + msg124361
2010-12-19 15:16:07r.david.murraysetnosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
messages: + msg124359
2010-12-19 14:19:39skip.montanarosetnosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
messages: + msg124354
2010-12-16 08:56:27rhettingersetassignee: skip.montanaro
nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
2010-12-14 14:54:39Martin.Budajsetmessages: + msg123949
2010-12-11 14:10:41r.david.murraysetmessages: + msg123791
2010-12-11 12:32:23skip.montanarosetmessages: + msg123787
2010-12-11 03:47:57r.david.murraysetfiles: + csv_delimiter_tests.patch

messages: + msg123773
2010-12-11 03:46:52r.david.murraysetmessages: + msg123772
2010-12-04 22:41:06Martin.Budajsetfiles: + test_csv.patch

messages: + msg123406
2010-12-02 18:48:43r.david.murraysetnosy: + r.david.murray, skip.montanaro
messages: + msg123100
2010-11-24 22:04:22ned.deilysetfiles: + p2.patch

messages: + msg122316
stage: test needed -> patch review
2010-11-24 22:01:41ned.deilysetfiles: + p1.patch
keywords: + patch
2010-11-24 22:00:25ned.deilysetfiles: + test.py
2010-11-24 21:06:59Martin.Budajsetfiles: + fix.tar.gz

messages: + msg122312
2010-11-24 00:56:15andrewmcnamarasetnosy: + andrewmcnamara
2010-11-23 18:37:42ned.deilysetversions: + Python 3.2, - Python 2.6
nosy: + ned.deily

messages: + msg122230

stage: test needed
2010-11-23 17:55:40Martin.Budajcreate