Issue 10515: csv sniffer does not recognize quotes at the end of line

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54724

classification

Title:	csv sniffer does not recognize quotes at the end of line
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2, Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	csv.Sniffer.sniff() regex error View: 30157
Assigned To:		Nosy List:	Martin.Budaj, andrewmcnamara, ned.deily, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2010-11-23 17:55 by Martin.Budaj, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fix.tar.gz	Martin.Budaj, 2010-11-24 21:06
test.py	ned.deily, 2010-11-24 22:00	file 1 of 3 from fix.tar.gz
p1.patch	ned.deily, 2010-11-24 22:01	file 2 of 3 from fix.tar.gz
p2.patch	ned.deily, 2010-11-24 22:04	file 3 of 3 from tix.tar.gz
test_csv.patch	Martin.Budaj, 2010-12-04 22:41
csv_delimiter_tests.patch	r.david.murray, 2010-12-11 03:47		review
csv_delimiter_tests.patch	skip.montanaro, 2010-12-19 16:00		review

Messages (16)
msg122227 - (view)	Author: Martin Budaj (Martin.Budaj)	Date: 2010-11-23 17:55
The method Sniffer._guess_quote_and_delimiter() in the module csv.py contains a bug in a regexp which checks for quotes around the last item of the line (example: a,b,"c,d"\n). the pattern '(?P<delim>>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).?(?P=quote)(?:$\|\n)' used in the regex should be changed to '(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).?(?P=quote)(?:$\|\n)' file csv.py: line 212 in python 2.6, line 216 in 2.7, line 207 in 3.1
msg122230 - (view)	Author: Ned Deily (ned.deily) *	Date: 2010-11-23 18:37
Thanks for the report. It would be helpful if you could supply a patch including a unit test for this against 3.2 and/or 2.7. Note only security issues are accepted for 2.6.
msg122312 - (view)	Author: Martin Budaj (Martin.Budaj)	Date: 2010-11-24 21:06
Units test and two patches for 2.7 are included. p1.patch fixes testEnd case reported yesterday After running unittest it seems that also other case is broken (testAl -- if there is just one data item on the line, enclosed in quotes). Patch p2 fixes it, but should be checked for side-effects.
msg122316 - (view)	Author: Ned Deily (ned.deily) *	Date: 2010-11-24 22:04
(Thanks. I've unpacked and uploaded the three files from your tar file.)
msg123100 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-02 18:48
Are you sure testA1 is correct? It seems to me that in that case the sniffer can indeed not determine the delimiter, but I don't really understand the guessing algorithm. The existing behavior on unquoted strings is...interesting :) Also if you are willing to take the time it would be helpful to have the tests as a patch against test_csv.
msg123406 - (view)	Author: Martin Budaj (Martin.Budaj)	Date: 2010-12-04 22:41
I'm not sure about what the intended behavior for testAl should be, however I think that the file should be recognized as having one column of data and no delimiter (there is a test for this case in csv.py) and not raise an exception. I attach patch for test_csv.py tests. It includes two more tests which currently fail (both contain \r\n as a newline)
msg123772 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-11 03:46
What do you mean by "there is a test for this case in csv.py"? If I run sniffer against "abcde\ndefgh\n" I get a delimiter of 'e'. If I run it against 'a\nb\n', I get the could not determine delimiter error. Attached is a reformulated test case that lets us see the errors individually and easily add more. With this set of 10 tests, two pass without your patches (as you knew; those are your tests), and we have 4 failures and 4 errors. With p1 applied we have 4 failures and 3 errors. With p2 applied we have 5 failures. As I said, I'm not at all sure what the results should be for various of these cases. Certainly the long standing existing behavior seems to be to return one of the characters in an unquoted multicharacter sequence. That said, it seems like your patch p1 is good, but as you discovered not sufficient.
msg123773 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-11 03:47
Forgot to attach the patch.
msg123787 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-12-11 12:32
From the comment in the test_csv.py: + # XXX: I don't know what the correct behavior should be for these. + # Currently the first one raises an error that the delimiter can't + # be determined while the second one returns '\r'. The second + # is obviously. + ('"a,b,c,d"\ne', ''), + ('"a,b,c,d"\r\ne', ''), Obviously what? My guess would be "wrong". In the absence of any other information \r\n has to be treated as a line separator. It shouldn't be considered as two separate characters, even in such a devoid-of-clues test case. Is the empty string a valid delimiter? I've never tried it and it's been a long time since I looked at any of the code. I do use single-column CSV files from time-to-time though. I don't think a ',' would be a completely unreasonable fallback scenario either. Skip
msg123791 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-11 14:10
Yeah, obviously wrong. I forgot to finish editing the comment. I think a fallback of ',' makes more sense than ''. What would a delimiter of nothing mean? I don't think the unquoted case can be changed for backward compatibility reasons, so those tests I added should presumably be changed to confirm the existing behavior (with an appropriate comment).
msg123949 - (view)	Author: Martin Budaj (Martin.Budaj)	Date: 2010-12-14 14:54
> What do you mean by "there is a test for this case in csv.py"? I meant test in regex on line 217 in python 2.7 and the following code (line 258ff): # there is no delimiter, it's a single column of quoted data delim = '' skipinitialspace = 0 However, it is intended to detect just lines starting and ending with quotes.
msg124354 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-12-19 14:19
I get two failures with David's latest patch. Abstracting from a lightly modified patch: sample was: 'a,b,"c,d"\r\ne,f,g', got: '', expected: ',' sample was: '"a,b,c,d"\r\ne', got: '\r', expected: '' In both cases I think the expected delimiter should be comma. The code or regular expressions need to be fixed for the first failure. As I indicated earlier \r\n should never be split. Also, for a single column CSV file I think the expected/default delimiter should be a comma. That said, does anyone think I should change the expected delimiter for the first two tests at this point for 2.7? The inputs and expected delimiters are ('abcde', 'e'), ('a', 'e'), Both seem wrong to me, but I understand that the behavior of 2.7 probably shouldn't change to accommodate fixing broken test cases. I'm also a bit concerned about this construct: locals()['some name'] = f I thought locals() wasn't guaranteed to actually refer to a persistent dictionary. That is, it's fine if locals() builds its dictionary on-the-fly then discards it. Do we know for a fact that when called during a class definition that locals() returns the class's dictionary? (This might only matter if other dialects of Python try to reuse test_csv.py.)
msg124359 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-12-19 15:16
I agree that the unquoted single column cases look weird. But changing it could affect other cases were there is more data. I'll leave that problem to you :) But I do not think that should be backported. The locals trick I stole from Barry. I think there ought to be a way to get at the class dict as the class is being built. But you are right that the docs imply that locals is not guaranteed to be it. Perhaps a topic for python-dev.
msg124361 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-12-19 15:55
Also, this comment in test_csv.py puzzles me: # given that all three lines in sample3 are equal, # I think that any character could have been 'guessed' as the # delimiter, depending on dictionary order As a human looking at sample3 it's obvious the '?' should be the delimiter. I can understand it picking '/' instead, but it shouldn't punt. (I stumbled upon this while trying to make "," be the default delimiter instead of None when none of the regular expressions matched.)
msg124362 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2010-12-19 16:00
Here's my candidate patch. Instead of returning an empty string as the delimiter it returns a comma.
msg236542 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2015-02-24 20:46
@David could you check Skip's patch out please. It seems identical to your own other than changing the delimiter from the empty string to a comma.

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54724
2022-03-01 20:03:14	iritkatriel	set	status: open -> closed superseder: csv.Sniffer.sniff() regex error resolution: duplicate stage: patch review -> resolved
2019-03-15 22:45:33	BreamoreBoy	set	nosy: - BreamoreBoy
2015-02-24 20:46:15	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg236542
2011-03-19 19:03:32	skip.montanaro	set	assignee: skip.montanaro -> nosy: - skip.montanaro
2010-12-19 16:00:41	skip.montanaro	set	files: + csv_delimiter_tests.patch nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages: + msg124362
2010-12-19 15:55:40	skip.montanaro	set	nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages: + msg124361
2010-12-19 15:16:07	r.david.murray	set	nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages: + msg124359
2010-12-19 14:19:39	skip.montanaro	set	nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages: + msg124354
2010-12-16 08:56:27	rhettinger	set	assignee: skip.montanaro nosy: skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj
2010-12-14 14:54:39	Martin.Budaj	set	messages: + msg123949
2010-12-11 14:10:41	r.david.murray	set	messages: + msg123791
2010-12-11 12:32:23	skip.montanaro	set	messages: + msg123787
2010-12-11 03:47:57	r.david.murray	set	files: + csv_delimiter_tests.patch messages: + msg123773
2010-12-11 03:46:52	r.david.murray	set	messages: + msg123772
2010-12-04 22:41:06	Martin.Budaj	set	files: + test_csv.patch messages: + msg123406
2010-12-02 18:48:43	r.david.murray	set	nosy: + r.david.murray, skip.montanaro messages: + msg123100
2010-11-24 22:04:22	ned.deily	set	files: + p2.patch messages: + msg122316 stage: test needed -> patch review
2010-11-24 22:01:41	ned.deily	set	files: + p1.patch keywords: + patch
2010-11-24 22:00:25	ned.deily	set	files: + test.py
2010-11-24 21:06:59	Martin.Budaj	set	files: + fix.tar.gz messages: + msg122312
2010-11-24 00:56:15	andrewmcnamara	set	nosy: + andrewmcnamara
2010-11-23 18:37:42	ned.deily	set	versions: + Python 3.2, - Python 2.6 nosy: + ned.deily messages: + msg122230 stage: test needed
2010-11-23 17:55:40	Martin.Budaj	create