msg122227 - (view) |
Author: Martin Budaj (Martin.Budaj) |
Date: 2010-11-23 17:55 |
The method Sniffer._guess_quote_and_delimiter() in the module csv.py contains a bug in a regexp which checks for quotes around the last item of the line (example: a,b,"c,d"\n).
the pattern
'(?P<delim>>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'
used in the regex should be changed to
'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'
file csv.py: line 212 in python 2.6, line 216 in 2.7, line 207 in 3.1
|
msg122230 - (view) |
Author: Ned Deily (ned.deily) * |
Date: 2010-11-23 18:37 |
Thanks for the report. It would be helpful if you could supply a patch including a unit test for this against 3.2 and/or 2.7. Note only security issues are accepted for 2.6.
|
msg122312 - (view) |
Author: Martin Budaj (Martin.Budaj) |
Date: 2010-11-24 21:06 |
Units test and two patches for 2.7 are included.
p1.patch fixes testEnd case reported yesterday
After running unittest it seems that also other case is broken (testAl -- if there is just one data item on the line, enclosed in quotes). Patch p2 fixes it, but should be checked for side-effects.
|
msg122316 - (view) |
Author: Ned Deily (ned.deily) * |
Date: 2010-11-24 22:04 |
(Thanks. I've unpacked and uploaded the three files from your tar file.)
|
msg123100 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2010-12-02 18:48 |
Are you sure testA1 is correct? It seems to me that in that case the sniffer can indeed not determine the delimiter, but I don't really understand the guessing algorithm. The existing behavior on unquoted strings is...interesting :)
Also if you are willing to take the time it would be helpful to have the tests as a patch against test_csv.
|
msg123406 - (view) |
Author: Martin Budaj (Martin.Budaj) |
Date: 2010-12-04 22:41 |
I'm not sure about what the intended behavior for testAl should be, however I think that the file should be recognized as having one column of data and no delimiter (there is a test for this case in csv.py) and not raise an exception.
I attach patch for test_csv.py tests. It includes two more tests which currently fail (both contain \r\n as a newline)
|
msg123772 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2010-12-11 03:46 |
What do you mean by "there is a test for this case in csv.py"? If I run sniffer against "abcde\ndefgh\n" I get a delimiter of 'e'. If I run it against 'a\nb\n', I get the could not determine delimiter error.
Attached is a reformulated test case that lets us see the errors individually and easily add more. With this set of 10 tests, two pass without your patches (as you knew; those are your tests), and we have 4 failures and 4 errors. With p1 applied we have 4 failures and 3 errors. With p2 applied we have 5 failures.
As I said, I'm not at all sure what the results *should* be for various of these cases. Certainly the long standing existing behavior seems to be to return one of the characters in an unquoted multicharacter sequence.
That said, it seems like your patch p1 is good, but as you discovered not sufficient.
|
msg123773 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2010-12-11 03:47 |
Forgot to attach the patch.
|
msg123787 - (view) |
Author: Skip Montanaro (skip.montanaro) * |
Date: 2010-12-11 12:32 |
From the comment in the test_csv.py:
+ # XXX: I don't know what the correct behavior should be for these.
+ # Currently the first one raises an error that the delimiter can't
+ # be determined while the second one returns '\r'. The second
+ # is obviously.
+ ('"a,b,c,d"\ne', ''),
+ ('"a,b,c,d"\r\ne', ''),
Obviously what? My guess would be "wrong". In the absence of any other
information \r\n has to be treated as a line separator. It shouldn't be
considered as two separate characters, even in such a devoid-of-clues test
case.
Is the empty string a valid delimiter? I've never tried it and it's been a
long time since I looked at any of the code. I do use single-column CSV
files from time-to-time though. I don't think a ',' would be a completely
unreasonable fallback scenario either.
Skip
|
msg123791 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2010-12-11 14:10 |
Yeah, obviously wrong. I forgot to finish editing the comment.
I think a fallback of ',' makes more sense than ''. What would a delimiter of nothing mean? I don't think the unquoted case can be changed for backward compatibility reasons, so those tests I added should presumably be changed to confirm the existing behavior (with an appropriate comment).
|
msg123949 - (view) |
Author: Martin Budaj (Martin.Budaj) |
Date: 2010-12-14 14:54 |
> What do you mean by "there is a test for this case in csv.py"?
I meant test in regex on line 217 in python 2.7 and the following code (line 258ff):
# there is *no* delimiter, it's a single column of quoted data
delim = ''
skipinitialspace = 0
However, it is intended to detect just lines starting and ending with quotes.
|
msg124354 - (view) |
Author: Skip Montanaro (skip.montanaro) * |
Date: 2010-12-19 14:19 |
I get two failures with David's latest patch. Abstracting from
a lightly modified patch:
sample was: 'a,b,"c,d"\r\ne,f,g', got: '', expected: ','
sample was: '"a,b,c,d"\r\ne', got: '\r', expected: ''
In both cases I think the expected delimiter should be comma.
The code or regular expressions need to be fixed for the first
failure. As I indicated earlier \r\n should never be split.
Also, for a single column CSV file I think the expected/default
delimiter should be a comma.
That said, does anyone think I should change the expected delimiter
for the first two tests at this point for 2.7? The inputs and
expected delimiters are
('abcde', 'e'),
('a', 'e'),
Both seem wrong to me, but I understand that the behavior of 2.7
probably shouldn't change to accommodate fixing broken test cases.
I'm also a bit concerned about this construct:
locals()['some name'] = f
I thought locals() wasn't guaranteed to actually refer to a
persistent dictionary. That is, it's fine if locals() builds
its dictionary on-the-fly then discards it. Do we know for
a fact that when called during a class definition that locals()
returns the class's dictionary? (This might only matter if
other dialects of Python try to reuse test_csv.py.)
|
msg124359 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2010-12-19 15:16 |
I agree that the unquoted single column cases look weird. But changing it could affect other cases were there is more data. I'll leave that problem to you :) But I do not think that should be backported.
The locals trick I stole from Barry. I think there ought to be a way to get at the class dict as the class is being built. But you are right that the docs imply that locals is not guaranteed to be it. Perhaps a topic for python-dev.
|
msg124361 - (view) |
Author: Skip Montanaro (skip.montanaro) * |
Date: 2010-12-19 15:55 |
Also, this comment in test_csv.py puzzles me:
# given that all three lines in sample3 are equal,
# I think that any character could have been 'guessed' as the
# delimiter, depending on dictionary order
As a human looking at sample3 it's obvious the '?' should
be the delimiter. I can understand it picking '/' instead,
but it shouldn't punt. (I stumbled upon this while trying
to make "," be the default delimiter instead of None when
none of the regular expressions matched.)
|
msg124362 - (view) |
Author: Skip Montanaro (skip.montanaro) * |
Date: 2010-12-19 16:00 |
Here's my candidate patch. Instead of returning an empty string
as the delimiter it returns a comma.
|
msg236542 - (view) |
Author: Mark Lawrence (BreamoreBoy) * |
Date: 2015-02-24 20:46 |
@David could you check Skip's patch out please. It seems identical to your own other than changing the delimiter from the empty string to a comma.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:09 | admin | set | github: 54724 |
2022-03-01 20:03:14 | iritkatriel | set | status: open -> closed superseder: csv.Sniffer.sniff() regex error resolution: duplicate stage: patch review -> resolved |
2019-03-15 22:45:33 | BreamoreBoy | set | nosy:
- BreamoreBoy
|
2015-02-24 20:46:15 | BreamoreBoy | set | nosy:
+ BreamoreBoy messages:
+ msg236542
|
2011-03-19 19:03:32 | skip.montanaro | set | assignee: skip.montanaro ->
nosy:
- skip.montanaro |
2010-12-19 16:00:41 | skip.montanaro | set | files:
+ csv_delimiter_tests.patch nosy:
skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages:
+ msg124362
|
2010-12-19 15:55:40 | skip.montanaro | set | nosy:
skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages:
+ msg124361 |
2010-12-19 15:16:07 | r.david.murray | set | nosy:
skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages:
+ msg124359 |
2010-12-19 14:19:39 | skip.montanaro | set | nosy:
skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj messages:
+ msg124354 |
2010-12-16 08:56:27 | rhettinger | set | assignee: skip.montanaro nosy:
skip.montanaro, andrewmcnamara, ned.deily, r.david.murray, Martin.Budaj |
2010-12-14 14:54:39 | Martin.Budaj | set | messages:
+ msg123949 |
2010-12-11 14:10:41 | r.david.murray | set | messages:
+ msg123791 |
2010-12-11 12:32:23 | skip.montanaro | set | messages:
+ msg123787 |
2010-12-11 03:47:57 | r.david.murray | set | files:
+ csv_delimiter_tests.patch
messages:
+ msg123773 |
2010-12-11 03:46:52 | r.david.murray | set | messages:
+ msg123772 |
2010-12-04 22:41:06 | Martin.Budaj | set | files:
+ test_csv.patch
messages:
+ msg123406 |
2010-12-02 18:48:43 | r.david.murray | set | nosy:
+ r.david.murray, skip.montanaro messages:
+ msg123100
|
2010-11-24 22:04:22 | ned.deily | set | files:
+ p2.patch
messages:
+ msg122316 stage: test needed -> patch review |
2010-11-24 22:01:41 | ned.deily | set | files:
+ p1.patch keywords:
+ patch |
2010-11-24 22:00:25 | ned.deily | set | files:
+ test.py |
2010-11-24 21:06:59 | Martin.Budaj | set | files:
+ fix.tar.gz
messages:
+ msg122312 |
2010-11-24 00:56:15 | andrewmcnamara | set | nosy:
+ andrewmcnamara
|
2010-11-23 18:37:42 | ned.deily | set | versions:
+ Python 3.2, - Python 2.6 nosy:
+ ned.deily
messages:
+ msg122230
stage: test needed |
2010-11-23 17:55:40 | Martin.Budaj | create | |