Issue 29405: improve csv.Sniffer().sniff() behavior

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73591

classification

Title:	improve csv.Sniffer().sniff() behavior
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	berker.peksag, mepstein, python-dev, xiang.zhang
Priority:	normal	Keywords:	patch

Created on 2017-02-01 00:26 by mepstein, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
csv_sniffer.patch	xiang.zhang, 2017-02-04 06:11		review
csv.py.patch	mepstein, 2017-02-04 07:33

Messages (9)
msg286570 - (view)	Author: Milt Epstein (mepstein)	Date: 2017-02-01 00:26
I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files. Through some trial and error, many "Could not determine delimiter" errors, and analyzing how this routine works/behaves, I settled on sample_data being some number of lines of the input file, particularly 30. This value seems to allow the routine to work more frequently, although not always, particularly on short input files. I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases. Currently the function _guess_delimiter() in csv.py contains the following lines: # build a list of possible delimiters modeList = modes.items() total = float(chunkLength * iteration) So total is increased by chunkLength on each iteration. The problem occurs when total becomes greater than the length of sample_data, that is, the iteration would go beyond the end of sample_data. That reading is handled fine, it's truncated at the end of sample_data, but total is needlessly set too high. My suggested change is to add the following two lines after the above: if total > len(data): total = float(len(data))
msg286571 - (view)	Author: Milt Epstein (mepstein)	Date: 2017-02-01 00:35
FWIW, it might be more concise and more consistent with the existing code to change the one line to: total = min(float(chunkLength * iteration), float(len(data)))
msg286913 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-02-04 06:11
Sounds reasonable. IIUC if the sample data gets 11 lines the total could be 20. I also think the second min is redundant. Would you mind review my patch Milt?
msg286923 - (view)	Author: Milt Epstein (mepstein)	Date: 2017-02-04 07:33
That's right, with 11 lines in the sample data, total will become 20 on the second iteration. And that throws off some of the computations done in that function. Your patch looks good, in that it will achieve what I'm requesting. But :-), your pointing out that other redundant min() made me take a closer look at the code, and led me to produce the attached patch as an alternate suggestion. I think it makes the code a bit more sensible and cleaner. Please review, and go with what you think best. Thanks.
msg287070 - (view)	Author: Roundup Robot (python-dev)	Date: 2017-02-06 02:51
New changeset 724d1aa7589b by Xiang Zhang in branch 'default': Issue #29405: Make total calculation in _guess_delimiter more accurate. https://hg.python.org/cpython/rev/724d1aa7589b
msg287071 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-02-06 02:53
Thanks Milt. I committed with my change not because it's better, but I want to make the change small so others won't get unfamiliar with the new code. :-)
msg287072 - (view)	Author: Roundup Robot (python-dev)	Date: 2017-02-06 03:00
New changeset 6d0a09623155710680ff19f05f279d45c007a304 by Xiang Zhang in branch 'master': Issue #29405: Make total calculation in _guess_delimiter more accurate. https://github.com/python/cpython/commit/6d0a09623155710680ff19f05f279d45c007a304
msg287073 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2017-02-06 03:15
> [...] but I want to make the change small so others won't get unfamiliar with the new code. :-) Assuming it doesn't cause any behavior changes, I find Milt's patch simple enough and easier to understand than the version uses 'iteration' variable.
msg287074 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-02-06 03:38
I am fine with any version (both are simple and not the hardest part to understand in the logic). :-) I have no opinion on which is better.

History
Date	User	Action	Args
2022-04-11 14:58:42	admin	set	github: 73591
2017-02-06 03:38:39	xiang.zhang	set	messages: + msg287074
2017-02-06 03:15:44	berker.peksag	set	nosy: + berker.peksag messages: + msg287073
2017-02-06 03:00:22	python-dev	set	messages: + msg287072
2017-02-06 02:53:45	xiang.zhang	set	status: open -> closed resolution: fixed messages: + msg287071 stage: commit review -> resolved
2017-02-06 02:51:04	python-dev	set	nosy: + python-dev messages: + msg287070
2017-02-04 07:36:24	xiang.zhang	set	stage: patch review -> commit review
2017-02-04 07:33:35	mepstein	set	files: + csv.py.patch messages: + msg286923
2017-02-04 06:11:41	xiang.zhang	set	files: + csv_sniffer.patch versions: + Python 3.7, - Python 3.5 messages: + msg286913 keywords: + patch type: behavior -> enhancement stage: patch review
2017-02-03 17:05:06	xiang.zhang	set	nosy: + xiang.zhang
2017-02-01 00:35:04	mepstein	set	messages: + msg286571
2017-02-01 00:26:25	mepstein	create