classification
Title: improve csv.Sniffer().sniff() behavior
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: berker.peksag, mepstein, python-dev, xiang.zhang
Priority: normal Keywords: patch

Created on 2017-02-01 00:26 by mepstein, last changed 2017-02-06 03:38 by xiang.zhang. This issue is now closed.

Files
File name Uploaded Description Edit
csv_sniffer.patch xiang.zhang, 2017-02-04 06:11 review
csv.py.patch mepstein, 2017-02-04 07:33
Messages (9)
msg286570 - (view) Author: Milt Epstein (mepstein) Date: 2017-02-01 00:26
I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files.  Through some trial and error, many "Could not determine delimiter" errors, and analyzing how this routine works/behaves, I settled on sample_data being some number of lines of the input file, particularly 30.  This value seems to allow the routine to work more frequently, although not always, particularly on short input files.

I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases.  Currently the function _guess_delimiter() in csv.py contains the following lines:

            # build a list of possible delimiters
            modeList = modes.items()
            total = float(chunkLength * iteration)

So total is increased by chunkLength on each iteration.  The problem occurs when total becomes greater than the length of sample_data, that is, the iteration would go beyond the end of sample_data.  That reading is handled fine, it's truncated at the end of sample_data, but total is needlessly set too high.  My suggested change is to add the following two lines after the above:

            if total > len(data):
                total = float(len(data))
msg286571 - (view) Author: Milt Epstein (mepstein) Date: 2017-02-01 00:35
FWIW, it might be more concise and more consistent with the existing code to change the one line to:

            total = min(float(chunkLength * iteration), float(len(data)))
msg286913 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-02-04 06:11
Sounds reasonable. IIUC if the sample data gets 11 lines the total could be 20. I also think the second min is redundant. Would you mind review my patch Milt?
msg286923 - (view) Author: Milt Epstein (mepstein) Date: 2017-02-04 07:33
That's right, with 11 lines in the sample data, total will become 20 on the second iteration.  And that throws off some of the computations done in that function.

Your patch looks good, in that it will achieve what I'm requesting.  But :-), your pointing out that other redundant min() made me take a closer look at the code, and led me to produce the attached patch as an alternate suggestion.  I think it makes the code a bit more sensible and cleaner.  Please review, and go with what you think best.

Thanks.
msg287070 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-02-06 02:51
New changeset 724d1aa7589b by Xiang Zhang in branch 'default':
Issue #29405: Make total calculation in _guess_delimiter more accurate.
https://hg.python.org/cpython/rev/724d1aa7589b
msg287071 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-02-06 02:53
Thanks Milt. I committed with my change not because it's better, but I want to make the change small so others won't get unfamiliar with the new code. :-)
msg287072 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-02-06 03:00
New changeset 6d0a09623155710680ff19f05f279d45c007a304 by Xiang Zhang in branch 'master':
Issue #29405: Make total calculation in _guess_delimiter more accurate.
https://github.com/python/cpython/commit/6d0a09623155710680ff19f05f279d45c007a304
msg287073 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2017-02-06 03:15
> [...] but I want to make the change small so others won't get unfamiliar with the new code. :-)

Assuming it doesn't cause any behavior changes, I find Milt's patch simple enough and easier to understand than the version uses 'iteration' variable.
msg287074 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-02-06 03:38
I am fine with any version (both are simple and not the hardest part to understand in the logic). :-) I have no opinion on which is better.
History
Date User Action Args
2017-02-06 03:38:39xiang.zhangsetmessages: + msg287074
2017-02-06 03:15:44berker.peksagsetnosy: + berker.peksag
messages: + msg287073
2017-02-06 03:00:22python-devsetmessages: + msg287072
2017-02-06 02:53:45xiang.zhangsetstatus: open -> closed
resolution: fixed
messages: + msg287071

stage: commit review -> resolved
2017-02-06 02:51:04python-devsetnosy: + python-dev
messages: + msg287070
2017-02-04 07:36:24xiang.zhangsetstage: patch review -> commit review
2017-02-04 07:33:35mepsteinsetfiles: + csv.py.patch

messages: + msg286923
2017-02-04 06:11:41xiang.zhangsetfiles: + csv_sniffer.patch
versions: + Python 3.7, - Python 3.5
messages: + msg286913

keywords: + patch
type: behavior -> enhancement
stage: patch review
2017-02-03 17:05:06xiang.zhangsetnosy: + xiang.zhang
2017-02-01 00:35:04mepsteinsetmessages: + msg286571
2017-02-01 00:26:25mepsteincreate