
Author mepstein
Recipients mepstein
Date 2017-02-01.00:26:24
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1485908785.08.0.552892497325.issue29405@psf.upfronthosting.co.za>
In-reply-to
Content
I'm trying to use csv.Sniffer().sniff(sample_data) to determine the delimiter on a number of input files.  Through some trial and error, many "Could not determine delimiter" errors, and analysis of how this routine behaves, I settled on using a fixed number of lines of the input file as sample_data, specifically 30.  This value seems to let the routine succeed more often, although not always, particularly on short input files.
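For reference, this is the kind of call I mean (the sample text here is made up; in my case sample_data is the first lines read from each input file):

```python
import csv

# Feed Sniffer a multi-line sample and read back the detected delimiter.
# With a short, irregular sample, sniff() can instead raise
# csv.Error("Could not determine delimiter").
sample_text = "name,age,city\nalice,30,nyc\nbob,25,sfo\ncarol,35,chi\n"
dialect = csv.Sniffer().sniff(sample_text)
print(dialect.delimiter)  # ','
```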

I realize the way this routine works is somewhat idiosyncratic, and it won't be so easy to improve it generally, but there's one simple change that occurred to me that would help in some cases.  Currently the function _guess_delimiter() in csv.py contains the following lines:

            # build a list of possible delimiters
            modeList = modes.items()
            total = float(chunkLength * iteration)

So total is increased by chunkLength on each iteration.  The problem occurs when total becomes greater than the number of lines in sample_data, that is, when the iteration would read past the end of sample_data.  The read itself is handled fine (it is simply truncated at the end of sample_data), but total is needlessly set too high, which inflates the denominator used in the consistency check.  My suggested change is to add the following two lines after the above:

            if total > len(data):
                total = float(len(data))
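To illustrate the effect, here is a standalone sketch (not the csv module code itself; the line counts are made up).  _guess_delimiter() processes the sample in chunks of up to 10 lines and computes total = chunkLength * iteration, so a 25-line sample reports total = 30.0 on the final pass.  A candidate delimiter appearing once per line then scores 25/30 ≈ 0.83, below the 0.9 consistency threshold the routine starts with, and is wrongly rejected; with the clamp it scores 25/25 = 1.0:

```python
# Sketch of the overshoot and the proposed clamp (hypothetical numbers).
chunkLength = 10          # lines per pass, as in csv.py's _guess_delimiter
num_lines = 25            # a 25-line sample needs three passes

for iteration in range(1, 4):
    total = float(chunkLength * iteration)
    # proposed fix: total should never exceed the real line count
    if total > num_lines:
        total = float(num_lines)
    print(iteration, total)
# prints: 1 10.0 / 2 20.0 / 3 25.0  (the last pass would otherwise be 30.0)
```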
History
Date User Action Args
2017-02-01 00:26:25  mepstein  set  recipients: + mepstein
2017-02-01 00:26:25  mepstein  set  messageid: <1485908785.08.0.552892497325.issue29405@psf.upfronthosting.co.za>
2017-02-01 00:26:25  mepstein  link  issue29405 messages
2017-02-01 00:26:24  mepstein  create