
Author Tiago Wright
Recipients Tiago Wright, peter.otten, skip.montanaro
Date 2015-08-04.21:49:11
Message-id <CAFxr9VpCq6PqwT94XJEY4_H4JbfuWg=8BJm4fL87PcLkwN_BxA@mail.gmail.com>
In-reply-to <1438702153.35.0.974172705699.issue24787@psf.upfronthosting.co.za>
Content
I agree that the parameters are easily deduced for any one CSV file after a
quick inspection. The reason I went searching for a good sniffer is that I
have ~2100 CSV files in slightly different formats coming from different
sources. In some cases a CSV file is sent directly to me, other times it
is first opened in Excel and saved, and other times it is copy-pasted from
Excel into another location, yielding 3 variations on the formatting from a
single source. Multiply that by 8 different sources of data.

For hacking disparate data sources together, it is more useful to have a
sniffer that reliably distinguishes among the most common dialects of CSV
than one that tries to deduce the parameters of a rare or unknown format.
I agree with you that it is rare for the format to be completely unknown.
More likely, you know it is one of two or three possible options and don't
want to have to inspect each file to find out which.
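
To make that concrete, here is a rough sketch of the kind of helper I have
in mind. It is not anything the csv module provides: pick_delimiter and its
scoring rule are made up for illustration, and the default candidate list is
just an assumption about which delimiters are common. It parses the sample
once per candidate and keeps the one that gives the most consistent row
width.

```python
import csv
import io
from collections import Counter


def pick_delimiter(sample, candidates=",\t|;"):
    """Return the candidate that splits the sample most consistently.

    Hypothetical helper, not part of the csv module. Returns None when no
    candidate splits the typical row into at least two fields.
    """
    best, best_score = None, (0.0, 0)
    for delim in candidates:
        # Parse the whole sample with this candidate; skip empty rows.
        rows = [r for r in csv.reader(io.StringIO(sample), delimiter=delim) if r]
        if not rows:
            continue
        # Find the most common row width and how many rows share it.
        width, count = Counter(len(r) for r in rows).most_common(1)[0]
        if width < 2:
            continue  # this candidate does not split the typical row
        # Prefer a higher share of consistent rows, then more columns.
        score = (count / len(rows), width)
        if score > best_score:
            best, best_score = delim, score
    return best
```

Something like pick_delimiter(open(path, newline="").read(4096)) would then
be enough to sort files into the two or three dialects one already expects,
without trying to guess at anything exotic.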

Unfortunately, with no restrictions the sniffer picks 'M' as the delimiter
for the sample below, and limiting the candidates to a few common ones with
the delimiters parameter just raises an error:

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """).delimiter
'M'
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """, delimiters=",\t|^").delimiter
Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 189, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter
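
For now the only workaround I can see is to wrap the call and fall back to
a default when the sniffer gives up. A minimal sketch, where sniff_delimiter
and its default of "," are my own hypothetical choices rather than anything
in the csv module:

```python
import csv


def sniff_delimiter(sample, allowed=",\t|^", default=","):
    """Sniff a delimiter restricted to `allowed`, else return `default`.

    Hypothetical wrapper; it only relies on csv.Sniffer().sniff() and on
    csv.Error being raised when no delimiter can be determined.
    """
    try:
        found = csv.Sniffer().sniff(sample, delimiters=allowed).delimiter
    except csv.Error:
        return default
    # Guard against an empty or unexpected result before trusting it.
    if found and found in allowed:
        return found
    return default
```

This only hides the failure, of course. It does not make the sniffer any
better at telling the common dialects apart.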

On Tue, Aug 4, 2015 at 8:29 AM Skip Montanaro <report@bugs.python.org>
wrote:

>
> Skip Montanaro added the comment:
>
> I should have probably pointed out that the Sniffer class is the unloved
> stepchild of the csv module. In my experience it is rarely necessary. You
> either:
>
> * Are reading CSV files which are about what Excel would produce with its
> default settings
>
> or
>
> * Know just what your format is, and can define the various parameters
> easily
>
> It's pretty rare, I think, to get a delimited file in some format which is
> completely unknown and which thus has to be deduced.
>
> As Peter showed, the Sniffer class is also kind of unreliable. I didn't
> write it, and there are precious few test cases for it. One of your
> datasets should probably be added to the mix and bugs fixed.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
>
History
Date                 User          Action         Args
2015-08-04 21:49:12  Tiago Wright  setrecipients  + Tiago Wright, skip.montanaro, peter.otten
2015-08-04 21:49:12  Tiago Wright  link           issue24787 messages
2015-08-04 21:49:11  Tiago Wright  create