This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: csv.Sniffer guesses "M" instead of \t or , as the delimiter
Type: behavior Stage:
Components: Extension Modules Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Tiago Wright, peter.otten, r.david.murray, skip.montanaro
Priority: normal Keywords:

Created on 2015-08-04 00:05 by Tiago Wright, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
csvsniffertest3.htm Tiago Wright, 2015-08-06 15:54
csvsniffertest3.txt Tiago Wright, 2015-08-06 19:29
csvsniffertest5.txt Tiago Wright, 2015-08-07 00:45
testround8.py Tiago Wright, 2015-08-07 22:28
Messages (16)
msg247967 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-04 00:05
csv.Sniffer().sniff() guesses "M" for the delimiter of the first dataset below. The same error occurs when the "," is replaced by "\t". However, it correctly guesses "," for the second dataset.

---Dataset 1----
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
Sscanner ac15072911261.pdf,CM_14464,1713.6,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,3114,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,1232.28,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,15232,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Apr,,,

--- Dataset 2---
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,82.07,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,23.29,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,88.55,MX Cavalier,948200,Photos don't match the invoice,
Sscanner ac15072911261.pdf,CM_14464,58.78,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,52,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,40.40,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,54.97,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,4.08,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,49.11,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,18.28,MX Apr,,,
msg247973 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2015-08-04 01:48
How are you calling the sniff() method? Note that it takes a sample of the CSV file. For example, this works for me:

>>> f = open("sniff1.csv")
>>> dialect = csv.Sniffer().sniff(next(open("sniff1.csv")))
>>> dialect.delimiter 
','
>>> dialect.lineterminator
'\r\n'

where sniff1.csv is your Dataset 1. (I think for reliable operation you really want your sample to be a multiple of whole lines.)
msg247986 - (view) Author: Peter Otten (peter.otten) * Date: 2015-08-04 07:20
The sniffer actually changes its "mind" in the fourth line:

Python 3.4.0 (default, Jun 19 2015, 14:20:21) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... """).delimiter
','
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
... """).delimiter
'M'

That line has only 5 commas while all others have 6. Unfortunately all lines contain exactly two "M"...
msg247990 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2015-08-04 15:29
I should have probably pointed out that the Sniffer class is the unloved stepchild of the csv module. In my experience it is rarely necessary. You either:

* Are reading CSV files which are about what Excel would produce with its default settings

or

* Know just what your format is, and can define the various parameters easily

It's pretty rare, I think, to get a delimited file in some format which is completely unknown and which thus has to be deduced.

As Peter showed, the Sniffer class is also kind of unreliable. I didn't write it, and there are precious few test cases for it. One of your datasets should probably be added to the mix and bugs fixed.
msg248005 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-04 21:49
I agree that the parameters are easily deduced for any one csv file after a
quick inspection. The reason I went searching for a good sniffer was that I
have ~2100 csv files of slightly different formats coming from different
sources. In some cases, a csv file is sent directly to me, other times it
is first opened in excel and saved, and other times it is copy-pasted from
excel into another location, yielding 3 variations on the formatting from a
single source. Multiply that by 8 different sources of data.

For hacking disparate data sources together, it is more interesting to have
a sniffer that works really well to distinguish among the most common
dialects of csv, than one that tries to deduce the parameters of a rare or
unknown format. I agree with you that it would be a rare case that the
format is completely unknown -- more likely, you know it is one of two or
three possible options and don't want to have to inspect each file to find
out which.

Unfortunately, trying to limit delimiters to some of the most common ones
using the delimiters parameter just raises an error:

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos
don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """).delimiter
'M'
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos
don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """, delimiters=",\t|^").delimiter
Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File
"/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py",
line 189, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter

On Tue, Aug 4, 2015 at 8:29 AM Skip Montanaro <report@bugs.python.org>
wrote:

>
> Skip Montanaro added the comment:
>
> I should have probably pointed out that the Sniffer class is the unloved
> stepchild of the csv module. In my experience it is rarely necessary. You
> either:
>
> * Are reading CSV files which are about what Excel would produce with its
> default settings
>
> or
>
> * Know just what your format is, and can define the various parameters
> easily
>
> It's pretty rare, I think, to get a delimited file in some format which is
> completely unknown and which thus has to be deduced.
>
> As Peter showed, the Sniffer class is also kind of unreliable. I didn't
> write it, and there are precious few test cases for it. One of your
> datasets should probably be added to the mix and bugs fixed.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
>
msg248007 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-04 22:51
If you look at the algorithm it is doing some fancy things with metrics, but does have a 'preferred delimiters' list that it checks.  It is possible things could be improved either by tweaking the threshold or by somehow giving added weight to the metrics when the candidate character is in the preferred delimiter list.

We might have to do this with a feature flag to turn it on, though, since it could change the results for programs that happen to work with the current algorithm.
msg248093 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-06 01:40
I've run the Sniffer against 1614 csv files on my computer and compared the
delimiter it detects to what I have set manually. Here are the results:

 Sniffer            Human,;\t\(blank)Error:)ceMpGrand TotalError rate,498  2
110  1   5122.7%; 1          10.0%\t3 922 69121  227105412.5%|   33
330.0%space    91   4  1435.7%Grand Total5011922351610221142271614
-Tiago

On Tue, Aug 4, 2015 at 3:51 PM R. David Murray <report@bugs.python.org>
wrote:

>
> R. David Murray added the comment:
>
> If you look at the algorithm it is doing some fancy things with metrics,
> but does have a 'preferred delimiters' list that it checks.  It is possible
> things could be improved either by tweaking the threshold or by somehow
> giving added weight to the metrics when the candidate character is in the
> preferred delimiter list.
>
> We might have to do this with a feature flag to turn it on, though, since
> it could change the results for programs that happen to work with the
> current algorithm.
>
> ----------
> nosy: +r.david.murray
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
>
msg248098 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2015-08-06 03:14
Tiago, sorry, but your last post with results is completely unintelligible. Can you toss the table in a file and attach it instead?
msg248133 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-06 15:54
Table attached.

-Tiago

On Wed, Aug 5, 2015 at 8:14 PM Skip Montanaro <report@bugs.python.org>
wrote:

>
> Skip Montanaro added the comment:
>
> Tiago, sorry, but your last post with results is completely
> unintelligible. Can you toss the table in a file and attach it instead?
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
>
msg248139 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-06 19:22
It seems the HTML file did not come through correctly. Trying a text
version, please view this in a monospace font:

        |   Sniffer
                                    |
Human   |   ,   |   ;   |   \t  |   \   |  space|Except |   :   |   )   |
c   |   e   |   M   |   p   |Total  |   %Error
---------------------------------------------------------------------------------------------------------------------------
    ,   |   498 |       |       |   2   |   1   |   10  |       |       |
1   |       |       |       |   512 |   2.7%
    ;   |       |   1   |       |       |       |       |       |       |
    |       |       |       |   1   |   0.0%
    \t  |   3   |       |   922 |       |   6   |   91  |   2   |   1   |
    |       |   2   |   27  |   1054|   12.5%
    |   |       |       |       |   33  |       |       |       |       |
    |       |       |       |   33  |   0.0%
space   |       |       |       |       |   9   |   1   |       |       |
    |   4   |       |       |   14  |   35.7%
---------------------------------------------------------------------------------------------------------------------------
Total   |   501 |   1   |   922 |   35  |   16  |   102 |   2   |   1   |
1   |   4   |   2   |   27  |   1614

On Thu, Aug 6, 2015 at 8:54 AM Tiago Wright <report@bugs.python.org> wrote:

>
> Tiago Wright added the comment:
>
> Table attached.
>
> -Tiago
>
> On Wed, Aug 5, 2015 at 8:14 PM Skip Montanaro <report@bugs.python.org>
> wrote:
>
> >
> > Skip Montanaro added the comment:
> >
> > Tiago, sorry, but your last post with results is completely
> > unintelligible. Can you toss the table in a file and attach it instead?
> >
> > ----------
> >
> > _______________________________________
> > Python tracker <report@bugs.python.org>
> > <http://bugs.python.org/issue24787>
> > _______________________________________
> >
>
> ----------
> Added file: http://bugs.python.org/file40138/csvsniffertest3.htm
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
msg248141 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-06 19:29
I apologize, it seems the text table got line wrapped. This time as a TXT
attachment.

-Tiago

On Thu, Aug 6, 2015 at 12:22 PM Tiago Wright <report@bugs.python.org> wrote:

>
> Tiago Wright added the comment:
>
>
>
msg248142 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-06 19:29
Your best bet is to attach an ascii text file as an uploaded file.
msg248143 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-06 19:29
Yes, much better :)
msg248162 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-07 00:45
I've run the Sniffer against the same data set, but varied the size of the
sample given to the code. It seems that feeding it more data actually seems
to make the results less accurate. Table attached.
On Thu, Aug 6, 2015 at 12:29 PM R. David Murray <report@bugs.python.org>
wrote:

>
> R. David Murray added the comment:
>
> Yes, much better :)
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue24787>
> _______________________________________
>
msg248233 - (view) Author: Tiago Wright (Tiago Wright) Date: 2015-08-07 22:28
Attached is a .py file with 32 test cases for the Sniff class, 18 that
fail, 14 that pass.

My hope is that these samples can be used to improve the delimiter
detection code.

-Tiago
msg248254 - (view) Author: Peter Otten (peter.otten) * Date: 2015-08-08 07:49
Have you considered writing your own little sniffer? Getting it right for your actual data is usually easier to achieve than a general solution.

The following simplistic sniffer should work with your samples:

def make_dialect(delimiter):
    class Dialect(csv.excel):
        pass
    Dialect.delimiter = delimiter
    return Dialect

def sniff(sample):
    count, delimiter = max(
        ((sample.count(delim), delim) for delim in ",\t|;"),
        key=operator.itemgetter(0))
    if count == 0:
        if " " in sample:
            delimiter = " "
        else:
            raise csv.Error("Could not determine delimiter")
    return make_dialect(delimiter)

Tiago, If you want to follow that path we should take the discussion to the general python mailing list.
History
Date User Action Args
2022-04-11 14:58:19adminsetgithub: 68975
2015-08-08 07:49:08peter.ottensetmessages: + msg248254
2015-08-07 22:28:39Tiago Wrightsetfiles: + testround8.py

messages: + msg248233
2015-08-07 00:45:50Tiago Wrightsetfiles: + csvsniffertest5.txt

messages: + msg248162
2015-08-06 19:29:58r.david.murraysetmessages: + msg248143
2015-08-06 19:29:23r.david.murraysetmessages: + msg248142
2015-08-06 19:29:13Tiago Wrightsetfiles: + csvsniffertest3.txt

messages: + msg248141
2015-08-06 19:22:47Tiago Wrightsetmessages: + msg248139
2015-08-06 15:54:09Tiago Wrightsetfiles: + csvsniffertest3.htm

messages: + msg248133
2015-08-06 03:14:47skip.montanarosetmessages: + msg248098
2015-08-06 01:40:48Tiago Wrightsetmessages: + msg248093
2015-08-04 22:51:58r.david.murraysetnosy: + r.david.murray
messages: + msg248007
2015-08-04 21:49:12Tiago Wrightsetmessages: + msg248005
2015-08-04 15:29:13skip.montanarosetmessages: + msg247990
2015-08-04 07:20:02peter.ottensetnosy: + peter.otten
messages: + msg247986
2015-08-04 01:48:34skip.montanarosetnosy: + skip.montanaro
messages: + msg247973
2015-08-04 00:05:57Tiago Wrightcreate