
classification
Title: CSV Sniffer does not function properly on single column .csv files
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.4, Python 2.6, Python 2.5
process
Status: closed Resolution: postponed
Dependencies: Superseder:
Assigned To: skip.montanaro Nosy List: amaury.forgeotdarc, jplaverdure, skip.montanaro, tds333
Priority: low Keywords: patch

Created on 2008-02-12 16:01 by jplaverdure, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
listB2Mforblast.csv jplaverdure, 2008-02-12 16:01 Single column .csv file
csv.diff skip.montanaro, 2008-04-13 03:25 patch making comma the preferred delimiter for single column files
Messages (12)
msg62319 - (view) Author: Jean-Philippe Laverdure (jplaverdure) Date: 2008-02-12 16:01
When attempting to sniff() the dialect for the attached .csv file,
csv.Sniffer.sniff() returns an unusable dialect:

>>> import csv
>>> file = open('listB2Mforblast.csv', 'r')
>>> dialect = csv.Sniffer().sniff(file.readline())
>>> file.seek(0)
>>> file.readline()
>>> file.seek(0)
>>> reader = csv.DictReader(file, dialect)
>>> reader.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/soft/bioinfo/linux/python-2.5/lib/python2.5/csv.py", line 93,
in next
    d = dict(zip(self.fieldnames, row))
TypeError: zip argument #1 must support iteration

However, this works fine:
>>> file.seek(0)
>>> reader = csv.DictReader(file)
>>> reader.next()
{'Sequence': 'AALENTHLL'}

If I use a 2 column file, sniff() works perfectly.
It only seems to have a problem with single column .csv files (which are
still .csv files in my opinion)

Thanks for looking into this.
msg63788 - (view) Author: Wolfgang Langner (tds333) * Date: 2008-03-17 21:49
The sniffer returns a dialect that is not really correct, because the
delimiter is set to a value even though in this case there is no delimiter.
Think of it as returning an arbitrary delimiter when there isn't really one.

But your usage of the DictReader is wrong. It has to be called as
csv.DictReader(file, dialect=dialect), and then it works in this example.

This could be closed.
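
For illustration, a minimal Python 3 sketch of why the keyword form matters: the second positional parameter of csv.DictReader is fieldnames, so a dialect passed positionally lands in the wrong slot. The in-memory sample below is a made-up stand-in for the attached file, and csv.excel stands in for whatever dialect is actually in use.

```python
import csv
import inspect
import io

# Parameter order of DictReader: the second positional slot is fieldnames.
params = list(inspect.signature(csv.DictReader).parameters)
print(params[:2])

# Hypothetical in-memory stand-in for a single-column file.
sample = io.StringIO("Sequence\nAALENTHLL\n")

# Passing the dialect by keyword keeps it out of the fieldnames slot.
reader = csv.DictReader(sample, dialect=csv.excel)
first_row = next(reader)
print(first_row)
```

This only shows the calling convention; it says nothing about whether the sniffed dialect itself is sensible.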
msg63910 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-03-18 12:35
What do you think the delimiter should be for this csv file?

43.4e12
147483648
47483648

What about this one?

abcdef
bcdefg
cdefgh

And this?

abc8def
bcd8efg
cde8fgh

If I force the sniffer to not allow digits or letters as
delimiters I can get the sniffer to return comma as the
delimiter in all three cases.  I'm not certain that's
correct in the third case though.
msg64059 - (view) Author: Wolfgang Langner (tds333) * Date: 2008-03-19 14:52
In these cases it is not really possible to sniff the right delimiter.
Disallowing digits or letters is not a good solution.
I think the current behavior is OK, and at this time I see no way to
improve it.
msg64062 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-03-19 15:33
Wolfgang> In these cases it is not really possible to sniff the right
    Wolfgang> delimiter.  To not allow digits or letters is not a good
    Wolfgang> solution.  I think the behavior as now is ok, and at this time
    Wolfgang> I see no way to improve it.

I mostly agree.  I'm waiting for the original submitter to chime in though.

Skip
msg64595 - (view) Author: Jean-Philippe Laverdure (jplaverdure) Date: 2008-03-27 15:19
Hello and sorry for the late reply.

Wolfgang: sorry about my misuse of the csv.DictReader constructor, that
was a mistake on my part. However, it still is not functioning as I
think it should/could.  Look at this:

Using this content:
Sequence
AAGINRDSL
AAIANHQVL

and this piece of code:
f = open(sys.argv[-1], 'r')
dialect = csv.Sniffer().sniff(f.readline())
f.seek(0)
reader = csv.DictReader(f, dialect=dialect)
for line in reader:
    print line

I get this result:
{'Sequen': 'AAGINRDSL', 'e': None}
{'Sequen': 'AAIANHQVL', 'e': None}

When I really should be getting this:
{'Sequence': 'AAGINRDSL'}
{'Sequence': 'AAIANHQVL'}

The fact is this code is in use in an application where users can submit
a .csv file produced by Excel for treatment.  The file must contain a
"Sequence" column since that is what the treatment is run on. Now I had
to make the following changes to my code to account for the fact that
some users submit a single column file (since only the "Sequence" column
is required for treatment):

f = open(sys.argv[-1], 'r')
try:
    dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
    f.seek(0)
    reader = csv.DictReader(f, dialect=dialect)
except:
    print '>>>caught csv sniff() exception'
    f.seek(0)
    reader = csv.DictReader(f)
for line in reader:
    pass  # do what I need to do

Which really feels like a patched use of a buggy implementation of the
Sniffer class.

I understand the issues raised by Skip in regards to figuring out a
delimiter at all costs...  But really, the Sniffer class should work
appropriately when a single column .csv file is submitted.
msg64597 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-03-27 16:07
Jean-Philippe> The fact is this code is in use in an application where
    Jean-Philippe> users can submit a .csv file produced by Excel for
    Jean-Philippe> treatment.  The file must contain a "Sequence" column
    Jean-Philippe> since that is what the treatment is run on. Now I had to
    Jean-Philippe> make the following changes to my code to account for the
    Jean-Philippe> fact that some users submit a single column file (since
    Jean-Philippe> only the "Sequence" column is required for treatment):

    Jean-Philippe> f = open(sys.argv[-1], 'r')
    Jean-Philippe> try:
    Jean-Philippe>     dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
    Jean-Philippe>     f.seek(0)
    Jean-Philippe>     reader = csv.DictReader(f, dialect=dialect)
    Jean-Philippe> except:
    Jean-Philippe>     print '>>>caught csv sniff() exception'
    Jean-Philippe>     f.seek(0)
    Jean-Philippe>     reader = csv.DictReader(f)
    Jean-Philippe> for line in reader:
    Jean-Philippe>     pass  # do what I need to do

What exceptions are you catching?  Why are you only giving it a single line
of input as a sample?  What happens if you instead use f.read(1024) as the
sample?  When there is only a single column in the file and you give it a
delimiter set which doesn't include any characters in the file it (I think
correctly) raises an exception to tell you that it couldn't determine the
delimiter:

    >>> import csv
    >>> f = open("listB2Mforblast.csv")
    >>> dialect = csv.Sniffer().sniff(f.read(1024))
    >>> dialect.delimiter
    '"'
    >>> f.seek(0)
    >>> dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/skip/local/lib/python2.6/csv.py", line 161, in sniff
        raise Error, "Could not determine delimiter"
    _csv.Error: Could not determine delimiter

In that case, use csv.excel as the dialect.  It doesn't matter what you use
as the delimiter if it doesn't occur in the file, and if it can't figure out
the delimiter it's also not going to guess the quotechar.

    >>> try:
    ...     dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
    ... except csv.Error:
    ...     dialect = csv.excel
    ... 
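
A Python 3 sketch of the same fallback, for readers on a current interpreter. The helper name sniff_or_excel and the in-memory sample are assumptions made for illustration; only csv.Sniffer().sniff(), csv.Error and csv.excel come from the csv module.

```python
import csv
import io


def sniff_or_excel(sample, delimiters=",\t :;"):
    """Return a sniffed dialect, or csv.excel if none can be determined."""
    try:
        return csv.Sniffer().sniff(sample, delimiters)
    except csv.Error:
        # None of the allowed delimiters fits the sample, so fall back.
        return csv.excel


# Hypothetical single-column content, mirroring the example in this thread.
single_column = "Sequence\nAAGINRDSL\nAAIANHQVL\n"

dialect = sniff_or_excel(single_column)
rows = list(csv.DictReader(io.StringIO(single_column), dialect=dialect))
print(rows)
```

With a real file, the sample would typically be f.read(1024) followed by f.seek(0) before building the reader.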

I personally don't much like the sniffer.  It doesn't use any knowledge of
the structure of a CSV file to guess the delimiter and quotechar (and those
are the only two parameters it does guess).  I would prefer if it just went
away, but folks use it so it's likely to remain in its current form for the
foreseeable future.

Skip
msg64606 - (view) Author: Jean-Philippe Laverdure (jplaverdure) Date: 2008-03-27 20:39
Hi Skip,

You're right, it does seem that using f.read(1024) to feed the sniffer
works OK in my case and allows me to instantiate the DictReader
correctly...  Why that is I'm not sure though...

I was submitting the first line as I thought it was the right sample to
provide the sniffer for it to sniff the correct dialect regardless of
the file format and file content.

And yes, 'except csv.Error' is certainly a better way to trap my desired
exception... I guess I'm a bit of a n00b using Python.

Thanks for the help.
Python really has a great community!
msg64633 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-03-28 12:28
Jean-Philippe> You're right, it does seem that using f.read(1024) to
    Jean-Philippe> feed the sniffer works OK in my case and allows me to
    Jean-Philippe> instantiate the DictReader correctly...  Why that is I'm
    Jean-Philippe> not sure though...

It works entirely based on character frequencies.  The more characters you
feed it the better it should be at guessing the correct delimiter.  In
particular, it pays attention to the frequency of the possible delimiters
per line and assumes the number of columns is the same for each line.
(Well, there's one place where it does use some knowledge of the structure
of a csv file, so my earlier assertion was incorrect.)  If you only feed it
one line it can't really use that frequency-per-line information.
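
To make the per-line idea concrete, here is a deliberately simplified Python 3 sketch. It is not the csv.Sniffer implementation; the function name consistent_candidates and its behaviour are assumptions chosen only to illustrate why a single line gives the heuristic nothing to work with.

```python
def consistent_candidates(sample, candidates=None):
    """Return {char: count} for characters that occur the same nonzero
    number of times on every non-empty line of the sample.

    Simplified illustration only; the real sniffer is more elaborate.
    """
    lines = [line for line in sample.split("\n") if line]
    if candidates is None:
        # Consider every character that appears in the sample.
        candidates = sorted(set(sample) - {"\n"})
    result = {}
    for char in candidates:
        counts = {line.count(char) for line in lines}
        if len(counts) == 1:
            (count,) = counts
            if count > 0:
                result[char] = count
    return result


# Two rows, two columns: only the comma is consistent across lines.
print(consistent_candidates("a,b\nc,d\n"))

# A single header line: every character is trivially "consistent".
print(consistent_candidates("Sequence\n"))

# Three single-column lines: nothing is consistent across all of them.
print(consistent_candidates("Sequence\nAAGINRDSL\nAAIANHQVL\n"))
```

With one line of input every character in it looks like a plausible delimiter, which matches the odd splits reported earlier in this thread.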

    Jean-Philippe> I was submitting the first line as I thought it was the
    Jean-Philippe> right sample to provide the sniffer for it to sniff the
    Jean-Philippe> correct dialect regardless of the file format and file
    Jean-Philippe> content.

That's a good guess, but not quite spot on in this case.  In particular, the
character frequencies in the first line tend to be much different than the
other lines because it is usually a row of column headers, while the remainder
of the file (though not always ;-) is a table of numbers.

Skip
msg64696 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-29 14:01
> It works entirely based on character frequencies.

Does it make sense to restrict delimiters to a reasonable set of
characters? The usual punctuation, spaces, tabs... what else?
msg64701 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-03-29 15:15
>> It works entirely based on character frequencies.

    Amaury> Does it make sense to restrict delimiters to a reasonable set of
    Amaury> characters? Usual punctuations, spaces, tabs... what else?

There is an optional delimiters argument to the sniff() method which
defaults to None.  I would be happier if it was "the usual suspects"
(NeoOffice refuses to guess, but offers TAB, space, semicolon and comma as
the default separators when importing a CSV file - Excel seems to just
figure it out).  That would change the behavior though.  With no delimiter
set it's generally going to find something, just pick incorrectly.  With a
non-existent delimiter set it's going to raise an exception.  I'm not sure
this would be a good tradeoff and would just break existing code.

Skip
msg65431 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2008-04-13 03:25
I can't see a great reason to change the behavior.  I've attached my
current patch for csv.py and test_csv.py in case someone else wants
to pick it up later.
History
Date User Action Args
2022-04-11 14:56:30 admin set github: 46350
2008-04-13 03:25:50 skip.montanaro set status: open -> closed
files: + csv.diff
messages: + msg65431
priority: low
keywords: + patch
resolution: postponed
2008-03-29 15:15:27 skip.montanaro set messages: + msg64701
2008-03-29 14:01:21 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg64696
2008-03-28 12:28:45 skip.montanaro set messages: + msg64633
2008-03-27 20:39:02 jplaverdure set messages: + msg64606
2008-03-27 16:07:26 skip.montanaro set messages: + msg64597
2008-03-27 15:19:38 jplaverdure set messages: + msg64595
2008-03-19 15:33:19 skip.montanaro set messages: + msg64062
2008-03-19 14:52:37 tds333 set messages: + msg64059
2008-03-18 12:35:11 skip.montanaro set assignee: skip.montanaro
messages: + msg63910
nosy: + skip.montanaro
2008-03-17 21:49:23 tds333 set nosy: + tds333
messages: + msg63788
versions: + Python 2.6, Python 3.0
2008-02-12 18:14:54 jplaverdure set components: + Library (Lib), - Extension Modules
versions: + Python 2.4
2008-02-12 16:01:11 jplaverdure create