Issue 2078: CSV Sniffer does not function properly on single column .csv files


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46350

classification

Title:	CSV Sniffer does not function properly on single column .csv files
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.0, Python 2.4, Python 2.6, Python 2.5

process

Status:	closed	Resolution:	postponed
Dependencies:		Superseder:
Assigned To:	skip.montanaro	Nosy List:	amaury.forgeotdarc, jplaverdure, skip.montanaro, tds333
Priority:	low	Keywords:	patch

Created on 2008-02-12 16:01 by jplaverdure, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
listB2Mforblast.csv	jplaverdure, 2008-02-12 16:01	Single column .csv file
csv.diff	skip.montanaro, 2008-04-13 03:25	patch making comma the preferred delimiter for single column files

Messages (12)
msg62319 - (view)	Author: Jean-Philippe Laverdure (jplaverdure)	Date: 2008-02-12 16:01
When attempting to sniff() the dialect for the attached .csv file, csv.Sniffer.sniff() returns an unusable dialect:

>>> import csv
>>> file = open('listB2Mforblast.csv', 'r')
>>> dialect = csv.Sniffer().sniff(file.readline())
>>> file.seek(0)
>>> file.readline()
>>> file.seek(0)
>>> reader = csv.DictReader(file, dialect)
>>> reader.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/soft/bioinfo/linux/python-2.5/lib/python2.5/csv.py", line 93, in next
    d = dict(zip(self.fieldnames, row))
TypeError: zip argument #1 must support iteration

However, this works fine:

>>> file.seek(0)
>>> reader = csv.DictReader(file)
>>> reader.next()
{'Sequence': 'AALENTHLL'}

If I use a 2 column file, sniff() works perfectly. It only seems to have a problem with single column .csv files (which are still .csv files in my opinion).

Thanks for looking into this.
msg63788 - (view)	Author: Wolfgang Langner (tds333) *	Date: 2008-03-17 21:49
The sniffer returns a dialect that is not really correct, because the delimiter is set to a value and in this case there is no delimiter. Think of it as returning a random delimiter when there is not really one.

But your usage of the DictReader is wrong. It has to be called with csv.DictReader(file, dialect=dialect), and then it works in this example.

This could be closed.
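To illustrate the point about the DictReader call: the second positional parameter of csv.DictReader is fieldnames, so a dialect passed positionally is taken as the list of column names. The following is a minimal Python 3 sketch, not code from this report; the in-memory data and the column name "col" are made up for illustration, and csv.excel stands in for a sniffed dialect.

```python
import csv
import io

# Made-up single column data, standing in for the attached file.
DATA = "Sequence\nAALENTHLL\n"

# Second positional argument is `fieldnames`, not the dialect.
# Here the header line is therefore read as an ordinary data row.
positional = csv.DictReader(io.StringIO(DATA), ["col"])
positional_rows = list(positional)

# Passing the dialect by keyword leaves `fieldnames` unset, so the
# first line of the file supplies the column names.
by_keyword = csv.DictReader(io.StringIO(DATA), dialect=csv.excel)
keyword_rows = list(by_keyword)

print(positional_rows)
print(keyword_rows)
```

The report above used Python 2.5, where the same mix-up surfaced as the TypeError in the traceback.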
msg63910 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-03-18 12:35
What do you think the delimiter should be for this csv file?

    43.4e12
    147483648
    47483648

What about this one?

    abcdef
    bcdefg
    cdefgh

And this?

    abc8def
    bcd8efg
    cde8fgh

If I force the sniffer to not allow digits or letters as delimiters I can get the sniffer to return comma as the delimiter in all three cases. I'm not certain that's correct in the third case though.
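For anyone who wants to try the three samples above, here is a small Python 3 sketch. It assumes each value sits on its own line, and the helper name sniff_delimiter and the restricted set ",\t;" are illustrative choices, not part of the csv module or of the attached patch. No expected output is shown because the unrestricted guesses depend on the Sniffer implementation in use.

```python
import csv

# The three samples from the message above, one value per line (assumed layout).
SAMPLES = {
    "numbers": "43.4e12\n147483648\n47483648\n",
    "letters": "abcdef\nbcdefg\ncdefgh\n",
    "mixed": "abc8def\nbcd8efg\ncde8fgh\n",
}


def sniff_delimiter(sample, delimiters=None):
    """Return the sniffed delimiter, or None if the Sniffer gives up."""
    try:
        return csv.Sniffer().sniff(sample, delimiters).delimiter
    except csv.Error:
        return None


# For each sample: (guess with any character allowed, guess from ",\t;" only).
results = {
    name: (sniff_delimiter(text), sniff_delimiter(text, ",\t;"))
    for name, text in SAMPLES.items()
}

for name, (unrestricted, restricted) in results.items():
    print(name, repr(unrestricted), repr(restricted))
```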
msg64059 - (view)	Author: Wolfgang Langner (tds333) *	Date: 2008-03-19 14:52
In these cases it is not really possible to sniff the right delimiter. To not allow digits or letters is not a good solution. I think the behavior as it is now is OK, and at this time I see no way to improve it.
msg64062 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-03-19 15:33
Wolfgang> In these cases it is not really possible to sniff the right
Wolfgang> delimiter. To not allow digits or letters is not a good
Wolfgang> solution. I think the behavior as it is now is OK, and at this
Wolfgang> time I see no way to improve it.

I mostly agree. I'm waiting for the original submitter to chime in though.

Skip
msg64595 - (view)	Author: Jean-Philippe Laverdure (jplaverdure)	Date: 2008-03-27 15:19
Hello and sorry for the late reply.

Wolfgang: sorry about my misuse of the csv.DictReader constructor, that was a mistake on my part. However, it still is not functioning as I think it should/could. Look at this:

Using this content:

    Sequence
    AAGINRDSL
    AAIANHQVL

and this piece of code:

    f = open(sys.argv[-1], 'r')
    dialect = csv.Sniffer().sniff(f.readline())
    f.seek(0)
    reader = csv.DictReader(f, dialect=dialect)
    for line in reader:
        print line

I get this result:

    {'Sequen': 'AAGINRDSL', 'e': None}
    {'Sequen': 'AAIANHQVL', 'e': None}

When I really should be getting this:

    {'Sequence': 'AAGINRDSL'}
    {'Sequence': 'AAIANHQVL'}

The fact is this code is in use in an application where users can submit a .csv file produced by Excel for treatment. The file must contain a "Sequence" column since that is what the treatment is run on. Now I had to make the following changes to my code to account for the fact that some users submit a single column file (since only the "Sequence" column is required for treatment):

    f = open(sys.argv[-1], 'r')
    try:
        dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
        f.seek(0)
        reader = csv.DictReader(f, dialect=dialect)
    except:
        print '>>>caught csv sniff() exception'
        f.seek(0)
        reader = csv.DictReader(f)
    for line in reader:
        Do what I need to do

Which really feels like a patched use of a buggy implementation of the Sniffer class.

I understand the issues raised by Skip in regards to figuring out a delimiter at all costs... But really, the Sniffer class should work appropriately when a single column .csv file is submitted.
msg64597 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-03-27 16:07
Jean-Philippe> The fact is this code is in use in an application where
Jean-Philippe> users can submit a .csv file produced by Excel for
Jean-Philippe> treatment. The file must contain a "Sequence" column
Jean-Philippe> since that is what the treatment is run on. Now I had to
Jean-Philippe> make the following changes to my code to account for the
Jean-Philippe> fact that some users submit a single column file (since
Jean-Philippe> only the "Sequence" column is required for treatment):

Jean-Philippe> f = open(sys.argv[-1], 'r')
Jean-Philippe> try:
Jean-Philippe>     dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
Jean-Philippe>     f.seek(0)
Jean-Philippe>     reader = csv.DictReader(f, dialect=dialect)
Jean-Philippe> except:
Jean-Philippe>     print '>>>caught csv sniff() exception'
Jean-Philippe>     f.seek(0)
Jean-Philippe>     reader = csv.DictReader(f)
Jean-Philippe> for line in reader:
Jean-Philippe>     Do what I need to do

What exceptions are you catching? Why are you only giving it a single line of input as a sample? What happens if you instead use f.read(1024) as the sample?

When there is only a single column in the file and you give it a delimiter set which doesn't include any characters in the file it (I think correctly) raises an exception to tell you that it couldn't determine the delimiter:

>>> import csv
>>> f = open("listB2Mforblast.csv")
>>> dialect = csv.Sniffer().sniff(f.read(1024))
>>> dialect.delimiter
'"'
>>> f.seek(0)
>>> dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/skip/local/lib/python2.6/csv.py", line 161, in sniff
    raise Error, "Could not determine delimiter"
_csv.Error: Could not determine delimiter

In that case, use csv.excel as the dialect. It doesn't matter what you use as the delimiter if it doesn't occur in the file, and if it can't figure out the delimiter it's also not going to guess the quotechar.

>>> try:
...     dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
... except csv.Error:
...     dialect = csv.excel
...

I personally don't much like the sniffer. It doesn't use any knowledge of the structure of a CSV file to guess the delimiter and quotechar (and those are the only two parameters it does guess). I would prefer if it just went away, but folks use it so it's likely to remain in its current form for the foreseeable future.

Skip
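The fallback suggested above can be wrapped in one function. This is a Python 3 sketch under stated assumptions: the helper name open_dict_reader, the 1024-character sample size, and the delimiter set ",\t;" are illustrative choices, and the data is held in memory rather than read from the attached file.

```python
import csv
import io


def open_dict_reader(text, delimiters=",\t;"):
    """Hypothetical helper: sniff a dialect, fall back to csv.excel.

    Only the first 1024 characters are used as the sample. If the
    Sniffer cannot settle on one of the allowed delimiters it raises
    csv.Error, and the default Excel dialect is used instead.
    """
    try:
        dialect = csv.Sniffer().sniff(text[:1024], delimiters)
    except csv.Error:
        dialect = csv.excel
    return csv.DictReader(io.StringIO(text), dialect=dialect)


# Made-up single column content in the shape described in this issue.
single_column = "Sequence\nAAGINRDSL\nAAIANHQVL\n"
rows = list(open_dict_reader(single_column))
print(rows)
```

Catching csv.Error specifically, rather than using a bare except, keeps unrelated failures such as a missing file visible.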
msg64606 - (view)	Author: Jean-Philippe Laverdure (jplaverdure)	Date: 2008-03-27 20:39
Hi Skip,

You're right, it does seem that using f.read(1024) to feed the sniffer works OK in my case and allows me to instantiate the DictReader correctly... Why that is I'm not sure though...

I was submitting the first line as I thought it was the right sample to provide the sniffer for it to sniff the correct dialect regardless of the file format and file content.

And yes, 'except csv.Error' is certainly a better way to trap my desired exception... I guess I'm a bit of a n00b using Python.

Thanks for the help. Python really has a great community!
msg64633 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-03-28 12:28
Jean-Philippe> You're right, it does seem that using f.read(1024) to
Jean-Philippe> feed the sniffer works OK in my case and allows me to
Jean-Philippe> instantiate the DictReader correctly... Why that is I'm
Jean-Philippe> not sure though...

It works entirely based on character frequencies. The more characters you feed it the better it should be at guessing the correct delimiter. In particular, it pays attention to the frequency of the possible delimiters per line and assumes the number of columns is the same for each line. (Well, there's one place where it does use some knowledge of the structure of a csv file, so my earlier assertion was incorrect.) If you only feed it one line it can't really use that frequency-per-line information.

Jean-Philippe> I was submitting the first line as I thought it was the
Jean-Philippe> right sample to provide the sniffer for it to sniff the
Jean-Philippe> correct dialect regardless of the file format and file
Jean-Philippe> content.

That's a good guess, but not quite spot on in this case. In particular, the character frequencies in the first line tend to be much different than the other lines because it is usually a row of column headers, while the remainder of the file (though not always ;-) is a table of numbers.

Skip
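The per-line frequency idea can be shown with a toy version. This is a deliberately simplified Python 3 sketch and not the algorithm in csv.py: the function name consistent_chars and its default candidate set are invented here. For each candidate character it finds the most common per-line count and reports the share of lines that have that count.

```python
from collections import Counter


def consistent_chars(sample, candidates=",;\t |"):
    """Toy heuristic: share of lines agreeing on each character's count.

    A character is reported only if its most common per-line count is
    greater than zero. This is a simplification for illustration.
    """
    lines = [line for line in sample.split("\n") if line]
    if not lines:
        return {}
    scores = {}
    for char in candidates:
        per_line = Counter(line.count(char) for line in lines)
        count, n_lines = per_line.most_common(1)[0]
        if count > 0:
            scores[char] = n_lines / len(lines)
    return scores


# Several lines: only the real delimiter shows up consistently.
print(consistent_chars("a,b,c\n1,2,3\n4,5,6\n"))

# A single header line: every character in it looks perfectly
# consistent, so nothing distinguishes a delimiter from a letter.
print(consistent_chars("Sequence", candidates="Sequnc"))
```

With a one-line sample there is only one count per character, which is why a letter such as "c" can be picked as the delimiter, as in the {'Sequen': ..., 'e': None} output reported earlier.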
msg64696 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-03-29 14:01
> It works entirely based on character frequencies.

Does it make sense to restrict delimiters to a reasonable set of characters? Usual punctuation, spaces, tabs... what else?
msg64701 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-03-29 15:15
>> It works entirely based on character frequencies.

Amaury> Does it make sense to restrict delimiters to a reasonable set of
Amaury> characters? Usual punctuation, spaces, tabs... what else?

There is an optional delimiters argument to the sniff() method which defaults to None. I would be happier if it was "the usual suspects" (NeoOffice refuses to guess, but offers TAB, space, semicolon and comma as the default separators when importing a CSV file; Excel seems to just figure it out). That would change the behavior though. With no delimiter set it's generally going to find something, just pick incorrectly. With a delimiter set that doesn't occur in the file it's going to raise an exception. I'm not sure this would be a good tradeoff, and it would just break existing code.

Skip
msg65431 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2008-04-13 03:25
I can't see a great reason to change the behavior. I've attached my current patch for csv.py and test_csv.py in case someone else wants to pick it up later.

History
Date	User	Action	Args
2022-04-11 14:56:30	admin	set	github: 46350
2008-04-13 03:25:50	skip.montanaro	set	status: open -> closed files: + csv.diff messages: + msg65431 priority: low keywords: + patch resolution: postponed
2008-03-29 15:15:27	skip.montanaro	set	messages: + msg64701
2008-03-29 14:01:21	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg64696
2008-03-28 12:28:45	skip.montanaro	set	messages: + msg64633
2008-03-27 20:39:02	jplaverdure	set	messages: + msg64606
2008-03-27 16:07:26	skip.montanaro	set	messages: + msg64597
2008-03-27 15:19:38	jplaverdure	set	messages: + msg64595
2008-03-19 15:33:19	skip.montanaro	set	messages: + msg64062
2008-03-19 14:52:37	tds333	set	messages: + msg64059
2008-03-18 12:35:11	skip.montanaro	set	assignee: skip.montanaro messages: + msg63910 nosy: + skip.montanaro
2008-03-17 21:49:23	tds333	set	nosy: + tds333 messages: + msg63788 versions: + Python 2.6, Python 3.0
2008-02-12 18:14:54	jplaverdure	set	components: + Library (Lib), - Extension Modules versions: + Python 2.4
2008-02-12 16:01:11	jplaverdure	create