Message 64633 - Python tracker



Author	skip.montanaro
Recipients	jplaverdure, skip.montanaro, tds333
Date	2008-03-28.12:28:41
SpamBayes Score	0.08532603
Marked as misclassified	No
Message-id	<18412.4165.440393.304270@montanaro-dyndns-org.local>
In-reply-to	<1206650343.02.0.115382198287.issue2078@psf.upfronthosting.co.za>

Content

    Jean-Philippe> You're right, it does seem that using f.read(1024) to
    Jean-Philippe> feed the sniffer works OK in my case and allows me to
    Jean-Philippe> instantiate the DictReader correctly...  Why that is I'm
    Jean-Philippe> not sure though...

It works entirely based on character frequencies.  The more characters you
feed it, the better it should be at guessing the correct delimiter.  In
particular, it pays attention to the frequency of the possible delimiters
per line and assumes the number of columns is the same for each line.
(Well, there's one place where it does use some knowledge of the structure
of a CSV file, so my earlier assertion was incorrect.)  If you only feed it
one line, it can't really use that frequency-per-line information.
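
As a rough sketch of the f.read(1024) approach discussed above: the sample
data, the semicolon delimiter and the use of io.StringIO in place of a real
file are all made up for illustration, not taken from the original report.

```python
import csv
import io

# Hypothetical data: a header row followed by a table of numbers.
SAMPLE_TEXT = (
    "date;open;high;low\n"
    "2008-03-24;12.5;13.1;12.2\n"
    "2008-03-25;12.9;13.4;12.7\n"
    "2008-03-26;13.2;13.6;13.0\n"
)

# io.StringIO stands in for an open text file here.
f = io.StringIO(SAMPLE_TEXT)

# Feed the sniffer a multi-line chunk rather than just the first line,
# so it has per-line delimiter counts to compare.
sample = f.read(1024)
dialect = csv.Sniffer().sniff(sample)

# Rewind so the reader starts at the header row again.
f.seek(0)
rows = list(csv.DictReader(f, dialect=dialect))

print(repr(dialect.delimiter))
print(rows[0])
```

With this sample the semicolon is the only character that shows up the same
number of times on every line, so the sniffer should settle on it.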

    Jean-Philippe> I was submitting the first line as I thought it was the
    Jean-Philippe> right sample to provide the sniffer for it to sniff the
    Jean-Philippe> correct dialect regardless of the file format and file
    Jean-Philippe> content.

That's a good guess, but not quite spot on in this case.  In particular, the
character frequencies in the first line tend to be quite different from those
in the other lines, because it is usually a row of column headers, while the
remainder of the file is usually (though not always ;-) a table of numbers.
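
To make the header-versus-data point concrete, here is a deliberately
simplified illustration.  It is not the Sniffer's actual algorithm; the
consistent_candidates helper, the candidate list and the sample lines are
all hypothetical.

```python
def consistent_candidates(lines, candidates=",;\t |:"):
    """Return the candidate characters that occur the same nonzero
    number of times on every line (a toy consistency check)."""
    result = []
    for char in candidates:
        counts = {line.count(char) for line in lines}
        # Exactly one distinct count, and that count is not zero.
        if len(counts) == 1 and 0 not in counts:
            result.append(char)
    return result


# Hypothetical sample: column headers containing spaces, then numbers.
lines = [
    "unit price,total cost",
    "1.50,3.00",
    "2.25,4.50",
]

header_only = consistent_candidates(lines[:1])
whole_sample = consistent_candidates(lines)

print(header_only)
print(whole_sample)
```

Given only the header, both the comma and the space look like plausible
delimiters.  Once the numeric rows are included, only the comma keeps the
same count on every line.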

Skip

History
Date	User	Action	Args
2008-03-28 12:28:47	skip.montanaro	set	spambayes_score: 0.085326 -> 0.08532603 recipients: + skip.montanaro, tds333, jplaverdure
2008-03-28 12:28:45	skip.montanaro	link	issue2078 messages
2008-03-28 12:28:43	skip.montanaro	create