Title: CSV has_headers heuristic could be improved
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: andrei.avk, ejacq, lukasz.langa, miss-islington, rhettinger, skip.montanaro
Priority: normal Keywords: patch

Created on 2021-03-25 18:18 by ejacq, last changed 2021-07-30 17:30 by lukasz.langa. This issue is now closed.

File name Uploaded Description Edit
csv.diff skip.montanaro, 2021-03-25 21:03
hasheader.diff skip.montanaro, 2021-06-29 12:18
2021-06-29-07-27-08.bpo-43625.ZlAxhp.rst skip.montanaro, 2021-06-29 12:28
Pull Requests
URL Status Linked Edit
PR 26939 merged andrei.avk, 2021-06-28 20:51
PR 27494 merged miss-islington, 2021-07-30 17:10
Messages (13)
msg389515 - (view) Author: (ejacq) Date: 2021-03-25 18:18
Here is an sample of CSV input:


when calling has_header() from on this sample, it returns false.
Why? because 0 and 0.5 don't belong to the same type and thus the column is discarded by the heuristic.

I think the heuristic will better work if rather than just comparing number types, it would also consider casting the values in this order int -> float -> complex. If the values are similar then consider this upgraded type as the type of the column.

In the end, this file would be considered float columns with headers.
msg389528 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-03-25 21:03
I assume the OP is referring to this sort of usage:

>>> sniffer = csv.Sniffer()
>>> raw = open("mixed.csv").read()
>>> sniffer.has_header(raw)


I really wish the Sniffer class had never been added to the CSV module. I can't recall who wrote it (the author is long gone). Though I am responsible for the initial commits, it wasn't me or the main authors of csvmodule.c. As far as I know, it never really worked well. I can't recall ever using it.

A simpler heuristic would be if the first row contains a bunch of strings and the second row contains a bunch of numbers, then the file has a header. That assumes that CSV files consist mostly of numeric data.

Looking at has_header, I see this:

    for thisType in [int, float, complex]:

I think this particular problem would be solved if the order of those types were reversed. The attached diff suggests that as well. Note that the Sniffer class currently contains no test cases, so that the test I added failed before the change and passes after doesn't mean it doesn't break someone's mission critical Sniffer usage.

(Sorry, Raymond. My Github-foo is insufficient to allow me to fork, apply the diff and create a PR.)
msg396663 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-28 20:54
I've added the PR here, based on Skip's patch:
msg396699 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-29 03:10
Skip: If I understand right, in the patch the last two types -- float and int, will never have an effect because if float(x) and int(x) succeed, so will complex(x), and conversely, if complex(x) fails, float and int will also fail.

So the effect of the patch will be to tolerate any mix of numeric columns when the headers are textual. Which sounds fine to me, just want to confirm that sounds good to you, because the unit test in the patch is a much narrower case.

I think the test should then be something like:

a    b    c
1   1.1   5+0j
1.2 3+0j  4 

and the code should be updated to just do `complex(x)` and remove float() and int() attempts.

Another, more strict option, - would be to special case `0` to match int, float, and complex.
msg396714 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-29 11:50
Thanks @andrei.avk. You are right, only the complex test is required.

I suppose it's okay to commit this, but reviewing the full code of the has_header method leaves me thinking this is just putting lipstick on a pig. If I read the code correctly, there are two (undocumented) tacit assumptions:

1. The non-header values are numeric.

2. If not numeric, they are fixed length strings whose length generally differs from the length of the putative header.

The second criterion means this has a header:


but this doesn't:


It seems to me that it would be a good idea to at least expand on the documentation of that method and maybe add at least one test case where the CSV sample doesn't have a header. I'll try to get that done and attach a patch.
msg396716 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-29 11:59
I retract my comment about fixed length strings in the non-numeric case. There are clearly test cases (which I probably wrote, considering the values) where the sample as a header but the values are of varying length. Misread of the code on my part. I have obviously not had enough coffee yet this morning.
msg396720 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-29 12:18
Here is a change to the has_header documentation and an extra test case documenting the behavior when the sample contains strings. I'm not sure about the wording of the doc change, perhaps you can tweak it? Seems kind of clumsy to me. If it seems okay to you @andrei.avk, can you fold it into your PR?
msg396722 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-29 12:28
Here's a NEWS entry.
msg396733 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-29 14:26
Skip: I've updated the PR

Expanded the docs a bit, I think they give a good rough idea of the logic but I feel like they can probably be improved more, let me know what you think! I spent some time thinking about it but found it tricky! :)
msg398575 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-07-30 17:10
New changeset ceea579ccc51791f3e115155d6f27905bc7544a9 by andrei kulakov in branch 'main':
bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939)
msg398581 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-07-30 17:21
Since this is a heuristic change, I'm thinking we shouldn't backport this to 3.9. This way it will be easier for application authors to notice the change and control it within their apps.
msg398582 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-07-30 17:23
Łukasz: I agree.
msg398586 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-07-30 17:30
New changeset 440c9f772a9b66c1ea387c1c3efc9ff438880acf by Miss Islington (bot) in branch '3.10':
bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939) (GH-27494)
Date User Action Args
2021-07-30 17:30:38lukasz.langasetstatus: open -> closed
type: behavior
resolution: fixed
stage: patch review -> resolved
2021-07-30 17:30:16lukasz.langasetmessages: + msg398586
2021-07-30 17:23:01andrei.avksetmessages: + msg398582
2021-07-30 17:21:56lukasz.langasetmessages: + msg398581
versions: + Python 3.10
2021-07-30 17:10:47miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request26010
2021-07-30 17:10:45lukasz.langasetnosy: + lukasz.langa
messages: + msg398575
2021-06-29 14:26:09andrei.avksetmessages: + msg396733
2021-06-29 12:28:18skip.montanarosetfiles: + 2021-06-29-07-27-08.bpo-43625.ZlAxhp.rst

messages: + msg396722
2021-06-29 12:18:31skip.montanarosetfiles: + hasheader.diff

messages: + msg396720
2021-06-29 11:59:09skip.montanarosetmessages: + msg396716
2021-06-29 11:50:13skip.montanarosetmessages: + msg396714
versions: + Python 3.11, - Python 3.7
2021-06-29 03:10:20andrei.avksetmessages: + msg396699
2021-06-28 20:54:30andrei.avksetmessages: + msg396663
2021-06-28 20:51:34andrei.avksetnosy: + andrei.avk

pull_requests: + pull_request25507
stage: patch review
2021-03-25 21:03:41skip.montanarosetfiles: + csv.diff
keywords: + patch
messages: + msg389528
2021-03-25 20:27:44rhettingersetnosy: + rhettinger
2021-03-25 20:27:32rhettingersetnosy: + skip.montanaro
2021-03-25 18:18:35ejacqcreate