Issue 43625: CSV has_headers heuristic could be improved

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87791

classification

Title:	CSV has_headers heuristic could be improved
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	andrei.avk, ejacq, lukasz.langa, miss-islington, rhettinger, skip.montanaro
Priority:	normal	Keywords:	patch

Created on 2021-03-25 18:18 by ejacq, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
csv.diff	skip.montanaro, 2021-03-25 21:03
hasheader.diff	skip.montanaro, 2021-06-29 12:18
2021-06-29-07-27-08.bpo-43625.ZlAxhp.rst	skip.montanaro, 2021-06-29 12:28

Pull Requests
URL	Status	Linked	Edit
PR 26939	merged	andrei.avk, 2021-06-28 20:51
PR 27494	merged	miss-islington, 2021-07-30 17:10

Messages (13)
msg389515 - (view)	Author: (ejacq)	Date: 2021-03-25 18:18
Here is an sample of CSV input: "time","forces" 0,0 0.5,0.9 when calling has_header() from csv.py on this sample, it returns false. Why? because 0 and 0.5 don't belong to the same type and thus the column is discarded by the heuristic. I think the heuristic will better work if rather than just comparing number types, it would also consider casting the values in this order int -> float -> complex. If the values are similar then consider this upgraded type as the type of the column. In the end, this file would be considered float columns with headers.
msg389528 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-03-25 21:03
I assume the OP is referring to this sort of usage: >>> sniffer = csv.Sniffer() >>> raw = open("mixed.csv").read() >>> sniffer.has_header(raw) False sigh I really wish the Sniffer class had never been added to the CSV module. I can't recall who wrote it (the author is long gone). Though I am responsible for the initial commits, it wasn't me or the main authors of csvmodule.c. As far as I know, it never really worked well. I can't recall ever using it. A simpler heuristic would be if the first row contains a bunch of strings and the second row contains a bunch of numbers, then the file has a header. That assumes that CSV files consist mostly of numeric data. Looking at has_header, I see this: for thisType in [int, float, complex]: I think this particular problem would be solved if the order of those types were reversed. The attached diff suggests that as well. Note that the Sniffer class currently contains no test cases, so that the test I added failed before the change and passes after doesn't mean it doesn't break someone's mission critical Sniffer usage. (Sorry, Raymond. My Github-foo is insufficient to allow me to fork, apply the diff and create a PR.)
msg396663 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-06-28 20:54
I've added the PR here, based on Skip's patch: https://github.com/python/cpython/pull/26939
msg396699 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-06-29 03:10
Skip: If I understand right, in the patch the last two types -- float and int, will never have an effect because if float(x) and int(x) succeed, so will complex(x), and conversely, if complex(x) fails, float and int will also fail. So the effect of the patch will be to tolerate any mix of numeric columns when the headers are textual. Which sounds fine to me, just want to confirm that sounds good to you, because the unit test in the patch is a much narrower case. I think the test should then be something like: a b c 1 1.1 5+0j 1.2 3+0j 4 and the code should be updated to just do `complex(x)` and remove float() and int() attempts. Another, more strict option, - would be to special case `0` to match int, float, and complex.
msg396714 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-29 11:50
Thanks @andrei.avk. You are right, only the complex test is required. I suppose it's okay to commit this, but reviewing the full code of the has_header method leaves me thinking this is just putting lipstick on a pig. If I read the code correctly, there are two (undocumented) tacit assumptions: 1. The non-header values are numeric. 2. If not numeric, they are fixed length strings whose length generally differs from the length of the putative header. The second criterion means this has a header: ab,de ghi,jkl ghi,jkl but this doesn't: abc,def ghi,jkl ghi,jkl It seems to me that it would be a good idea to at least expand on the documentation of that method and maybe add at least one test case where the CSV sample doesn't have a header. I'll try to get that done and attach a patch.
msg396716 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-29 11:59
I retract my comment about fixed length strings in the non-numeric case. There are clearly test cases (which I probably wrote, considering the values) where the sample as a header but the values are of varying length. Misread of the code on my part. I have obviously not had enough coffee yet this morning.
msg396720 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-29 12:18
Here is a change to the has_header documentation and an extra test case documenting the behavior when the sample contains strings. I'm not sure about the wording of the doc change, perhaps you can tweak it? Seems kind of clumsy to me. If it seems okay to you @andrei.avk, can you fold it into your PR?
msg396722 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-29 12:28
Here's a NEWS entry.
msg396733 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-06-29 14:26
Skip: I've updated the PR Expanded the docs a bit, I think they give a good rough idea of the logic but I feel like they can probably be improved more, let me know what you think! I spent some time thinking about it but found it tricky! :)
msg398575 - (view)	Author: Łukasz Langa (lukasz.langa) *	Date: 2021-07-30 17:10
New changeset ceea579ccc51791f3e115155d6f27905bc7544a9 by andrei kulakov in branch 'main': bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939) https://github.com/python/cpython/commit/ceea579ccc51791f3e115155d6f27905bc7544a9
msg398581 - (view)	Author: Łukasz Langa (lukasz.langa) *	Date: 2021-07-30 17:21
Since this is a heuristic change, I'm thinking we shouldn't backport this to 3.9. This way it will be easier for application authors to notice the change and control it within their apps.
msg398582 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-07-30 17:23
Łukasz: I agree.
msg398586 - (view)	Author: Łukasz Langa (lukasz.langa) *	Date: 2021-07-30 17:30
New changeset 440c9f772a9b66c1ea387c1c3efc9ff438880acf by Miss Islington (bot) in branch '3.10': bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939) (GH-27494) https://github.com/python/cpython/commit/440c9f772a9b66c1ea387c1c3efc9ff438880acf

History
Date	User	Action	Args
2022-04-11 14:59:43	admin	set	github: 87791
2021-07-30 17:30:38	lukasz.langa	set	status: open -> closed type: behavior resolution: fixed stage: patch review -> resolved
2021-07-30 17:30:16	lukasz.langa	set	messages: + msg398586
2021-07-30 17:23:01	andrei.avk	set	messages: + msg398582
2021-07-30 17:21:56	lukasz.langa	set	messages: + msg398581 versions: + Python 3.10
2021-07-30 17:10:47	miss-islington	set	nosy: + miss-islington pull_requests: + pull_request26010
2021-07-30 17:10:45	lukasz.langa	set	nosy: + lukasz.langa messages: + msg398575
2021-06-29 14:26:09	andrei.avk	set	messages: + msg396733
2021-06-29 12:28:18	skip.montanaro	set	files: + 2021-06-29-07-27-08.bpo-43625.ZlAxhp.rst messages: + msg396722
2021-06-29 12:18:31	skip.montanaro	set	files: + hasheader.diff messages: + msg396720
2021-06-29 11:59:09	skip.montanaro	set	messages: + msg396716
2021-06-29 11:50:13	skip.montanaro	set	messages: + msg396714 versions: + Python 3.11, - Python 3.7
2021-06-29 03:10:20	andrei.avk	set	messages: + msg396699
2021-06-28 20:54:30	andrei.avk	set	messages: + msg396663
2021-06-28 20:51:34	andrei.avk	set	nosy: + andrei.avk pull_requests: + pull_request25507 stage: patch review
2021-03-25 21:03:41	skip.montanaro	set	files: + csv.diff keywords: + patch messages: + msg389528
2021-03-25 20:27:44	rhettinger	set	nosy: + rhettinger
2021-03-25 20:27:32	rhettinger	set	nosy: + skip.montanaro
2021-03-25 18:18:35	ejacq	create