Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV has_headers heuristic could be improved #87791

Closed
ejacq mannequin opened this issue Mar 25, 2021 · 13 comments
Closed

CSV has_headers heuristic could be improved #87791

ejacq mannequin opened this issue Mar 25, 2021 · 13 comments
Labels
3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ejacq
Copy link
Mannequin

ejacq mannequin commented Mar 25, 2021

BPO 43625
Nosy @smontanaro, @rhettinger, @ambv, @miss-islington, @akulakov
PRs
  • bpo-43625: Enhance csv sniffer has_headers() to be more accurate #26939
  • [3.10] bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939) #27494
  • Files
  • csv.diff
  • hasheader.diff
  • 2021-06-29-07-27-08.bpo-43625.ZlAxhp.rst
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-07-30.17:30:38.304>
    created_at = <Date 2021-03-25.18:18:35.215>
    labels = ['type-bug', 'library', '3.10', '3.11']
    title = 'CSV has_headers heuristic could be improved'
    updated_at = <Date 2021-07-30.17:30:38.304>
    user = 'https://bugs.python.org/ejacq'

    bugs.python.org fields:

    activity = <Date 2021-07-30.17:30:38.304>
    actor = 'lukasz.langa'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-07-30.17:30:38.304>
    closer = 'lukasz.langa'
    components = ['Library (Lib)']
    creation = <Date 2021-03-25.18:18:35.215>
    creator = 'ejacq'
    dependencies = []
    files = ['49915', '50131', '50132']
    hgrepos = []
    issue_num = 43625
    keywords = ['patch']
    message_count = 13.0
    messages = ['389515', '389528', '396663', '396699', '396714', '396716', '396720', '396722', '396733', '398575', '398581', '398582', '398586']
    nosy_count = 6.0
    nosy_names = ['skip.montanaro', 'rhettinger', 'lukasz.langa', 'miss-islington', 'andrei.avk', 'ejacq']
    pr_nums = ['26939', '27494']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue43625'
    versions = ['Python 3.10', 'Python 3.11']

    @ejacq
    Copy link
    Mannequin Author

    ejacq mannequin commented Mar 25, 2021

    Here is an sample of CSV input:

    "time","forces"
    0,0
    0.5,0.9

    when calling has_header() from csv.py on this sample, it returns false.
    Why? because 0 and 0.5 don't belong to the same type and thus the column is discarded by the heuristic.

    I think the heuristic will better work if rather than just comparing number types, it would also consider casting the values in this order int -> float -> complex. If the values are similar then consider this upgraded type as the type of the column.

    In the end, this file would be considered float columns with headers.

    @ejacq ejacq mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir labels Mar 25, 2021
    @smontanaro
    Copy link
    Contributor

    I assume the OP is referring to this sort of usage:

    >>> sniffer = csv.Sniffer()
    >>> raw = open("mixed.csv").read()
    >>> sniffer.has_header(raw)
    False

    *sigh*

    I really wish the Sniffer class had never been added to the CSV module. I can't recall who wrote it (the author is long gone). Though I am responsible for the initial commits, it wasn't me or the main authors of csvmodule.c. As far as I know, it never really worked well. I can't recall ever using it.

    A simpler heuristic would be if the first row contains a bunch of strings and the second row contains a bunch of numbers, then the file has a header. That assumes that CSV files consist mostly of numeric data.

    Looking at has_header, I see this:

        for thisType in [int, float, complex]:

    I think this particular problem would be solved if the order of those types were reversed. The attached diff suggests that as well. Note that the Sniffer class currently contains no test cases, so that the test I added failed before the change and passes after doesn't mean it doesn't break someone's mission critical Sniffer usage.

    (Sorry, Raymond. My Github-foo is insufficient to allow me to fork, apply the diff and create a PR.)

    @akulakov
    Copy link
    Contributor

    I've added the PR here, based on Skip's patch:
    #26939

    @akulakov
    Copy link
    Contributor

    Skip: If I understand right, in the patch the last two types -- float and int, will never have an effect because if float(x) and int(x) succeed, so will complex(x), and conversely, if complex(x) fails, float and int will also fail.

    So the effect of the patch will be to tolerate any mix of numeric columns when the headers are textual. Which sounds fine to me, just want to confirm that sounds good to you, because the unit test in the patch is a much narrower case.

    I think the test should then be something like:

    a b c
    1 1.1 5+0j
    1.2 3+0j 4

    and the code should be updated to just do complex(x) and remove float() and int() attempts.

    Another, more strict option, - would be to special case 0 to match int, float, and complex.

    @smontanaro
    Copy link
    Contributor

    Thanks @andrei.avk. You are right, only the complex test is required.

    I suppose it's okay to commit this, but reviewing the full code of the has_header method leaves me thinking this is just putting lipstick on a pig. If I read the code correctly, there are two (undocumented) tacit assumptions:

    1. The non-header values are numeric.

    2. If not numeric, they are fixed length strings whose length generally differs from the length of the putative header.

    The second criterion means this has a header:

    ab,de
    ghi,jkl
    ghi,jkl

    but this doesn't:

    abc,def
    ghi,jkl
    ghi,jkl

    It seems to me that it would be a good idea to at least expand on the documentation of that method and maybe add at least one test case where the CSV sample doesn't have a header. I'll try to get that done and attach a patch.

    @smontanaro smontanaro added 3.11 only security fixes and removed 3.7 (EOL) end of life labels Jun 29, 2021
    @smontanaro
    Copy link
    Contributor

    I retract my comment about fixed length strings in the non-numeric case. There are clearly test cases (which I probably wrote, considering the values) where the sample as a header but the values are of varying length. Misread of the code on my part. I have obviously not had enough coffee yet this morning.

    @smontanaro
    Copy link
    Contributor

    Here is a change to the has_header documentation and an extra test case documenting the behavior when the sample contains strings. I'm not sure about the wording of the doc change, perhaps you can tweak it? Seems kind of clumsy to me. If it seems okay to you @andrei.avk, can you fold it into your PR?

    @smontanaro
    Copy link
    Contributor

    Here's a NEWS entry.

    @akulakov
    Copy link
    Contributor

    Skip: I've updated the PR

    Expanded the docs a bit, I think they give a good rough idea of the logic but I feel like they can probably be improved more, let me know what you think! I spent some time thinking about it but found it tricky! :)

    @ambv
    Copy link
    Contributor

    ambv commented Jul 30, 2021

    New changeset ceea579 by andrei kulakov in branch 'main':
    bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939)
    ceea579

    @ambv
    Copy link
    Contributor

    ambv commented Jul 30, 2021

    Since this is a heuristic change, I'm thinking we shouldn't backport this to 3.9. This way it will be easier for application authors to notice the change and control it within their apps.

    @ambv ambv added 3.10 only security fixes labels Jul 30, 2021
    @akulakov
    Copy link
    Contributor

    Łukasz: I agree.

    @ambv
    Copy link
    Contributor

    ambv commented Jul 30, 2021

    New changeset 440c9f7 by Miss Islington (bot) in branch '3.10':
    bpo-43625: Enhance csv sniffer has_headers() to be more accurate (GH-26939) (GH-27494)
    440c9f7

    @ambv ambv closed this as completed Jul 30, 2021
    @ambv ambv added the type-bug An unexpected behavior, bug, or error label Jul 30, 2021
    @ambv ambv closed this as completed Jul 30, 2021
    @ambv ambv added the type-bug An unexpected behavior, bug, or error label Jul 30, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants