
csv.DictWriter is slow when writing files with large number of columns #62419

Closed
mtraskin mannequin opened this issue Jun 15, 2013 · 23 comments
Labels
3.7 (EOL) · performance (Performance or resource usage) · stdlib (Python modules in the Lib dir)

Comments


mtraskin mannequin commented Jun 15, 2013

BPO 18219
Nosy @terryjreedy, @vstinner, @bitdancer, @methane, @serhiy-storchaka, @Mariatta, @hughdbrown
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • csvdictwriter.patch
  • csvdictwriter.v2.patch
  • csvdictwriter.v3.patch
  • csvdictwriter.v4.patch
  • issue18219.patch
  • issue18219v2.patch
  • issue18219v3.patch
  • issue18219v4.patch
  • issue18219v5.patch
  • issue18219v6.patch
  • issue18219v7.patch
  • issue18219v8.patch
  • issue18219v9.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-10-21.10:54:34.855>
    created_at = <Date 2013-06-15.05:12:39.063>
    labels = ['3.7', 'library', 'performance']
    title = 'csv.DictWriter is slow when writing files with large number of columns'
    updated_at = <Date 2017-03-31.16:36:23.298>
    user = 'https://bugs.python.org/mtraskin'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:23.298>
    actor = 'dstufft'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-10-21.10:54:34.855>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2013-06-15.05:12:39.063>
    creator = 'mtraskin'
    dependencies = []
    files = ['30598', '30605', '31297', '31570', '45167', '45168', '45169', '45170', '45172', '45173', '45174', '45175', '45176']
    hgrepos = []
    issue_num = 18219
    keywords = ['patch']
    message_count = 23.0
    messages = ['191197', '191198', '191263', '191604', '195233', '195245', '196821', '279058', '279101', '279105', '279107', '279108', '279109', '279115', '279116', '279117', '279118', '279119', '279120', '279121', '279124', '279128', '279129']
    nosy_count = 10.0
    nosy_names = ['terry.reedy', 'peter.otten', 'vstinner', 'r.david.murray', 'methane', 'python-dev', 'serhiy.storchaka', 'mtraskin', 'Mariatta', 'hughdbrown']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue18219'
    versions = ['Python 3.6', 'Python 3.7']


    mtraskin mannequin commented Jun 15, 2013

    The _dict_to_list method of csv.DictWriter objects created with extrasaction="raise" uses a look-up in the list of field names to check whether the current row has any unknown fields. This results in O(n^2) execution time and is very slow when a CSV file has many columns (in the hundreds or thousands). Replacing the look-up in a list with a look-up in a set solves the issue (see the attached patch).
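    (Editor's sketch: the quadratic behavior described above is easy to reproduce; the column count and variable names below are illustrative, not from the patch.)

```python
import time

# Illustrative setup: 2000 columns, every row key valid.
fieldnames = [f"col{i}" for i in range(2000)]
rowdict = {name: "x" for name in fieldnames}

# Original approach: membership test against a list -> O(n) per key, O(n^2) total.
start = time.perf_counter()
wrong_from_list = [k for k in rowdict if k not in fieldnames]
list_time = time.perf_counter() - start

# Patched approach: the same check against a set -> O(1) per key, O(n) total.
fieldset = set(fieldnames)
start = time.perf_counter()
wrong_from_set = [k for k in rowdict if k not in fieldset]
set_time = time.perf_counter() - start

print(wrong_from_list == wrong_from_set)  # same (empty) result either way
```

    With 2000 columns the list version performs on the order of two million string comparisons per row, while the set version performs 2000 hash look-ups.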

    @mtraskin mtraskin mannequin added the stdlib and performance labels Jun 15, 2013
    @serhiy-storchaka
    Member

    I think there is no need for a public fieldset property. Just use a private self._fieldset field in the private _dict_to_list() method.


    mtraskin mannequin commented Jun 16, 2013

    Either way is fine with me. If you prefer to avoid having a public fieldset property, please use the attached patch.

    @terryjreedy
    Member

    What is the purpose of touching fieldnames, either in tuple-izing it or in making it private and wrapping it with a property? If someone wants to modify it, that is up to them. In any case, this change is not germane to the issue and could break code, so I would not make it.

    wrong_fields could be calculated with

        any(k for k in rowdict if k not in self._fieldset)

    to stop on the first extra, if any.

    That said, in 3.x, replacing

        wrong_fields = <long expression>
        if wrong_fields:

    with

        if rowdict.keys() - self._fieldset:

    should be even faster because the iteration, which will nearly always go to completion, is entirely in C (or whatever).

    Does test/test_csv have tests for DictWriter, both with and without rowdict errors? If so, or if they are added, I would be willing to commit a patch that simply adds ._fieldset and uses it as above for a set difference.

    Also, if you have not done so yet, please go to http://www.python.org/psf/contrib/ and submit a contributor agreement via the new electronic form at http://www.python.org/psf/contrib/contrib-form/. An '*' will appear after your name here when it is processed.


    mtraskin mannequin commented Aug 15, 2013

    What is the purpose in touching fieldnames [...]

    Wrapping the fieldnames property and tuple-izing it guarantees that the fieldnames and _fieldset fields stay consistent.

    Otherwise, having a separate _fieldset field means that someone who modifies the fieldnames field will not update _fieldset, which results in inconsistent DictWriter behavior. Normal DictWriter users (those who do not modify fieldnames after the DictWriter is created) will not notice this wrapper. "Non-normal" DictWriter users will have their code broken, but that is better than having inconsistent internal data structures, since such errors are very hard to detect. If you insist on keeping the interface intact, then use the attached v3 of the patch: it creates a fieldset object every time the _dict_to_list method executes. This does slow execution down, but performance is acceptable, only about 1.5 times slower than the version with a _fieldset field.

    wrong_fields could be calculated with [...]

    I believe it is better to report all wrong fields at once. In addition, this optimization is of little value: usually, unless something is wrong, the field check requires a full scan of the rowdict anyway.

    That said, in 3.x, replacing [...]

    In 2.x the list-comprehension version is faster than the set-difference version. In 3.x the set difference is slightly faster (maybe 10%). However, the list comprehension works in both 2.x and 3.x, while the set difference requires different code for each, so I prefer sticking with the list comprehension.

    Does test/test_csv have tests [...]

    No, there are no tests for wrong fields; correct fields are already checked by the standard writing tests. I do not know how to write tests for exception handling. If you provide a link with instructions, I can write the missing test part.

    Also, if you have not done so yet, please go to [...]
    I have already done this.


    peterotten mannequin commented Aug 15, 2013

    Note that set operations on dict views work with lists, too. So the only change necessary is to replace

        wrong_fields = [k for k in rowdict if k not in self.fieldnames]

    with

        wrong_fields = rowdict.keys() - self.fieldnames

    (A backport to 2.7 would need to replace keys() with viewkeys().)
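    (Editor's sketch: Peter's point can be checked directly. Dict key views support set operations against arbitrary iterables, including plain lists; the names below are illustrative.)

```python
rowdict = {"a": 1, "b": 2, "zzz": 3}
fieldnames = ["a", "b", "c"]  # a plain list, not a set

# The keys view implements set difference even when the right-hand
# operand is a list; the result is an ordinary set.
wrong_fields = rowdict.keys() - fieldnames
print(wrong_fields)  # {'zzz'}
```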


    mtraskin mannequin commented Sep 3, 2013

    Peter, thank you for letting me know that views work with lists; I was not aware of this. This is indeed the best solution, and it also keeps the DictWriter interface unchanged.

    Terry, attached patch contains the DictWriter change and a test case in test_csv.py.


    hughdbrown mannequin commented Oct 20, 2016

    I came across this problem today when I was using a 1000+ column CSV from a client. It was taking about 15 minutes to process each file. I found the problem and made this change:

                # wrong_fields = [k for k in rowdict if k not in self.fieldnames]
                wrong_fields = set(rowdict.keys()) - set(self.fieldnames)

    And my processing time went down to 12 seconds per file -- a 75x speedup.

    It's kind of sad that this change has been waiting for over three years when it is so simple. Any chance we could make one of the acceptable code changes and release it?

    @SilentGhost SilentGhost mannequin added the 3.7 (EOL) label Oct 20, 2016
    @Mariatta
    Member

    Hello, please review my patch.

    I used set subtraction to calculate wrong_fields, added more test cases, and clarified the documentation with regard to the extrasaction parameter.

    Please let me know if this works.

    Thanks :)


    hughdbrown mannequin commented Oct 21, 2016

    Fabulous. Looks great. Let's ship!

    It is not the *optimal* fix for 3.x, though. A better fix would calculate the set of fieldnames only once, in __init__ (or only as often as fieldnames is changed).

    But I stress that it is a robust change that works in versions 2.7 through 3.x, and it is *way* better than the alternative of searching a list.

    @Mariatta
    Member

    Thanks Hugh,

    Are you thinking of something like the following?

    class DictWriter:
        def __init__(self, f, fieldnames, restval="", extrasaction="raise",
                     dialect="excel", *args, **kwds):
            self._fieldnames = fieldnames    # list of keys for the dict
            self._fieldnames_set = set(self._fieldnames)
    
        @property
        def fieldnames(self):
            return self._fieldnames
    
        @fieldnames.setter
        def fieldnames(self, value):
            self._fieldnames = value
            self._fieldnames_set = set(self._fieldnames)
    
    
        def _dict_to_list(self, rowdict):
            if self.extrasaction == "raise":
                wrong_fields = rowdict.keys() - self._fieldnames_set
         
        ...

    If so, I can work on another patch.
    Thanks.
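    (Editor's sketch: a self-contained, runnable version of the idea above. The subclass name CachedDictWriter and the error message are illustrative; this is not the committed patch. The property setter keeps the cached set in sync whenever fieldnames is reassigned, and DictWriter.__init__'s assignment to self.fieldnames triggers it on construction.)

```python
import csv
import io

class CachedDictWriter(csv.DictWriter):
    # Illustrative subclass: cache set(fieldnames) and keep it in sync.
    @property
    def fieldnames(self):
        return self._fieldnames

    @fieldnames.setter
    def fieldnames(self, value):
        # DictWriter.__init__ assigns self.fieldnames, so this also runs
        # at construction time.
        self._fieldnames = value
        self._fieldnames_set = set(value)

    def _dict_to_list(self, rowdict):
        if self.extrasaction == "raise":
            # Set difference against the cached set: no per-row set() rebuild.
            wrong_fields = rowdict.keys() - self._fieldnames_set
            if wrong_fields:
                raise ValueError("dict contains fields not in fieldnames: "
                                 + ", ".join(map(repr, wrong_fields)))
        return (rowdict.get(key, self.restval) for key in self.fieldnames)

buf = io.StringIO()
writer = CachedDictWriter(buf, fieldnames=["a", "b"])
writer.writeheader()
writer.writerow({"a": 1, "b": 2})
print(buf.getvalue())  # 'a,b\r\n1,2\r\n'
```

    Writing a row with an unknown key (e.g. {"a": 1, "oops": 2}) raises ValueError, as with the stock DictWriter.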


    hughdbrown mannequin commented Oct 21, 2016

    Mariatta:

    Yes, that is what I was thinking of.

    That takes my 12-second execution time down to 10 seconds. (Or at least, a fix I made along these lines had that effect -- I have not timed your patch, but it should be the same.)

    @Mariatta
    Member

    Thanks, Hugh.
    Please check the updated patch :)


    methane commented Oct 21, 2016

    LGTM, Thanks Mariatta.
    (But one more LGTM from coredev is required for commit)

    @vstinner
    Member

    issue18219v6.patch: LGTM, but I added a minor PEP-8 comment.

    INADA Naoki: "LGTM, Thanks Mariatta. (But one more LGTM from coredev is required for commit)"

    If you are confident (ex: if the change is simple, like this one), you can push it directly.

    @Mariatta
    Member

    Inada-san, Victor, thank you.

    Here is the updated patch.


    methane commented Oct 21, 2016

    If you are confident (ex: if the change is simple, like this one), you can push it directly.

    My mentor (Yury) prohibits it while I'm a beginner.
    And as you saw, I missed a PEP-8 violation :)

    @vstinner
    Member

    My mentor (Yury) prohibits it while I'm a beginner.

    Oh right, trust your mentor more than me ;-)


    python-dev mannequin commented Oct 21, 2016

    New changeset 1928074e6519 by INADA Naoki in branch '3.6':
    Issue bpo-18219: Optimize csv.DictWriter for large number of columns.
    https://hg.python.org/cpython/rev/1928074e6519

    New changeset 6f1602dfa4d5 by INADA Naoki in branch 'default':
    Issue bpo-18219: Optimize csv.DictWriter for large number of columns.
    https://hg.python.org/cpython/rev/6f1602dfa4d5


    methane commented Oct 21, 2016

    committed.

    @methane methane closed this as completed Oct 21, 2016
    @serhiy-storchaka
    Member

    Shouldn't docs changes and new tests be added to 3.5?

    @bitdancer
    Member

    Serhiy: I know you prefer applying test changes to the maint version, and I don't disagree, but there are others who prefer not to and we really don't have an official policy on it at this point. (We used to say no, a few years ago :)

    The doc change looks wrong to me. It looks like an rst source paragraph was split into separate lines instead of being a flowed paragraph in the source? I don't understand why that was done.

    @Mariatta
    Member

    Thanks, David. I uploaded a patch to address your concern about the docs.
    Can you please check?

    Serhiy, with regard to applying the docs and tests to 3.5, does that require a different patch than what I have?

    Thanks.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022