-
-
Notifications
You must be signed in to change notification settings - Fork 29.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
csv.DictWriter is slow when writing files with large number of columns #62419
Comments
_dict_to_list method of the csv.DictWriter objects created with extrasaction="raise" uses look-up in the list of field names to check if current row has any unknown fields. This results in O(n^2) execution time and is very slow if there are a lot of columns in a CSV file (in hundreds or thousands). Replacing look-up in a list with a look-up in a set solves the issue (see the attached patch). |
I think there is no need in public fieldset property. Just use private self._fieldset field in private _dict_to_list() method. |
Any way is fine with me. If you prefer to avoid having public filedset property, please use the attached patch. |
What is the purpose in touching fieldnames, either in tuple-izing it or in making it private and wrapped with a property. If someone wants to modify it, that is up to them. In any case, this change is not germane to the issue and could break code, so I would not make it. wrong_fields could be calculated with That said, in 3.x, replacing wrong_fields = <long expression> Does test/text_cvs have tests for DictWriter, both with and without rowdict errors? If so, or if added, I would be willing to commit a patch that simply added ._fieldset and used it as above for a set difference. Also, if you have not done so yet, please go to |
Wrapping the fieldnames property and tupleizing it guarantees that fieldnames and _fieldset fields are consistent.
In 2.x the list comprehension version is faster than the set difference version. In 3.x the set difference is slightly faster (maybe 10% faster). However, list comprehension works both in 2.x and 3.x, while set difference requires different code for them. Hence I prefer sticking with list comprehension.
|
Note that set operations on dict views work with lists, too. So the only change necessary is to replace wrong_fields = [k for k in rowdict if k not in self.fieldnames] with wrong_fields = rowdict.keys() - self.filenames (A backport to 2.7 would need to replace keys() with viewkeys()) |
Peter, thank you for letting me know that views work with list, I was not aware of this. This is indeed the best solution and it also keeps the DictWriter interface unchanged. Terry, attached patch contains the DictWriter change and a test case in test_csv.py. |
I came across this problem today when I was using a 1000+ column CSV from a client. It was taking about 15 minutes to process each file. I found the problem and made this change: # wrong_fields = [k for k in rowdict if k not in self.fieldnames]
wrong_fields = set(rowdict.keys()) - set(self.fieldnames) And my processing time went down to 12 seconds per file -- a 75x speedup. It's kind of sad that this change has been waiting for over three years when it is so simple. Any chance we could make one of the acceptable code changes and release it? |
Hello, please review my patch. I used set subtraction to calculate wrong_fields, added more test cases, and clarify documentation with regards to extrasaction parameter. Please let me know if this works. Thanks :) |
Fabulous. Looks great. Let's ship! It is not the *optimal* fix for 3.x platforms. A better fix would calculate the set of fieldnames only once in __init__ (or only as often as fieldnames is changed). But I stress that it is a robust change that works in versions 2.7 through 3.x for sure. And it is *way* better than the alternative of searching a list. |
Thanks Hugh, Are you thinking of something like the following? class DictWriter:
def __init__(self, f, fieldnames, restval="", extrasaction="raise",
dialect="excel", *args, **kwds):
self._fieldnames = fieldnames # list of keys for the dict
self._fieldnames_set = set(self._fieldnames)
@property
def fieldnames(self):
return self._fieldnames
@fieldnames.setter
def fieldnames(self, value):
self._fieldnames = value
self._fieldnames_set = set(self._fieldnames)
def _dict_to_list(self, rowdict):
if self.extrasaction == "raise":
wrong_fields = rowdict.keys() - self._fieldnames_set
... If so, I can work on another patch. |
Mariatta: Yes, that is what I was thinking of. That takes my 12 execution time down to 10 seconds. (Or, at least, a fix I did of this nature had that effect -- I have not timed your patch but it should be the same.) |
Thanks, Hugh. |
LGTM, Thanks Mariatta. |
issue18219v6.patch: LGTM, but I added a minor PEP-8 comment. INADA Naoki: "LGTM, Thanks Mariatta. (But one more LGTM from coredev is required for commit)" If you are confident (ex: if the change is simple, like this one), you can push it directly. |
Inada-san, Victor, thank you. Here is the updated patch. |
My mentor (Yury) prohibit it while I'm beginner. |
Oh right, trust your mentor more than me ;-) |
New changeset 1928074e6519 by INADA Naoki in branch '3.6': New changeset 6f1602dfa4d5 by INADA Naoki in branch 'default': |
committed. |
Shouldn't docs changes and new tests be added to 3.5? |
Serhiy: I know you prefer applying test changes to the maint version, and I don't disagree, but there are others who prefer not to and we really don't have an official policy on it at this point. (We used to say no, a few years ago :) The doc change looks wrong to me. It looks like a rst source paragraph was split into separate lines instead of being a flowed paragraph in the source? I don't understand why that was done. |
Thanks David. I uploaded patch to address your concern with the docs. Serhiy, with regards to applying docs and test to 3.5, does that require a different patch than what I have? Thanks. |
Misc/NEWS
so that it is managed by towncrier #552Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: