Issue1606092
Created on 2006-11-30 14:46 by jettlogic, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (10) | |||
---|---|---|---|
msg30718 - (view) | Author: JettLogic (jettlogic) | Date: 2006-11-30 14:46 | |
The csv module does not accept data to write/read as anything other than ascii-or-utf-8 str, and the do-it-yourself example in the Python 2.5 Manual to write in another encoding is extremely clunky:
1) convert unicode to utf-8
2) use csv on utf-8 with cStringIO output
3) convert utf-8 to unicode
4) convert unicode to target encoding (may be utf-8...)
So clunky as to be a bug - csv clearly can't handle unicode at all. The module functions are in dire need of either accepting unicode objects (letting the output stream worry about the encoding, like codecs.StreamWriter), or at the very least accepting data directly in a target encoding instead of roundabout utf-8.
To read another encoding is a bit less onerous than writing:
1) wrap file to return utf-8
2) use csv, getting utf-8 output
3) convert utf-8 to unicode object
Anyone willing to fix the csv module? |
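For readers on Python 3, the behaviour requested above (hand text to csv and let the output stream worry about the encoding, as codecs.StreamWriter does) can be sketched roughly as follows. This is an illustration only, not code from the Python 2.5 Manual; the helper name `write_rows` is hypothetical, and it assumes Python 3, where csv.writer accepts str.

```python
import codecs
import csv
import io


def write_rows(rows, byte_stream, encoding):
    """Write rows as CSV text; the stream wrapper does the encoding.

    Hypothetical helper for illustration, assuming Python 3.
    """
    # codecs.getwriter returns a StreamWriter class for the given encoding;
    # the instance encodes every str passed to write() before storing it.
    text_stream = codecs.getwriter(encoding)(byte_stream)
    csv.writer(text_stream).writerows(rows)


buffer = io.BytesIO()
write_rows([["café", 3]], buffer, "latin-1")
data = buffer.getvalue()  # bytes in the target encoding, no UTF-8 detour
```

No intermediate UTF-8 conversion is involved here; the rows are encoded once, directly into the target encoding.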
|||
msg30719 - (view) | Author: Martin v. Löwis (loewis) | Date: 2006-12-03 10:34 | |
Are you willing to fix it? |
|||
msg30720 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2006-12-03 12:54 | |
It should be easy to provide a wrapper class which implements the above in plain Python. However, if no one volunteers to write such code, it's not going to happen. I've found that the builtin csv module is not flexible enough to deal with the often broken CSV data you typically find in practice, so perhaps adding a pure Python implementation which works with Unicode might prove to be a better approach. Unassigning the report, since I don't have time for this. |
|||
msg30721 - (view) | Author: Skip Montanaro (skip.montanaro) | Date: 2006-12-03 18:22 | |
I must admit I don't understand the criticism of the UnicodeReader and UnicodeWriter example classes in the module documentation. Sure, their implementations jump through some hoops, but that's so you don't have to. If you use them as written, I believe their APIs should be about the same as the csv.reader and csv.writer classes, with the added improvement that the reader returns Unicode and the writer accepts Unicode. If your desire is to read and write Unicode, why do you care that those objects are encoded using utf-8 in the file? Like Martin asked, are you willing to come up with better examples? Better yet, are you willing to provide a patch for the underlying extension module so it handles Unicode? Hint: I'm fairly certain that if it was trivial it would have been done by now. |
|||
msg30722 - (view) | Author: JettLogic (jettlogic) | Date: 2006-12-05 11:53 | |
Anyone know why it uses a C extension? The C code apparently appends fields to a writable byte buffer (so patching for unicode is impossible), reallocated as it grows. How much efficiency is gained by doing that, with its many lines of logic overhead, versus careful use of python strings? For montanaro, the UnicodeWriter with three coding conversions and a StringIO shows there is, however, much efficiency to be lost. Perhaps lemburg's suggestion of a pure-python re-implementation of _csv is the way to go. It does not look like a fun task, after adding in back-compatibility, benchmarks and tests, and I couldn't commit to it just yet. Are C->Py patches typically accepted? (assume quality code and comparable benchmarks) I'll have to leave it at that. If you leave this open, someone might take it up at some point. |
|||
msg30723 - (view) | Author: Skip Montanaro (skip.montanaro) | Date: 2006-12-06 15:56 | |
> Anyone know why it uses a C extension?
Performance. A number of people (among them the authors of the _csv extension and me, a contributor to the Python csv module that fronts it) routinely read and write large (several megabytes) CSV files. We had all had experience with earlier Python-only CSV readers and writers. Their performance was just too poor. If you wrote a new module in Python that's compatible with the existing module -- and performed acceptably -- I see no reason it couldn't replace the current module. There are already a number of test cases. You'd certainly have to embellish them, but if the current set passed that would be a good indication your code was at least on the right track compatibility-wise.
There are reasons other than Unicode support to desire a Python-based solution. It would be much more likely that such a module would work with other Python implementations (e.g., PyPy, IronPython and Jython).
Skip |
|||
msg72956 - (view) | Author: Mike Statkus (mstatkus) | Date: 2008-09-10 11:37 | |
The UnicodeWriter.writerow(self, row) example in section 9.1.5 of the Python 2.5 Manual (Examples on CSV module of standard library) does not correctly handle rows that contain int values alongside strings; it raises an AttributeError. The first line of code in UnicodeWriter.writerow:
    self.writer.writerow([s.encode("utf-8") for s in row])
tries to call the .encode() method on s, which might be an int rather than a string. A simple workaround is:
    self.writer.writerow([unicode(s).encode("utf-8") for s in row]) |
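The failure and the workaround can be reproduced in isolation. The sketch below is a Python 3 translation, not the manual's code: `str()` stands in for Python 2's `unicode()`, and the function name `encode_row` is hypothetical.

```python
def encode_row(row, encoding="utf-8"):
    """Encode every cell of a row, converting non-string values first.

    Hypothetical helper; str() plays the role of Python 2's unicode().
    """
    return [str(value).encode(encoding) for value in row]


def naive_encode_row(row, encoding="utf-8"):
    """Mirror of the original example: assumes every cell has .encode()."""
    return [value.encode(encoding) for value in row]


mixed_row = ["é", 42]

try:
    naive_encode_row(mixed_row)
    naive_failed = False
except AttributeError:
    # int objects have no .encode() method, so the naive version fails here.
    naive_failed = True

encoded = encode_row(mixed_row)
```

Converting each value to text before encoding means ints, floats and similar values are written in their default string form.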
|||
msg95708 - (view) | Author: Craig McQueen (cmcqueen1975) | Date: 2009-11-25 02:48 | |
Is this still an open bug? I have the following code:
    lookup = {}
    csv_reader = csv.reader(codecs.open(lookup_file_name, 'r', 'utf-8'))
    for row in csv_reader:
        lookup[row[1]] = row[0]
And it "appears to work" (it runs) using Python 2.6.2. So has this bug been fixed? |
|||
msg95709 - (view) | Author: Craig McQueen (cmcqueen1975) | Date: 2009-11-25 03:07 | |
I think I see now--it accepts Unicode input, but converts it back to bytes internally using the ASCII codec. So it works as long as the Unicode input contains only ASCII characters. That's a gotcha. It appears that it's been fixed in Python 3.x, judging by the documentation. |
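A rough way to see which inputs would trip over this is to try the ASCII encoding explicitly. This sketch assumes the internal conversion behaves like an explicit `.encode("ascii")`, as described above; the helper name `survives_ascii` is hypothetical.

```python
def survives_ascii(text):
    """Return True if text can be encoded with the ASCII codec.

    Hypothetical helper; assumes the implicit conversion behaves like
    an explicit .encode("ascii") call.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


plain_ok = survives_ascii("abc,123")   # only ASCII characters
accent_ok = survives_ascii("café")     # contains a non-ASCII character
```

Data that happens to be pure ASCII passes silently, which is why the problem only shows up once a non-ASCII character appears in the input.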
|||
msg95717 - (view) | Author: R. David Murray (r.david.murray) | Date: 2009-11-25 12:17 | |
This is indeed fixed in Python 3. If someone wishes to step forward with patches for 2.7, they can reopen this bug, but I don't think it is worth the effort. |
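For comparison, a Python 3 version of the lookup snippet from msg95708 might look like the sketch below. The file contents and the names `load_lookup` and `demo` are made up for illustration; the encoding is passed to the built-in `open()`, and `newline=""` follows the csv module documentation.

```python
import csv
import os
import tempfile


def load_lookup(path, encoding="utf-8"):
    """Map column 1 to column 0, as in the snippet from msg95708."""
    lookup = {}
    with open(path, "r", encoding=encoding, newline="") as handle:
        for row in csv.reader(handle):
            lookup[row[1]] = row[0]
    return lookup


def demo():
    """Write a small non-ASCII CSV file (made-up data) and read it back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as handle:
            csv.writer(handle).writerows([["1", "café"], ["2", "日本語"]])
        return load_lookup(path)
    finally:
        os.remove(path)


result = demo()
```

The rows come back as str objects, so non-ASCII keys work without any manual encode or decode step.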
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:21 | admin | set | github: 44291 |
2009-11-25 12:17:27 | r.david.murray | set | status: open -> closed type: enhancement nosy: + r.david.murray messages: + msg95717 resolution: out of date stage: resolved |
2009-11-25 03:07:18 | cmcqueen1975 | set | messages: + msg95709 versions: + Python 2.6, Python 2.5, Python 2.4, Python 2.7 |
2009-11-25 02:48:48 | cmcqueen1975 | set | messages: + msg95708 |
2009-11-25 02:44:30 | cmcqueen1975 | set | nosy: + cmcqueen1975 |
2008-09-10 11:37:47 | mstatkus | set | nosy: + mstatkus messages: + msg72956 |
2006-11-30 14:46:19 | jettlogic | create |