classification
Title: csv module broken for unicode
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 2.4, Python 2.7, Python 2.6, Python 2.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: cmcqueen1975, jettlogic, lemburg, loewis, mstatkus, r.david.murray, skip.montanaro
Priority: normal Keywords:

Created on 2006-11-30 14:46 by jettlogic, last changed 2009-11-25 12:17 by r.david.murray. This issue is now closed.

Messages (10)
msg30718 - Author: JettLogic (jettlogic) Date: 2006-11-30 14:46
The csv module does not accept data to write/read as anything other than ascii-or-utf-8 str, and the do-it-yourself example in the Python 2.5 Manual to write in another encoding is extremely clunky: 

1) convert unicode to utf-8
2) use csv on utf-8 with cStringIO output
3) convert utf-8 to unicode
4) convert unicode to target encoding (may be utf-8...)

So clunky as to be a bug - csv clearly can't handle unicode at all.  The module functions are in dire need of either accepting unicode objects (letting the output stream worry about the encoding, like codecs.StreamWriter), or at the very least accepting data directly in a target encoding instead of roundabout utf-8.
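For illustration only, here is a minimal sketch of the behaviour requested above, where the output stream handles the encoding. It is written for Python 3 (not the 2.x versions this report is about), where csv.writer accepts text directly. The helper name write_rows and the sample rows are made up for the example.

```python
import codecs
import csv
import io


def write_rows(rows, encoding):
    """Write rows of text as CSV and return the bytes in the target encoding.

    Hypothetical helper for illustration; not part of the csv module.
    """
    raw = io.BytesIO()
    # codecs.getwriter() returns a StreamWriter class. Its instances encode
    # every str passed to write() before forwarding it to the byte stream.
    stream = codecs.getwriter(encoding)(raw)
    writer = csv.writer(stream)
    writer.writerows(rows)
    return raw.getvalue()


# Sample data (made up): non-ASCII text written straight to latin-1.
data = write_rows([["name", "city"], ["Löwis", "Zürich"]], "latin-1")
```

With this arrangement there is a single encoding step, performed by the stream, rather than the utf-8 round trip listed in steps 1 to 4.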

To read another encoding is a bit less onerous than writing:

1) wrap file to return utf-8
2) use csv, getting utf-8 output
3) convert utf-8 to unicode object

Anyone willing to fix the csv module?
msg30719 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-12-03 10:34
Are you willing to fix it?
msg30720 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-12-03 12:54
It should be easy to provide a wrapper class which implements the above in plain Python.

However, if no one volunteers to write such code, it's not going to happen.

I've found that the builtin csv module is not flexible enough to deal with the often broken CSV data you typically find in practice, so perhaps adding a pure Python implementation which works with Unicode might prove to be a better approach.

Unassigning the report, since I don't have time for this.
msg30721 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2006-12-03 18:22
I must admit I don't understand the criticism of the UnicodeReader and UnicodeWriter example classes in the module documentation.  Sure, their implementations jump through some hoops, but that's so you don't have to.  If you use them as written I believe their APIs should be about the same as the csv.reader and csv.writer classes with the added improvement that the reader returns Unicode and the writer accepts Unicode.  If your desire is to read and write Unicode why do you care that those objects are encoded using utf-8 in the file?

Like Martin asked, are you willing to come up with better examples?  Better yet, are you willing to provide a patch for the underlying extension module so it handles Unicode?  Hint: I'm fairly certain that if it was trivial it would have been done by now.
msg30722 - Author: JettLogic (jettlogic) Date: 2006-12-05 11:53
Anyone know why it uses a C extension?  The C code apparently appends fields to a writable byte buffer (so patching for unicode is impossible), reallocated as it grows.  How much efficiency is gained by doing that, with its many lines of logic overhead, versus careful use of python strings?  For montanaro: the UnicodeWriter, with its three coding conversions and a StringIO, shows there is, however, much efficiency to be lost.

Perhaps lemburg's suggestion of a pure-python re-implementation of _csv is the way to go.  It does not look like a fun task, after adding in back-compatibility, benchmarks and tests, and I couldn't commit to it just yet.  Are C->Py patches typically accepted?  (assume quality code and comparable benchmarks)

I'll have to leave it at that.  If you leave this open, someone might take it up at some point.
msg30723 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2006-12-06 15:56
> Anyone know why it uses a C extension?

Performance.  A number of people (among them the authors of the _csv
extension and me, a contributor to the Python csv module that fronts
it) routinely read and write large (several megabytes) CSV files.  We
had all had experience with earlier Python-only CSV readers and
writers.  Their performance was just too poor.

If you wrote a new module in Python that's compatible with the
existing module -- and performed acceptably -- I see no reason it
couldn't replace the current module.  There are already a number
of test cases.  You'd certainly have to embellish them, but if the
current set passed that would be a good indication your code was at
least on the right track compatibility-wise.

There are reasons to desire a Python-based solution other
than Unicode support.  It would be much more likely that such a
module would work with other Python implementations (e.g., PyPy,
IronPython and Jython).

Skip
msg72956 - Author: Mike Statkus (mstatkus) Date: 2008-09-10 11:37
The UnicodeWriter.writerow(self, row) example in section 9.1.5 of the
Python 2.5 Manual (Examples, csv module of the standard library) does not
correctly handle rows that contain int values as well as strings; it
raises an AttributeError.
The first line of code in UnicodeWriter.writerow:

    self.writer.writerow([s.encode("utf-8") for s in row])

calls the .encode() method on s, which might be an int rather than a
string. A simple workaround is:

    self.writer.writerow([unicode(s).encode("utf-8") for s in row])
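As a rough sketch of why the workaround helps, the two list comprehensions can be compared in isolation. This is a Python 3 transliteration, so str stands in for Python 2's unicode, and the function names encode_row_naive and encode_row are hypothetical.

```python
def encode_row_naive(row, encoding="utf-8"):
    # Mirrors the manual's example: assumes every cell has an .encode() method.
    return [cell.encode(encoding) for cell in row]


def encode_row(row, encoding="utf-8"):
    # Mirrors the workaround: convert each cell to text first, then encode.
    # In Python 2 the conversion would be unicode(cell) rather than str(cell).
    return [str(cell).encode(encoding) for cell in row]


row = ["é", 42]  # sample row mixing text and an int

try:
    encode_row_naive(row)
    naive_error = None
except AttributeError as exc:
    # int objects have no .encode() method
    naive_error = exc

encoded = encode_row(row)
```

One caveat with the workaround: every value is stringified, so None would be written as the text "None", and in Python 2 unicode(s) raises UnicodeDecodeError if s is a byte string containing non-ASCII bytes.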
msg95708 - Author: Craig McQueen (cmcqueen1975) Date: 2009-11-25 02:48
Is this still an open bug? I have the following code:

    lookup = {}
    csv_reader = csv.reader(codecs.open(lookup_file_name, 'r', 'utf-8'))
    for row in csv_reader:
        lookup[row[1]] = row[0]

And it "appears to work" (it runs) using Python 2.6.2. So has this bug
been fixed?
msg95709 - Author: Craig McQueen (cmcqueen1975) Date: 2009-11-25 03:07
I think I see now--it accepts Unicode input, but converts it back to
bytes internally using the ASCII codec. So it works as long as the
Unicode input contains only ASCII characters. That's a gotcha.
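The gotcha can be imitated with an explicit ASCII encode. This is only a sketch in Python 3: the function to_bytes_like_py2 is a hypothetical stand-in for the implicit conversion described above, not the csv module's actual code.

```python
def to_bytes_like_py2(text):
    # Assumption: stands in for the implicit unicode -> bytes conversion,
    # which uses the ASCII codec and so rejects any non-ASCII character.
    return text.encode("ascii")


ascii_result = to_bytes_like_py2("plain")  # ASCII-only input passes through

failure = None
try:
    to_bytes_like_py2("café")  # a single non-ASCII character is enough to fail
except UnicodeEncodeError as exc:
    failure = exc
```

So a test file that happens to contain only ASCII will run cleanly, and the error only shows up once real non-ASCII data arrives.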

It appears that it's been fixed in Python 3.x, judging by the documentation.
msg95717 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-11-25 12:17
This is indeed fixed in Python 3.  If someone wishes to step forward
with patches for 2.7, they can reopen this bug, but I don't think it is
worth the effort.
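
For comparison, a sketch of the lookup loop from msg95708 as it might look in Python 3, where csv.reader works on text and the file object does the decoding. The helper name load_lookup, the file name and the sample rows are assumptions made for the example.

```python
import csv
import os
import tempfile


def load_lookup(path, encoding="utf-8"):
    """Build a {second column: first column} mapping from a CSV file."""
    lookup = {}
    # newline="" is what the csv documentation recommends for file objects.
    with open(path, "r", encoding=encoding, newline="") as handle:
        for row in csv.reader(handle):
            lookup[row[1]] = row[0]
    return lookup


# Write a small UTF-8 file with non-ASCII values (made-up data), then read it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lookup.csv")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows([["1", "Zürich"], ["2", "Kraków"]])
    lookup = load_lookup(path)
```

Here the rows come back as str objects, so non-ASCII keys need no extra decoding step.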
History
Date                 User            Action  Args
2009-11-25 12:17:27  r.david.murray  set     status: open -> closed; type: enhancement; nosy: + r.david.murray; messages: + msg95717; resolution: out of date; stage: resolved
2009-11-25 03:07:18  cmcqueen1975    set     messages: + msg95709; versions: + Python 2.6, Python 2.5, Python 2.4, Python 2.7
2009-11-25 02:48:48  cmcqueen1975    set     messages: + msg95708
2009-11-25 02:44:30  cmcqueen1975    set     nosy: + cmcqueen1975
2008-09-10 11:37:47  mstatkus        set     nosy: + mstatkus; messages: + msg72956
2006-11-30 14:46:19  jettlogic       create