classification
Title: io.TextIOWrapper does not handle UTF-8 encoded streams correctly
Type: behavior Stage: resolved
Components: IO Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Christoph.Rauch, r.david.murray
Priority: normal Keywords:

Created on 2013-01-31 15:28 by Christoph.Rauch, last changed 2013-01-31 15:42 by Christoph.Rauch. This issue is now closed.

Files
File name Uploaded Description
utf-8-encoded.csv Christoph.Rauch, 2013-01-31 15:28 csv test-file for utf8 io.TextIOWrapper problem
Messages (3)
msg181028 - Author: Christoph Rauch (Christoph.Rauch) Date: 2013-01-31 15:28
I have uncovered a strange behavior in io.TextIOWrapper which I think is a bug.

#!/usr/bin/env python
# encoding: utf-8

import csv
import io

raw_file = io.FileIO('utf-8-encoded.csv', 'rb')
stream = io.BufferedReader(raw_file)
stream = io.TextIOWrapper(stream, encoding="UTF-8")
reader = csv.reader(stream, delimiter=";")

cells = 0 

for row in reader:
    # Cells should contain 4 Unicode characters.
    assert all([len(cell.decode('utf-8')) == 4 for cell in row]), row 
    cells += len(row)

assert cells == 210, cells

This produces a not very useful traceback:

Traceback (most recent call last):
  File "utf8-textio-test.py", line 15, in <module>
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

The only way to keep it from crashing is to set encoding to ascii and errors to ignore, but that drops every character with ord > 127, which is clearly not useful either. I hope this behavior is not intended.

I have attached a file with which to test this problem.
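
To illustrate the data loss described above, here is a minimal sketch in Python 3 syntax. The sample bytes are made up for illustration and are not taken from the attached file.

```python
import io

# Hypothetical sample line: four non-ASCII characters, a separator, four ASCII characters.
data = "\u00e4\u00f6\u00fc\u00df;abcd\n".encode("utf-8")

# Decoding as ASCII with errors="ignore" silently drops every byte >= 0x80.
lossy = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", errors="ignore").read()

# Decoding as UTF-8 keeps the text intact.
intact = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").read()

print(repr(lossy))
print(repr(intact))
```

The first cell disappears entirely in the lossy variant, so this setting only hides the error rather than fixing it.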
msg181030 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-01-31 15:37
As noted in the documentation, the csv module in 2.7 does not handle unicode. You'll have to switch to Python 3 if you want unicode support in csv.
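
For reference, a minimal sketch of how the same read could look under Python 3, where csv.reader accepts text streams. The in-memory data is a hypothetical stand-in for the attached file, and newline="" follows the csv documentation's recommendation for file objects passed to the reader.

```python
import csv
import io

# Hypothetical stand-in for utf-8-encoded.csv: two rows of
# semicolon-separated cells, each cell four characters long.
raw = io.BytesIO(
    "\u00e4\u00f6\u00fc\u00df;abcd\n\u00e9\u00e8\u00ea\u00eb;wxyz\n".encode("utf-8")
)

# newline="" leaves line-ending handling to the csv module.
stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")

rows = list(csv.reader(stream, delimiter=";"))

# In Python 3 the cells are already str, so no .decode() call is needed.
assert all(len(cell) == 4 for row in rows for cell in row), rows
cells = sum(len(row) for row in rows)
print(cells)
```

Because the cells arrive as text, the length check counts characters rather than bytes.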
msg181031 - Author: Christoph Rauch (Christoph.Rauch) Date: 2013-01-31 15:42
Thanks for the information. Will work around that. Misdiagnosed the problem.
History
Date User Action Args
2013-01-31 15:42:47 Christoph.Rauch set messages: + msg181031
2013-01-31 15:37:41 r.david.murray set status: open -> closed; nosy: + r.david.murray; messages: + msg181030; resolution: not a bug; stage: resolved
2013-01-31 15:28:27 Christoph.Rauch create