classification
Title: csv reader utf-8 BOM error
Type: behavior Stage: resolved
Components: Documentation, Unicode Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: docs@python Nosy List: W00D00, doerwalter, georg.brandl, r.david.murray, serhiy.storchaka
Priority: normal Keywords:

Created on 2009-10-22 10:46 by W00D00, last changed 2012-04-28 05:43 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
distances.csv W00D00, 2009-10-22 10:46 csv file with BOM
Messages (11)
msg94340 - (view) Author: Istvan Szirtes (W00D00) Date: 2009-10-22 10:46
The csv module tries to read a .csv file that is encoded in UTF-8 with a 
UTF-8 BOM. 

The first row in the csv file is 
["value","vocal","vocal","vocal","vocal"]

in hex:
"value","vocal","vocal","vocal","vocal"

The reader cannot read the first row correctly, and if I seek back 
to 0 somewhere in the file I get a result like this:

['\ufeff"value"', 'vocal', 'vocal', 'vocal', 'vocal']

I think the csv reader does not handle seeking correctly.

I attached a test file for the bug and here is my code:

import codecs
import csv

InDistancesFile = codecs.open( '..\\distances.csv', 'r', encoding='utf-8' )
InDistancesObj = csv.reader( InDistancesFile )

for Row in InDistancesObj:
    if Row[0] == '20':
        print(Row)
        break

InDistancesFile.seek(0)

for Row in InDistancesObj:
    print(Row)
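
The first-row symptom can be reproduced without the attached file. This is only a minimal sketch: the sample data and the temporary file are hypothetical stand-ins for distances.csv, and it uses the built-in open() rather than codecs.open().

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the attached distances.csv.
data = '"value","vocal","vocal"\r\n"20","a","b"\r\n'

# Write the file with a UTF-8 BOM (bytes EF BB BF) in front of the content.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\xef\xbb\xbf" + data.encode("utf-8"))
    path = tmp.name

# Plain "utf-8" decodes the BOM as the character U+FEFF and keeps it in the text,
# so the first field no longer starts with a quote character.
with open(path, "r", encoding="utf-8", newline="") as f:
    first_row = next(csv.reader(f))

os.remove(path)
print(first_row)
```

Because the BOM character comes before the opening quote, the reader treats the first field as unquoted and keeps the quote characters as data.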
msg94341 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-10-22 14:03
http://docs.python.org/library/csv.html#module-csv states:

This version of the csv module doesn’t support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should be UTF-8 or printable ASCII to be safe;
see the examples in section Examples. These restrictions will be removed
in the future.
msg94345 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-22 14:51
The restrictions were theoretically removed in 3.1, and the 3.1
documentation has been updated to reflect that.  If 3.1 CSV doesn't
handle unicode, then that is a bug.
msg94346 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-10-22 14:53
Then the solution should simply be to use "utf-8-sig" as the encoding,
instead of "utf-8".
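
A minimal sketch of that suggestion, assuming a small hypothetical file written with a BOM and the built-in open() (the original script used codecs.open()). It reads the file, seeks back to 0, and reads it again:

```python
import csv
import os
import tempfile

# Hypothetical sample data; the real file is the attached distances.csv.
data = '"value","vocal"\r\n"10","a"\r\n"20","b"\r\n'

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\xef\xbb\xbf" + data.encode("utf-8"))
    path = tmp.name

# "utf-8-sig" consumes a leading BOM if one is present.
# newline="" is what the csv documentation recommends for files passed to csv.reader.
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    first_pass = list(csv.reader(f))
    f.seek(0)                          # rewind the underlying file
    second_pass = list(csv.reader(f))  # fresh reader for the second pass

os.remove(path)
print(first_pass == second_pass, first_pass[0])
```

Creating a new reader after the seek is a choice made for this sketch; it keeps per-reader state such as line_num consistent with the second pass.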
msg94365 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-22 16:32
In that case we should update the docs.  Istvan, can you confirm that
this solves your problem?
msg94403 - (view) Author: Istvan Szirtes (W00D00) Date: 2009-10-24 08:13
Hi Everyone,

I have tried "utf-8-sig" and it does not work in this case, or 
rather I think it is not the csv module that is wrong. seek() does not work 
correctly on the csv file or object.

With "utf-8-sig" the file is opened correctly and the first row no longer 
has the BOM problem. That is great. 
I am sorry I did not know about this until now. (I am not a Python expert 
yet :))

However, I got some wrong values like 'AFTE\ufeffVALUE".WAV' 
while running my script.

"AFTER" is a valid string in the given csv file, but the BOM follows it.
This happens after I seek back to "0" several times in the csv file.
And the string "aftevalue" LEAVE_HIGHWAY-E" is produced, which is wrong.

My solution is to convert the csv object into a list after opening 
the file:

        InDistancesFile = codecs.open( Root, 'r', encoding='utf-8' )
        txt = InDistancesFile.read()[1:] # drop the BOM
        lines = txt.splitlines()[1:] # drop the first row, which is a header
        InDistancesObj = list(csv.reader( lines )) # convert the csv reader object into a simple list

Many thanks for your help,
Istvan
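
The same idea (read everything once, then work on a list instead of seeking) can be sketched without slicing off the BOM by hand. This is an illustration only: the in-memory bytes are hypothetical sample data, and "utf-8-sig" replaces the [1:] slice.

```python
import csv
import io

# Hypothetical in-memory stand-in for the file: BOM, header row, two data rows.
raw = b"\xef\xbb\xbf" + '"value","vocal"\r\n"10","a"\r\n"20","b"\r\n'.encode("utf-8")

with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # consume the header row
    rows = list(reader)    # remaining rows as a plain list

# The list can be scanned as often as needed, with no seek() involved.
matches = [row for row in rows if row[0] == "20"]
print(header, matches)
```

Unlike splitlines(), passing the file object to csv.reader also keeps quoted fields that contain embedded newlines intact.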
msg159483 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-27 19:00
I checked. Files opened with "utf-8-sig" are seekable.

>>> open('test', 'w', encoding='utf-8-sig').write('qwerty\nйцукен\n')
>>> open('test', 'r', encoding="utf-8").read()
'\ufeffqwerty\nйцукен\n'
>>> open('test', 'r', encoding="utf-8-sig").read()
'qwerty\nйцукен\n'
>>> with open('test', 'r', encoding="utf-8-sig") as f:
...     print(ascii(f.readline()))
...     f.seek(0)
...     print(ascii(f.readline()))
... 
'qwerty\n'
0
'qwerty\n'


Should this issue be closed?
msg159490 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-27 20:12
Serhiy, the bug is about csv in particular.  Can you confirm that using utf-8-sig allows one to process a file with a bom using the csv module?
msg159494 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-27 21:26
I ran the script above (only replacing 'utf-8' with 'utf-8-sig') and did not see anything strange. I looked at the source (csv.py and _csv.c) and also did not see anything that could lead to this effect. If the bug exists, it is in the utf-8-sig codec and should show up in other cases too. There is nothing special about csv here.
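
For reference, the difference between the two codecs can be seen at the bytes level with no csv involved. A small sketch, using the same sample string as the earlier session:

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 signature EF BB BF.
bom_bytes = codecs.BOM_UTF8 + "qwerty\n".encode("utf-8")
plain_bytes = "qwerty\n".encode("utf-8")

with_bom_utf8 = bom_bytes.decode("utf-8")     # BOM kept as U+FEFF
with_bom_sig = bom_bytes.decode("utf-8-sig")  # BOM stripped
plain_sig = plain_bytes.decode("utf-8-sig")   # no BOM present: decoded unchanged

print(ascii(with_bom_utf8), ascii(with_bom_sig), ascii(plain_sig))
```

So "utf-8-sig" is also safe for input that has no BOM at all.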
msg159506 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-28 00:04
I wasn't sure which script you were referring to, so I checked it myself and got the same results as you: after the seek(0) on the file object opened with utf-8-sig, csv read all the lines in the file, including reading the header line correctly.

So, let's close this.
msg159509 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-28 05:43
I was referring to the script inlined in the message http://bugs.python.org/issue7185#msg94340 .
History
Date User Action Args
2012-04-28 05:43:56 serhiy.storchaka set messages: + msg159509
2012-04-28 00:04:08 r.david.murray set status: open -> closed
resolution: not a bug
messages: + msg159506
stage: needs patch -> resolved
2012-04-27 21:26:06 serhiy.storchaka set messages: + msg159494
2012-04-27 20:12:25 r.david.murray set messages: + msg159490
2012-04-27 19:00:45 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg159483
2010-10-29 10:07:21 admin set assignee: georg.brandl -> docs@python
2010-05-20 20:32:22 skip.montanaro set nosy: - skip.montanaro
2009-10-24 08:13:54 W00D00 set messages: + msg94403
2009-10-22 18:05:16 skip.montanaro set nosy: + skip.montanaro
2009-10-22 16:32:39 r.david.murray set assignee: georg.brandl
components: + Documentation
versions: + Python 3.2
nosy: + georg.brandl
messages: + msg94365
stage: test needed -> needs patch
2009-10-22 14:53:53 doerwalter set messages: + msg94346
2009-10-22 14:51:15 r.david.murray set priority: normal
nosy: + r.david.murray
messages: + msg94345
type: compile error -> behavior
stage: test needed
2009-10-22 14:03:53 doerwalter set nosy: + doerwalter
messages: + msg94341
2009-10-22 10:46:05 W00D00 create