Issue 7185: csv reader utf-8 BOM error

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51434

classification

Title:	csv reader utf-8 BOM error
Type:	behavior	Stage:	resolved
Components:	Documentation, Unicode	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	W00D00, doerwalter, georg.brandl, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2009-10-22 10:46 by W00D00, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
distances.csv	W00D00, 2009-10-22 10:46	csv file with BOM

Messages (11)
msg94340	Author: Istvan Szirtes (W00D00)	Date: 2009-10-22 10:46
The csv module fails to correctly read a .csv file that is encoded in UTF-8 with a UTF-8 BOM. The first row in the csv file is ["value","vocal","vocal","vocal","vocal"]; the raw line in the file shows up as:
    ď»ż"value","vocal","vocal","vocal","vocal"
The reader cannot read the first row correctly, and if I seek back to 0 from somewhere in the file I get a result like this:
    ['\ufeff"value"', 'vocal', 'vocal', 'vocal', 'vocal']
I think the csv reader is not correctly seekable. I attached a test file for the bug, and here is my code:
    import codecs
    import csv

    InDistancesFile = codecs.open( '..\\distances.csv', 'r', encoding='utf-8' )
    InDistancesObj = csv.reader( InDistancesFile )
    for Row in InDistancesObj:
        if Row[0] == '20':
            print(Row)
            break
    InDistancesFile.seek(0)
    for Row in InDistancesObj:
        print(Row)
msg94341	Author: Walter Dörwald (doerwalter) *	Date: 2009-10-22 14:03
http://docs.python.org/library/csv.html#module-csv states:
    "This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples. These restrictions will be removed in the future."
msg94345	Author: R. David Murray (r.david.murray) *	Date: 2009-10-22 14:51
The restrictions were theoretically removed in 3.1, and the 3.1 documentation has been updated to reflect that. If the csv module in 3.1 doesn't handle Unicode, then that is a bug.
msg94346	Author: Walter Dörwald (doerwalter) *	Date: 2009-10-22 14:53
Then the solution should simply be to use "utf-8-sig" as the encoding, instead of "utf-8".
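The difference between the two encodings can be seen with a small sketch like the one below. The file contents and the helper name `first_row` are made up for illustration; only `open`, `csv.reader` and the standard "utf-8" / "utf-8-sig" codecs are relied upon.

```python
import csv
import os
import tempfile


def first_row(path, encoding):
    """Return the first CSV row of `path`, decoded with `encoding` (hypothetical helper)."""
    # newline='' is the setting the csv documentation recommends for csv.reader input.
    with open(path, 'r', encoding=encoding, newline='') as f:
        return next(csv.reader(f))


# Build a small sample file that starts with a UTF-8 BOM (illustrative data only).
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', encoding='utf-8-sig', newline='') as f:
    f.write('"value","vocal","vocal"\r\n20,a,b\r\n')

row_plain = first_row(path, 'utf-8')    # the BOM survives as U+FEFF in the first field
row_sig = first_row(path, 'utf-8-sig')  # the codec strips the BOM before csv sees it
os.remove(path)

print(row_plain)
print(row_sig)
```

With plain "utf-8" the BOM character precedes the opening quote, so the csv parser treats the first field as unquoted and keeps the quote characters, which matches the shape of the output in the original report.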
msg94365	Author: R. David Murray (r.david.murray) *	Date: 2009-10-22 16:32
In that case we should update the docs. Istvan, can you confirm that this solves your problem?
msg94403	Author: Istvan Szirtes (W00D00)	Date: 2009-10-24 08:13
Hi everyone,
I have tried "utf-8-sig" and it does not work in this case, or rather I think it is not the csv module that is wrong: seek() does not work correctly on the csv file or object. With "utf-8-sig" the file is opened correctly and the first row no longer has the BOM problem. That is great; I am sorry I did not know about this until now. (I am not a Python expert yet :))
However, I got errors like this 'AFTE\ufeffVALUE".WAV' while running my script. "AFTER" is a valid string in the given csv file, but the BOM follows it. This happens after I seek back to 0 several times in the csv file. And the string "aftevalue" LEAVE_HIGHWAY-E" is produced, which is wrong.
My solution is to convert the csv object into a list after opening the file:
    InDistancesFile = codecs.open( Root, 'r', encoding='utf-8' )
    txt = InDistancesFile.read()[1:]  # drop the BOM
    lines = txt.splitlines()[1:]  # drop the first row, which is a header
    InDistancesObj = list(csv.reader( lines ))  # convert the csv reader object into a simple list
Many thanks for your help,
Istvan
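The same read-once idea can be sketched without slicing off the BOM by hand, assuming the "utf-8-sig" codec is used for decoding. The sample bytes and the helper name `load_rows` are hypothetical and not taken from the attached file.

```python
import codecs
import csv
import io


def load_rows(stream):
    """Read the header and all remaining rows into memory (hypothetical helper)."""
    reader = csv.reader(stream)
    header = next(reader, None)  # first row, or None for an empty file
    return header, list(reader)


# Illustrative bytes that begin with a UTF-8 BOM, standing in for a real file.
data = codecs.BOM_UTF8 + b'"value","vocal"\r\n10,a\r\n20,b\r\n'
stream = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8-sig', newline='')

header, rows = load_rows(stream)
print(header)
print(rows)
```

Once the rows are in a list they can be scanned as many times as needed, so no seek() on the underlying file is required.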
msg159483	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-27 19:00
I checked this. Files opened with "utf-8-sig" are seekable.
    >>> open('test', 'w', encoding='utf-8-sig').write('qwerty\nйцукен\n')
    >>> open('test', 'r', encoding="utf-8").read()
    '\ufeffqwerty\nйцукен\n'
    >>> open('test', 'r', encoding="utf-8-sig").read()
    'qwerty\nйцукен\n'
    >>> with open('test', 'r', encoding="utf-8-sig") as f:
    ...     print(ascii(f.readline()))
    ...     f.seek(0)
    ...     print(ascii(f.readline()))
    ...
    'qwerty\n'
    0
    'qwerty\n'
Should this issue be closed?
msg159490	Author: R. David Murray (r.david.murray) *	Date: 2012-04-27 20:12
Serhiy, the bug is about csv in particular. Can you confirm that using utf-8-sig allows one to process a file with a BOM using the csv module?
msg159494	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-27 21:26
I ran the script above (only replacing 'utf-8' with 'utf-8-sig') and did not see anything strange. I looked at the source (csv.py and _csv.c) and also did not see anything that could lead to this effect. If the bug exists, it is in the utf-8-sig codec and should show up in other cases too. There is nothing special about csv here.
msg159506	Author: R. David Murray (r.david.murray) *	Date: 2012-04-28 00:04
I wasn't sure which script you were referring to, so I checked it myself and got the same results as you: after the seek(0) on the file object opened with utf-8-sig, csv read all the lines in the file, including reading the header line correctly. So, let's close this.
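A check along these lines can be sketched as follows. It mirrors the structure of the script in msg94340 but uses the built-in open() with "utf-8-sig" and invented sample data, so it is an approximation of the test described above rather than the exact script that was run.

```python
import csv
import os
import tempfile

# Write an illustrative CSV file that starts with a UTF-8 BOM.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', encoding='utf-8-sig', newline='') as f:
    f.write('"value","vocal"\r\n10,a\r\n20,b\r\n30,c\r\n')

with open(path, 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if row[0] == '20':  # stop partway through the file
            break
    f.seek(0)  # rewind the underlying file; the codec state is reset as well
    rows_after_seek = list(reader)  # keep using the same reader object

os.remove(path)
print(rows_after_seek)
```

If the header comes back without a leading '\ufeff' after the rewind, the codec has stripped the BOM again and the csv reader has no stale state of its own.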
msg159509	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-28 05:43
I was referring to the script inlined in the message http://bugs.python.org/issue7185#msg94340 .

History
Date	User	Action	Args
2022-04-11 14:56:54	admin	set	github: 51434
2012-04-28 05:43:56	serhiy.storchaka	set	messages: + msg159509
2012-04-28 00:04:08	r.david.murray	set	status: open -> closed resolution: not a bug messages: + msg159506 stage: needs patch -> resolved
2012-04-27 21:26:06	serhiy.storchaka	set	messages: + msg159494
2012-04-27 20:12:25	r.david.murray	set	messages: + msg159490
2012-04-27 19:00:45	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg159483
2010-10-29 10:07:21	admin	set	assignee: georg.brandl -> docs@python
2010-05-20 20:32:22	skip.montanaro	set	nosy: - skip.montanaro
2009-10-24 08:13:54	W00D00	set	messages: + msg94403
2009-10-22 18:05:16	skip.montanaro	set	nosy: + skip.montanaro
2009-10-22 16:32:39	r.david.murray	set	assignee: georg.brandl components: + Documentation versions: + Python 3.2 nosy: + georg.brandl messages: + msg94365 stage: test needed -> needs patch
2009-10-22 14:53:53	doerwalter	set	messages: + msg94346
2009-10-22 14:51:15	r.david.murray	set	priority: normal nosy: + r.david.murray messages: + msg94345 type: compile error -> behavior stage: test needed
2009-10-22 14:03:53	doerwalter	set	nosy: + doerwalter messages: + msg94341
2009-10-22 10:46:05	W00D00	create