Issue 4847: csv fails when file is opened in binary mode

Created on 2009-01-05 17:03 by jaywalker, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

File name | Uploaded | Description
---|---|---
csv_doc.patch | vstinner, 2009-01-06 01:31 |
csv.diff | jdwhitley, 2009-03-09 05:11 | Patch against python 3.1a1
issue4847-doc.patch | r.david.murray, 2009-04-02 03:52 |

Messages (33)

msg79165 - Author: jaywalker | Date: 2009-01-05 17:03

The following code from the documentation fails:

    import csv
    reader = csv.reader(open("eggs.csv", "rb"))
    for row in reader:
        print(row)

The output is:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 3, in <module>
    _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

It works as expected in Python 2.6.

msg79166 - Author: STINNER Victor (vstinner) | Date: 2009-01-05 17:05

Do you expect to be able to read CSV as bytes or just to fix the documentation example?

msg79170 - Author: jaywalker | Date: 2009-01-05 17:19

In an unlikely scenario: say one of the fields has an embedded \r. For instance, "blahblah\r" is the value of the first column. Now open this file in text mode. What happens to this '\r' even before csv.reader sees it? If it remains intact, no problem. If it is converted to a newline on any platform, and that newline is not exactly \r, does this not introduce a problem? Just curious... Thanks.

msg79171 - Author: STINNER Victor (vstinner) | Date: 2009-01-05 17:23

> say one of the fields has an embedded \r. For instance "blahblah\r" is the
> value of the first column. Now open this file in text mode. What happens to
> this '\r' even before csv.reader sees it?

I have rarely used the CSV format, but it sounds strange to have a newline character in a column. Newline characters (\r and \n) are reserved to mark the end of the line. Can you produce such a file to test? :-) I guess that the csv module does something like readline().split(";").

msg79172 - Author: jaywalker | Date: 2009-01-05 17:29

Make it '\r\n', if you want. Such files can easily be generated by Windows text editors, or even Linux ones nowadays. Upon reading, if the file is opened in text mode, this will probably be converted to \n even on Linux by Python (I may be wrong). Thus, \r gets lost. Not a biggie, but you never know ;0)

msg79174 - Author: Antoine Pitrou (pitrou) | Date: 2009-01-05 17:47

You can avoid the newline translation problem by using the newline parameter in open(). Set it to '' (the empty string) and any CR and LF characters should remain intact.

As for the original problem, IMHO it is a documentation bug.

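A minimal sketch of the approach Antoine describes, applied to the eggs.csv example from the original report:

    import csv

    # Open in text mode with newline='' so the io layer does not translate
    # \r or \r\n inside quoted fields; the csv module handles line endings.
    with open("eggs.csv", newline="") as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)
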
msg79176 - Author: jaywalker | Date: 2009-01-05 17:51

I think what you suggest makes most sense. Thanks.

msg79224 - Author: STINNER Victor (vstinner) | Date: 2009-01-06 01:31

Short patch fixing the examples and the description of reader() and writer() (it removes the sentence about the "b" flag: that sentence is wrong, we need unicode!). I hope that I didn't break the documentation (I tried "make text" and didn't get any warnings).

msg82661 - Author: John Machin (sjmachin) | Date: 2009-02-24 07:25

Sorry, folks, we've got an understanding problem here. CSV files are typically NOT created by text editors. They are created e.g. by "save as csv" from a spreadsheet program, or as an output option by some database query program. They can have just about any character in a field, including \r and \n. Fields containing those characters should be quoted (just like a comma) by the csv file producer. A csv reader should be capable of reproducing the original field division. Here for example is a dump of a little file I just created using Excel 2003:

    C:\devel\csv>\python26\python -c "print repr(open('book1.csv','rb').read())"
    'Field1,"Field 2 has a\nvery long\nheading",Field3\r\n1.11,2.22,3.33\r\n'

Inserting \n into a text field in Excel (using Alt-Enter) is a well-known user trick. Here's what we get from Python 2.6.1:

    C:\devel\csv>\python26\python -c "import csv; print repr(list(csv.reader(open('book1.csv','rb'))))"
    [['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11', '2.22', '3.33']]

and the same by design all the way back to Python 2.3's csv module and its ancestor, the ObjectCraft csv module. However with Python 3.0.1 we get:

    C:\devel\csv>\python30\python -c "import csv; print(repr(list(csv.reader(open('book1.csv','rb')))))"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

This sentence in the documentation is NOT an error: """If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference."""

The problem *IS* a "biggie". This paragraph in the documentation (evidently introduced in 2.5) is rather confusing:

"""The parser is quite strict with respect to multi-line quoted fields. Previously, if a line ended within a quoted field without a terminating newline character, a newline would be inserted into the returned field. This behavior caused problems when reading files which contained carriage return characters within fields. The behavior was changed to return the field without inserting newlines. As a consequence, if newlines embedded within fields are important, the input should be split into lines in a manner which preserves the newline characters."""

Some examples of what it is talking about would be a very good idea.

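A small round-trip sketch of the quoting behaviour John describes (the file name embedded.csv is just for illustration): with the default QUOTE_MINIMAL dialect the writer quotes any field containing a line-ending character, so the reader can reproduce the original field division.

    import csv

    # Write a row whose first field contains embedded newlines; the writer
    # quotes that field automatically.
    with open("embedded.csv", "w", newline="") as f:
        csv.writer(f).writerow(["Field 2 has a\nvery long\nheading", "3.33"])

    # Read it back; the reader reconstructs the original fields.
    with open("embedded.csv", newline="") as f:
        rows = list(csv.reader(f))

    assert rows == [["Field 2 has a\nvery long\nheading", "3.33"]]
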
msg83356 - Author: Jervis Whitley (jdwhitley) | Date: 2009-03-09 05:11

Hi all,

This patch takes the approach of assuming utf-8 encoding for files opened with the 'rb' directive. That is:

1. Check if each line is a Unicode or Bytes type.
2. If Bytes, get a char array reference to the internal buffer.
3. Use PyUnicode_FromString to create a new unicode object from the char* - this step assumes UTF-8.
4. Get a Py_UNICODE reference to the internal unicode object buffer and continue as before.

Is this in the right direction at all?

Cheers,

Jervis

msg83357 - Author: John Machin (sjmachin) | Date: 2009-03-09 06:02

Before patching, could we discuss the requirements? There are two different concepts:

(1) "text" file (assume that CR and/or LF are line terminators, and provide methods for accessing a line at a time) versus "binary" file (no such assumptions, no such access)

(2) reading the file as a raw undecoded "bytes" file or as a decoded "str" file.

Options for 3.X:

(1) caller uses mode 'rb', is given bytes objects back.

(2) caller uses mode 'rt' and provides an encoding, is given str objects back.

IMPORTANT: Option 2 must NOT read the file as a collection of "lines"; it must process it (conceptually at least) a character at a time so that embedded CR and/or LF are not taken to be row terminators.

Following the line that 3.X should do what's best, not what we used to do, the implication is that we choose option 2.

msg83359 - Author: John Machin (sjmachin) | Date: 2009-03-09 06:49

... and it looks like Option 2 might already *almost* be in place. Continuing with the previous example (book1.csv has embedded lone LFs):

    C:\devel\csv>\python30\python -c "import csv; print(repr(list(csv.reader(open('book1.csv','rt', encoding='ascii')))))"
    [['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11', '2.22', '3.33']]

Looks good. However consider book2.csv, which has embedded CRLFs:

    C:\devel\csv>\python30\python -c "print(repr(open('book2.csv', 'rb').read()))"
    b'Field1,"Field 2 has a\r\nvery long\r\nheading",Field3\r\n1.11,2.22,3.33\r\n'

This gives:

    C:\devel\csv>\python30\python -c "import csv; print(repr(list(csv.reader(open('book2.csv','rt', encoding='ascii')))))"
    [['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11', '2.22', '3.33']]

Not good. It should preserve ALL characters in the field.

msg83368 - Author: Antoine Pitrou (pitrou) | Date: 2009-03-09 10:39

> Not good. It should preserve ALL characters in the field.

Please look at the doc for open() and io.TextIOWrapper. The `newline` parameter defaults to None, which means universal newlines with newline translation. Setting it to '' (yes, the empty string) enables universal newlines but disables newline translation (that is, it will split lines on all of ['\n', '\r', '\r\n'], but will leave these newlines intact rather than converting them to '\n').

However, I think csv should accept files opened in binary mode and be able to deal with line endings itself. How am I supposed to know the encoding of a CSV file? Surely Excel uses a defined, default encoding when exporting to CSV... that knowledge should be embedded in the csv module.

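A short demonstration of the difference Antoine describes, recreating John's book2.csv contents (file name and data are illustrative only):

    import csv

    data = b'Field1,"Field 2 has a\r\nvery long\r\nheading",Field3\r\n1.11,2.22,3.33\r\n'
    with open("book2.csv", "wb") as f:
        f.write(data)

    # newline=None (the default): universal newlines WITH translation, so the
    # \r\n sequences inside the quoted field come back as \n.
    with open("book2.csv", "rt", encoding="ascii") as f:
        print(repr(list(csv.reader(f))[0][1]))  # 'Field 2 has a\nvery long\nheading'

    # newline='': universal newlines WITHOUT translation, so the embedded
    # \r\n sequences are preserved.
    with open("book2.csv", "rt", encoding="ascii", newline="") as f:
        print(repr(list(csv.reader(f))[0][1]))  # 'Field 2 has a\r\nvery long\r\nheading'
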
msg83370 - Author: John Machin (sjmachin) | Date: 2009-03-09 11:32

pitrou> Please look at the doc for open() and io.TextIOWrapper. The `newline`
pitrou> parameter defaults to None, which means universal newlines with newline
pitrou> translation. Setting to '' (yes, the empty string) enables universal
pitrou> newlines but disables newline translation ...

I had already read it. I gave it a prize for "least intuitive arg in the language". So you plan to use that, reading "lines" instead of blocks? You'll still have to examine which CRs and LFs are embedded and which are line terminators. You might just as well use f.read(BLOCKSZ) and avoid having to insist that the user explicitly write ", newline=''".

pitrou> However, I think csv should accept files opened in binary mode and be
pitrou> able to deal with line endings itself. How am I supposed to know the
pitrou> encoding of a CSV file? Surely Excel uses a defined, default encoding
pitrou> when exporting to CSV... that knowledge should be embedded in the csv
pitrou> module.

Excel has no default, because the user has no option -- the defined encoding is "cp" + str(codepage_number_derived_from_locale), e.g. "cp1252". Likewise other software writing delimited data to text files will use (one of) the local legacy encoding(s). So either:

(i) mode='rb' and no encoding => the caller gets bytes back and needs to do their own decoding, or

(ii) mode='rb' and an encoding [which looks rather daft and is currently not possible] => the caller gets str objects.

Both of these are ugly -- hence my preference for the mode='rt' variety of solution. Do we really want the double hassle of both a str csv implementation and a bytes csv implementation?

msg83372 - Author: Antoine Pitrou (pitrou) | Date: 2009-03-09 11:40

> I had already read it. I gave it a prize for "least intuitive arg in the
> language".

Please open a bug, then :)

> So you plan to use that, reading "lines" instead of blocks?
> You'll still have to examine which CRs and LFs are embedded and which
> are line terminators. You might just as well use f.read(BLOCKSZ) and
> avoid having to insist that the user explicitly write ", newline=''".

Sorry, but who is "you" in that paragraph? The csv module currently accepts any iterator yielding lines of text, not only file objects. Changing this would be a major compatibility break.

> Excel has no default, because the user has no option -- the defined
> encoding is "cp" + str(codepage_number_derived_from_locale), e.g.
> "cp1252".

Then Excel-generated CSV files all use different encodings? Gasp :-(

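For anyone following along, "any iterator yielding lines of text" means that plain lists, generators, and io.StringIO objects all work as input to csv.reader, for example:

    import csv
    import io

    # Any iterable of text lines works: a list, a generator, io.StringIO, ...
    lines = ['a,"b\r\nc",d\r\n', '1,2,3\r\n']
    print(list(csv.reader(lines)))
    # [['a', 'b\r\nc', 'd'], ['1', '2', '3']]

    print(list(csv.reader(io.StringIO("x,y\r\n1,2\r\n", newline=""))))
    # [['x', 'y'], ['1', '2']]
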
msg83381 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-09 14:08

John> Options for 3.X:
John> (1) caller uses mode 'rb', is given bytes objects back.
John> (2) caller uses mode 'rt' and provides an encoding, is given str
John> objects back.

I believe #1 is the only correct option. If you open a file in text mode the low-level file reading code will unify all line endings to \n. I don't believe it's possible to override that behavior.

Skip

msg83382 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-09 14:14

Antoine> However, I think csv should accept files opened in binary mode
Antoine> and be able to deal with line endings itself. How am I supposed
Antoine> to know the encoding of a CSV file? Surely Excel uses a
Antoine> defined, default encoding when exporting to CSV... that
Antoine> knowledge should be embedded in the csv module.

In fact, the csv module actually requires files to be opened in binary mode but doesn't enforce that. That it works in all but a few corner cases is because so few CSV files contain fields with embedded newlines.

Why doesn't open() allow you to specify an encoding when opening in binary mode? Even though it would be unused by the File object, it would serve as an annotation for code using that open File object so they can do the appropriate bytes-to-unicode conversion.

Skip

msg83383 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-09 14:18

Antoine> .... The csv module currently accepts any iterator yielding
Antoine> lines of text, not only file objects....

If the iterator yields bytes and had an encoding attribute, that would allow the csv reader to perform the necessary conversion. It would be incumbent upon the user to specify the encoding.

Skip

msg83385 - Author: Antoine Pitrou (pitrou) | Date: 2009-03-09 14:40

[newlines translation]
> I don't
> believe it's possible to override that behavior.

It is, see my comments above.

> Why doesn't open() allow you to specify an encoding when opening in binary
> mode?

It would be more logical to pass the encoding argument to the csv reader object instead.

msg83388 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-09 14:59

Antoine> It would be more logical to pass the encoding argument to the
Antoine> csv reader object instead.

What should be the default?

msg83391 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-09 15:22

me> What should be the default?

Scratch that. If the iterator passed to csv.reader is in a mode which will cause it to emit bytes instead of unicode objects, the caller must give an encoding. The csv.reader code will then perform the necessary bytes-to-unicode conversion. If bytes are returned by the iterator but no encoding was given, it should raise an exception (something standard? something new?).

Skip

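The encoding argument discussed here was never actually added to csv.reader; the following wrapper (hypothetical names, written only to illustrate Skip's idea) shows how a bytes iterator could be decoded before the csv parser sees it:

    import csv

    def decoded_reader(byte_lines, encoding, **fmtparams):
        """Illustrative only: decode a bytes iterator, then hand the
        resulting text lines to csv.reader."""
        if encoding is None:
            raise ValueError("an encoding is required for a bytes iterator")
        return csv.reader((line.decode(encoding) for line in byte_lines),
                          **fmtparams)

    # Example: a file assumed to come from Excel on a Western-European locale.
    with open("legacy.csv", "rb") as f:
        for row in decoded_reader(f, "cp1252"):
            print(row)
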
msg83424 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-03-10 10:42

This issue seems to have simply been overlooked when 3.0 was released. It should be fixed in the next round of 3.0 and 3.1 updates.

Any feedback on the idea that the csv.reader constructor (and probably the DictReader and proposed NamedTupleReader constructors) should take an optional encoding argument? In fact, the writers should as well, which would inform the writer how to encode the output when writing.

msg85048 - Author: Guido van Rossum (gvanrossum) | Date: 2009-04-01 16:50

I think it's good if it allowed passing in a binary file and an encoding, but I think it would be crazy if it wouldn't also take a text file. After all, the primary purpose of a CSV file, edge cases notwithstanding, is to look like text to the end user in a text editor (as opposed to Excel spreadsheets, which look like binary gobbledygook unless opened with Excel).

Are there any use cases for returning bytes strings instead of text strings? If so, that should probably be another flag.

msg85108 - Author: R. David Murray (r.david.murray) | Date: 2009-04-01 23:02

I've added some unit tests for embedded newlines, and py3k csv passes (on linux at least) when newline='' is used. Unless someone can provide a test case that fails when newline='' is used, I propose we fix the documentation and leave the code alone.

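A test in the spirit of what David describes (an illustrative sketch, not the actual code added to Lib/test/test_csv.py): rows containing embedded \n and \r\n are written with csv.writer and read back with newline='', and must round-trip unchanged.

    import csv
    import tempfile
    import unittest


    class RoundTripTest(unittest.TestCase):
        def test_embedded_newlines_roundtrip(self):
            rows = [["a\nb", "b"], ["c", "x\r\nd"]]
            # newline='' on the file object leaves the embedded newlines alone.
            with tempfile.TemporaryFile("w+", newline="") as f:
                csv.writer(f).writerows(rows)
                f.seek(0)
                self.assertEqual(list(csv.reader(f)), rows)


    if __name__ == "__main__":
        unittest.main()
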
msg85152 - Author: R. David Murray (r.david.murray) | Date: 2009-04-02 03:23

I'm attaching a proposed doc patch for comment. I replace mentions of 'rb' with "newline=''", including in the examples. I also deleted the unicode discussion (since CSV obviously handles unicode now) as well as the extensive unicode examples that are no longer relevant. (There are a couple of other tweaks I made as I went along, but nothing substantial.)

msg85229 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-04-02 17:55

David> I've added some unit tests for embedded newlines, and py3k csv
David> passes (on linux at least) when newline='' is used. Unless
David> someone can provide a test case that fails when newline='' is
David> used, I propose we fix the documentation and leave the code
David> alone.

This thread is getting a bit long. Can someone summarize how the expected usage of the csv module is supposed to change? If I read things correctly, instead of requiring (in the general case) that csv files be opened in binary mode, the requirement will be that they be opened with newline=''. This will thwart any attempts by the io module at newline translation, but since the file is still opened in text mode its contents will implicitly be Unicode (or Unicode translated to bytes with a specific encoding). That encoding will also be specified in the call to open().

Is this about correct? Do any test cases need to be updated or added? I notice that something called BytesIO is imported from io but not used. Were some test cases removed which used to involve that class, or is that a 2to3 artifact?

Skip

msg85234 - Author: Skip Montanaro (skip.montanaro) | Date: 2009-04-02 18:29

David> I also deleted the unicode discussion (since CSV obviously
David> handles unicode now) ...

Maybe there should be a simple example showing use of the encoding parameter to open() to encode Unicode on write and decode to Unicode on read?

Skip

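An example along the lines Skip suggests (file name and the choice of utf-8 are illustrative only): the encoding is handled entirely by open(), while the csv module deals only with str objects.

    import csv

    rows = [["name", "city"], ["Åsa", "Malmö"]]

    # Encode on write: open() handles the str -> bytes conversion.
    with open("people.csv", "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    # Decode on read: open() hands already-decoded str lines to csv.reader.
    with open("people.csv", encoding="utf-8", newline="") as f:
        assert list(csv.reader(f)) == rows
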
msg85238 - Author: R. David Murray (r.david.murray) | Date: 2009-04-02 18:41

On Thu, 2 Apr 2009 at 17:55, Skip Montanaro wrote:
> This thread is getting a bit long. Can someone summarize how the expected
> usage of the csv module is supposed to change? If I read things correctly,
> instead of requiring (in the general case) that csv files be opened in
> binary mode, the requirement will be that they be opened with newline=''.
> This will thwart any attempts by the io module at newline translation, but
> since the file is still opened in text mode its contents will implicitly be
> Unicode (or Unicode translated to bytes with a specific encoding). That
> encoding will also be specified in the call to open().

I believe that is an accurate summary.

> Is this about correct? Do any test cases need to be updated or added? I
> notice that something called BytesIO is imported from io but not used. Were
> some test cases removed which used to involve that class or is that a 2to3
> artifact?

I will look into this, and add an encoding example to the docs as you suggest in another email.

--David

msg85336 - Author: Benjamin Peterson (benjamin.peterson) | Date: 2009-04-03 21:56

So this is a doc bug? If so, need it still block tomorrow's release?

msg85352 - Author: R. David Murray (r.david.murray) | Date: 2009-04-04 00:25

Nope. Sorry I forgot to change the priority.

msg85362 - Author: R. David Murray (r.david.murray) | Date: 2009-04-04 01:31

> Is this about correct? Do any test cases need to be updated or added? I
> notice that something called BytesIO is imported from io but not used. Were
> some test cases removed which used to involve that class or is that a 2to3
> artifact?

The set of test cases is almost exactly parallel between 2.7 and 3.1. 3.1 adds an additional DictReader test, and refactors one set of tests to make the code simpler. Other than that, it is just a matter of replacing the file opening and closing code that uses binary mode with 'with' statements where the open uses newline=''. BytesIO is not used, so it would appear to be a conversion artifact.

msg85363 - Author: R. David Murray (r.david.murray) | Date: 2009-04-04 01:33

Oh, yeah, other 3.1 differences are that the unicode test is uncommented and updated, and a test is added to make sure nulls are handled correctly.

msg85365 - Author: R. David Murray (r.david.murray) | Date: 2009-04-04 01:48

Doc patch applied in r71116.

History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:56:43 | admin | set | github: 49097
2009-04-05 16:28:33 | georg.brandl | link | issue5455 superseder
2009-04-04 01:48:02 | r.david.murray | set | status: open -> closed; resolution: fixed; messages: + msg85365
2009-04-04 01:33:54 | r.david.murray | set | messages: + msg85363
2009-04-04 01:31:38 | r.david.murray | set | messages: + msg85362
2009-04-04 00:25:50 | r.david.murray | set | priority: release blocker -> normal; messages: + msg85352
2009-04-03 21:56:45 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg85336
2009-04-02 18:41:47 | r.david.murray-old | set | nosy: + r.david.murray-old; messages: + msg85238
2009-04-02 18:29:50 | skip.montanaro | set | messages: + msg85234
2009-04-02 17:55:26 | skip.montanaro | set | messages: + msg85229
2009-04-02 03:52:40 | r.david.murray | set | files: + issue4847-doc.patch; assignee: georg.brandl -> r.david.murray
2009-04-02 03:51:44 | r.david.murray | set | files: - issue4847-doc.patch
2009-04-02 03:23:14 | r.david.murray | set | files: + issue4847-doc.patch; type: behavior; messages: + msg85152; stage: commit review
2009-04-01 23:02:09 | r.david.murray | set | nosy: + r.david.murray; messages: + msg85108
2009-04-01 16:50:03 | gvanrossum | set | nosy: + gvanrossum; messages: + msg85048
2009-03-10 10:42:56 | skip.montanaro | set | priority: normal -> release blocker; messages: + msg83424
2009-03-09 15:22:14 | skip.montanaro | set | messages: + msg83391
2009-03-09 14:59:37 | skip.montanaro | set | messages: + msg83388
2009-03-09 14:40:34 | pitrou | set | messages: + msg83385
2009-03-09 14:18:06 | skip.montanaro | set | messages: + msg83383
2009-03-09 14:14:11 | skip.montanaro | set | messages: + msg83382
2009-03-09 14:08:25 | skip.montanaro | set | messages: + msg83381
2009-03-09 11:40:12 | pitrou | set | messages: + msg83372
2009-03-09 11:32:14 | sjmachin | set | messages: + msg83370
2009-03-09 10:39:33 | pitrou | set | messages: + msg83368
2009-03-09 06:49:09 | sjmachin | set | messages: + msg83359; versions: + Python 3.1
2009-03-09 06:02:14 | sjmachin | set | nosy: + skip.montanaro; messages: + msg83357
2009-03-09 05:11:31 | jdwhitley | set | files: + csv.diff; nosy: + jdwhitley; messages: + msg83356
2009-02-24 07:25:42 | sjmachin | set | nosy: + sjmachin; messages: + msg82661
2009-01-06 01:31:13 | vstinner | set | files: + csv_doc.patch; keywords: + patch; messages: + msg79224
2009-01-05 17:51:28 | jaywalker | set | messages: + msg79176
2009-01-05 17:47:20 | pitrou | set | priority: normal; assignee: georg.brandl; messages: + msg79174; components: + Documentation, - Library (Lib); nosy: + georg.brandl, pitrou
2009-01-05 17:29:53 | jaywalker | set | messages: + msg79172
2009-01-05 17:23:43 | vstinner | set | messages: + msg79171
2009-01-05 17:19:12 | jaywalker | set | messages: + msg79170
2009-01-05 17:05:58 | vstinner | set | nosy: + vstinner; messages: + msg79166
2009-01-05 17:03:24 | jaywalker | create |