This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: csv fails when file is opened in binary mode
Type: behavior Stage: commit review
Components: Documentation Versions: Python 3.0, Python 3.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: benjamin.peterson, georg.brandl, gvanrossum, jaywalker, jdwhitley, pitrou, r.david.murray, sjmachin, skip.montanaro, vstinner
Priority: normal Keywords: patch

Created on 2009-01-05 17:03 by jaywalker, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
csv_doc.patch vstinner, 2009-01-06 01:31
csv.diff jdwhitley, 2009-03-09 05:11 Patch against python 3.1a1
issue4847-doc.patch r.david.murray, 2009-04-02 03:52
Messages (33)
msg79165 - (view) Author: (jaywalker) Date: 2009-01-05 17:03
The following code from the documentation fails:
#################
import csv
reader = csv.reader(open("eggs.csv", "rb"))
for row in reader:
    print(row)
#####################
The output is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3, in <module>
_csv.Error: iterator should return strings, not bytes (did you open the
file in text mode?)


It works as expected in python 2.6
msg79166 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-05 17:05
Do you expect to be able to read CSV as bytes or just to fix the 
documentation example?
msg79170 - (view) Author: (jaywalker) Date: 2009-01-05 17:19
In an unlikely scenario:
say one of the fields has an embedded \r. For instance "blahblah\r" is the value of the first column. Now open this file in text mode. What happens to this '\r' even before csv.reader sees it? If it remains intact, no problem. If it is converted to a newline in any platform, and that newline is not exactly \r, does this not introduce a problem?

Just curious...

Thanks.

msg79171 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-05 17:23
> say one of the fields has an embedded \r. For instance "blahblah\r" is the
> value of the first column. Now open this file in text mode. What happens to
> this '\r' even before csv.reader sees it?

I have rarely used the CSV format, but it sounds strange to have a newline
character in a column. Newline characters (\r and \n) are reserved to mark
the end of the line. Can you produce such a file to test? :-)

I guess that the csv module does something like readline().split(";").
msg79172 - (view) Author: (jaywalker) Date: 2009-01-05 17:29
Make it '\r\n', if you want. Such files can easily be generated by Windows text editors, or even Linux ones nowadays. Upon reading, if the file is opened in text mode, this will probably be converted to \n by Python even on Linux (I may be wrong). Thus, \r gets lost.

Not a biggie, but you never know ;0)

msg79174 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-05 17:47
You can avoid the newline translation problem by using the newline
parameter in open(). Set it to '' (the empty string) and any CR and LF
characters should remain intact.

As for the original problem, IMHO it is a documentation bug.
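
To illustrate the suggestion above, here is a minimal sketch that parses the same bytes twice, once with the default newline=None and once with newline=''. The sample data and the helper name `parse` are made up for illustration; io.TextIOWrapper is used because it accepts the same newline argument as open().

```python
import csv
import io

# Hypothetical sample: the first field ends with an embedded carriage return.
RAW = b'"blahblah\r",second\r\n'


def parse(raw, newline):
    """Decode raw bytes as text with the given newline mode, then parse as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return list(csv.reader(text))


translated = parse(RAW, None)  # default mode: the embedded '\r' becomes '\n'
preserved = parse(RAW, "")     # newline='': the embedded '\r' reaches csv.reader intact

print(translated)
print(preserved)
```

With the default mode the field comes back as 'blahblah\n'; with newline='' it comes back as 'blahblah\r'.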
msg79176 - (view) Author: (jaywalker) Date: 2009-01-05 17:51
I think what you suggest makes most sense. 
Thanks.

msg79224 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-06 01:31
Short patch fixing the examples + the description of reader() and
writer() (remove the "b" flag sentence: it's wrong, we need unicode!).
I hope that I didn't break the documentation (I tried "make text" and
I didn't get any warning).
msg82661 - (view) Author: John Machin (sjmachin) Date: 2009-02-24 07:25
Sorry, folks, we've got an understanding problem here. CSV files are
typically NOT created by text editors. They are created e.g. by "save as
csv" from a spreadsheet program, or as an output option by some database
query program. They can have just about any character in a field,
including \r and \n. Fields containing those characters should be quoted
(just like a comma) by the csv file producer. A csv reader should be
capable of reproducing the original field division. Here for example is
a dump of a little file I just created using Excel 2003:

C:\devel\csv>\python26\python -c "print repr(open('book1.csv','rb').read())"
'Field1,"Field 2 has a\nvery long\nheading",Field3\r\n1.11,2.22,3.33\r\n'

Inserting \n into a text field in Excel (using Alt-Enter) is a
well-known user trick.

Here's what we get from Python 2.6.1:
C:\devel\csv>\python26\python -c "import csv; print
repr(list(csv.reader(open('book1.csv','rb'))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]
and the same by design all the way back to Python 2.3's csv module and
its ancestor, the ObjectCraft csv module.

However with Python 3.0.1 we get:
C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book1.csv','rb')))))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
_csv.Error: iterator should return strings, not bytes (did you open the
file in text mode?)

This sentence in the documentation is NOT an error: """If csvfile is a
file object, it must be opened with the ‘b’ flag on platforms where that
makes a difference."""

The problem *IS* a "biggie".

This paragraph in the documentation (evidently introduced in 2.5) is
rather confusing:"""The parser is quite strict with respect to
multi-line quoted fields. Previously, if a line ended within a quoted
field without a terminating newline character, a newline would be
inserted into the returned field. This behavior caused problems when
reading files which contained carriage return characters within fields.
The behavior was changed to return the field without inserting newlines.
As a consequence, if newlines embedded within fields are important, the
input should be split into lines in a manner which preserves the newline
characters.""" Some examples of what it is talking about would be a very
good idea.
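
One reading of the quoted documentation paragraph can be sketched as follows, assuming a current Python 3 csv module. The sample data is invented here. The same text is split into lines twice, once dropping the line terminators and once keeping them, and each list of lines is handed to csv.reader.

```python
import csv

# Hypothetical sample: the second field of the data row contains an embedded CRLF.
DATA = 'id,note\r\n1,"line one\r\nline two"\r\n'

# str.splitlines() drops the line terminators; keepends=True preserves them.
stripped_lines = DATA.splitlines()
kept_lines = DATA.splitlines(keepends=True)

rows_stripped = list(csv.reader(stripped_lines))
rows_kept = list(csv.reader(kept_lines))

print(rows_stripped[1])  # the embedded newline is gone and the two lines are joined
print(rows_kept[1])      # the embedded '\r\n' survives inside the field
```

The reader does not insert a newline on its own, so the field only keeps its embedded line break when the lines it receives still carry their terminators.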
msg83356 - (view) Author: Jervis Whitley (jdwhitley) Date: 2009-03-09 05:11
Hi all,

This patch takes the approach of assuming utf-8 format encoding
for files opened with 'rb' directive. 

That is:

1. Check if each line is Unicode Or Bytes Type.
2. If Bytes, get char array reference to internal buffer.
3. use PyUnicode_FromString to create a new unicode object from the
char* - This step assumes UTF-8.
4. get a Py_UNICODE reference to internal unicode object buffer and 
   continue as before.

Is this in the right direction at all?

Cheers,

Jervis
msg83357 - (view) Author: John Machin (sjmachin) Date: 2009-03-09 06:02
Before patching, could we discuss the requirements?

There are two different concepts:
(1) "text" file (assume that CR and/or LF are line terminators, and
provide methods for accessing a line at a time) versus "binary" file (no
such assumptions, no such access)
(2) reading the file as a raw undecoded "bytes" file or as a decoded
"str" file.

Options for 3.X:
(1) caller uses mode 'rb', is given bytes objects back.
(2) caller uses mode 'rt' and provides an encoding, is given str objects
back.
IMPORTANT: Option 2 must NOT read the file as a collection of
"lines"; it must process it (conceptually at least) a character at a
time so that embedded CR and/or LF are not taken to be row terminators.

Following the line that 3.X should do what's best, not what we used
to do, the implication is that we choose option 2.
msg83359 - (view) Author: John Machin (sjmachin) Date: 2009-03-09 06:49
... and it looks like Option 2 might already *almost* be in place.
Continuing with the previous example (book1.csv has embedded lone LFs):

C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book1.csv','rt', encoding='ascii')))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]

Looks good. However consider book2.csv which has embedded CRLFs:
C:\devel\csv>\python30\python -c "print(repr(open('book2.csv',
'rb').read()))"
b'Field1,"Field 2 has a\r\nvery
long\r\nheading",Field3\r\n1.11,2.22,3.33\r\n'

This gives:
C:\devel\csv>\python30\python -c "import csv;
print(repr(list(csv.reader(open('book2.csv','rt', encoding='ascii')))))"
[['Field1', 'Field 2 has a\nvery long\nheading', 'Field3'], ['1.11',
'2.22', '3.33']]

Not good. It should preserve ALL characters in the field.
msg83368 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-09 10:39
> Not good. It should preserve ALL characters in the field.

Please look at the doc for open() and io.TextIOWrapper. The `newline`
parameter defaults to None, which means universal newlines with newline
translation. Setting it to '' (yes, the empty string) enables universal
newlines but disables newline translation (that is, it will split lines
on all of ['\n', '\r', '\r\n'], but will leave these newlines intact
rather than convert them to '\n').

However, I think csv should accept files opened in binary mode and be
able to deal with line endings itself. How am I supposed to know the
encoding of a CSV file? Surely Excel uses a defined, default encoding
when exporting to CSV... that knowledge should be embedded in the csv
module.
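
The newline modes described above can be compared directly. This is only a sketch: the byte string and the helper name `read_lines` are illustrative assumptions, and the third mode ('\n') is included just for contrast.

```python
import io

# One lone CR, one lone LF and one CRLF, followed by an unterminated tail.
RAW = b"a\rb\nc\r\nd"


def read_lines(newline):
    """Return the lines a text stream yields for the given newline mode."""
    stream = io.TextIOWrapper(io.BytesIO(RAW), encoding="ascii", newline=newline)
    return list(stream)


for mode in (None, "", "\n"):
    print(repr(mode), read_lines(mode))
```

With None every terminator is reported as '\n'; with '' the stream still splits at all three kinds of terminator but leaves them untouched; with '\n' it splits at '\n' only.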
msg83370 - (view) Author: John Machin (sjmachin) Date: 2009-03-09 11:32
pitrou> Please look at the doc for open() and io.TextIOWrapper. The
`newline` parameter defaults to None, which means universal newlines
with newline translation. Setting to '' (yes, the empty string) enables
universal newlines but disables newline translation ...

I had already read it. I gave it a prize for "least intuitive arg in the
language". So you plan to use that, reading "lines" instead of blocks?
You'll still have to examine which CRs and LFs are embedded and which
are line terminators. You might just as well use f.read(BLOCKSZ) and
avoid having to insist that the user explicitly write ", newline=''".

pitrou> However, I think csv should accept files opened in binary mode
and be able to deal with line endings itself. How am I supposed to know
the encoding of a CSV file? Surely Excel uses a defined, default
encoding when exporting to CSV... that knowledge should be embedded in
the csv module.

Excel has no default, because the user has no option -- the defined
encoding is "cp" + str(codepage_number_derived_from_locale), e.g.
"cp1252". Likewise other software writing delimited data to text files
will use (one of) the local legacy encoding(s).

So: (i) mode='rb' and no encoding => caller gets bytes back and needs to
do own decoding or (ii) mode='rb' and an encoding [which looks rather
daft and is currently not possible] and the caller gets str objects.
Both of these are ugly -- hence my preference for the mode="rt" variety
of solution. Do we really want the double hassle of both a str csv
implementation and a bytes csv implementation?
msg83372 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-09 11:40
> I had already read it. I gave it a prize for "least intuitive arg in the
> language".

Please open a bug, then :)

> So you plan to use that, reading "lines" instead of blocks?
> You'll still have to examine which CRs and LFs are embedded and which
> are line terminators. You might just as well use f.read(BLOCKSZ) and
> avoid having to insist that the user explicitly write ", newline=''".

Sorry, but who is "you" in that paragraph?
The csv module currently accepts any iterator yielding lines of text,
not only file objects. Changing this would be a major compatibility
break.

> Excel has no default, because the user has no option -- the defined
> encoding is "cp" + str(codepage_number_derived_from_locale), e.g.
> "cp1252".

Then Excel-generated CSV files all use different encodings? Gasp :-(
msg83381 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-09 14:08
John> Options for 3.X:
    John> (1) caller uses mode 'rb', is given bytes objects back.
    John> (2) caller uses mode 'rt' and provides an encoding, is given str
    John>     objects back.

I believe #1 is the only correct option.  If you open a file in text mode
the low-level file reading code will unify all line endings to \n.  I don't
believe it's possible to override that behavior.

Skip
msg83382 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-09 14:14
Antoine> However, I think csv should accept files opened in binary mode
    Antoine> and be able to deal with line endings itself. How am I supposed
    Antoine> to know the encoding of a CSV file? Surely Excel uses a
    Antoine> defined, default encoding when exporting to CSV... that
    Antoine> knowledge should be embedded in the csv module.

In fact, the csv module actually requires files to be opened in binary mode
but doesn't enforce that.  That it works in all but a few corner cases is
because so few CSV files contain fields containing embedded newlines.

Why doesn't open() allow you to specify an encoding when opening in binary
mode?  Even though it would be unused by the File object, it would serve as
an annotation for code using that open File object so they can do the
appropriate bytes-to-unicode conversion.

Skip
msg83383 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-09 14:18
Antoine> .... The csv module currently accepts any iterator yielding
    Antoine> lines of text, not only file objects....

If the iterator yields bytes and had an encoding attribute that would serve
to allow the csv reader to perform the necessary conversion.  It would be
incumbent upon the user to specify the encoding.

Skip
msg83385 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-09 14:40
[newlines translation]
> I don't
> believe it's possible to override that behavior.

It is, see my comments above.

> Why doesn't open() allow you to specify an encoding when opening in binary
> mode?

It would be more logical to pass the encoding argument to the csv reader
object instead.
msg83388 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-09 14:59
Antoine> It would be more logical to pass the encoding argument to the
    Antoine> csv reader object instead.

What should be the default?
msg83391 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-09 15:22
me> What should be the default?

Scratch that.  If the iterator passed to csv.reader is in a mode which will
cause it to emit bytes instead of unicode objects the caller must give an
encoding.  The csv.reader code will then perform the necessary
bytes-to-unicode conversion.  If bytes are returned by the iterator but no
encoding was given it should raise an exception (something standard?
something new?).

Skip
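
The behaviour proposed above can be approximated in user code. The sketch below is not part of the csv module: `reader_from_bytes` and its `encoding` parameter are hypothetical names, and TypeError is just one possible choice for the exception. csv.reader itself takes no encoding argument.

```python
import csv
import io


def reader_from_bytes(lines, encoding=None, **fmtparams):
    """Hypothetical wrapper: decode bytes lines before handing them to csv.reader."""

    def decoded():
        for line in lines:
            if isinstance(line, bytes):
                if encoding is None:
                    # Bytes without an encoding cannot be converted safely.
                    raise TypeError("iterator returned bytes but no encoding was given")
                line = line.decode(encoding)
            yield line

    return csv.reader(decoded(), **fmtparams)


# Iterating a binary stream splits on b'\n' only and keeps the terminators,
# so an embedded CRLF inside a quoted field reaches the reader intact.
raw = io.BytesIO(b'caf\xe9,1\r\n"a\r\nb",2\r\n')
rows = list(reader_from_bytes(raw, encoding="cp1252"))
print(rows)
```

Lines that are already str pass through unchanged, so the same wrapper accepts text iterators as well.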
msg83424 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-03-10 10:42
This issue seems to have simply been overlooked when 3.0 was released.
It should be fixed in the next round of 3.0 and 3.1 updates.  Any
feedback on the idea that the csv.reader constructor (and probably the
DictReader and proposed NamedTupleReader constructors) should take an
optional encoding argument?  In fact, the writers should as well
which would inform the writer how to encode the output when writing.
msg85048 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-04-01 16:50
I think it's good if it allowed passing in a binary file and an
encoding, but I think it would be crazy if it wouldn't also take a text
file.  After all the primary purpose of a CSV file, edge cases
notwithstanding, is to look like text to the end user in a text editor
(as opposed to Excel spreadsheets, which look like binary gobbledygook
unless opened with Excel).

Are there any use cases for returning bytes strings instead of text
strings?  If so that should probably be another flag.
msg85108 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-01 23:02
I've added some unit tests for embedded newlines, and py3k csv passes
(on linux at least) when newline='' is used.  Unless someone can provide
a test case that fails when newline='' is used, I propose we fix the
documentation and leave the code alone.
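
A round-trip check in the spirit of the tests mentioned above might look like the sketch below. It is not taken from the actual test suite; the helper name `roundtrip` and the sample row are assumptions. An in-memory io.StringIO created with newline='' stands in for a file opened with newline=''.

```python
import csv
import io


def roundtrip(rows):
    """Write rows with csv.writer, then read them back with csv.reader."""
    # newline='' keeps StringIO from translating line endings in either direction.
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return list(csv.reader(buf))


# One field for each kind of embedded line break.
rows = [["plain", "lf\ninside", "cr\rinside", "crlf\r\ninside"]]
assert roundtrip(rows) == rows
```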
msg85152 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-02 03:23
I'm attaching a proposed doc patch for comment.  I replace mentions of
'rb' with "newline=''", including in the examples.  I also deleted the
unicode discussion (since CSV obviously handles unicode now) as well as
the extensive unicode examples that are no longer relevant.  (There are
a couple other tweaks I made as I went along, but nothing substantial.)
msg85229 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-04-02 17:55
David> I've added some unit tests for embedded newlines, and py3k csv
    David> passes (on linux at least) when newline='' is used.  Unless
    David> someone can provide a test case that fails when newline='' is
    David> used, I propose we fix the documentation and leave the code
    David> alone.

This thread is getting a bit long.  Can someone summarize how the expected
usage of the csv module is supposed to change?  If I read things correctly,
instead of requiring (in the general case) that csv files be opened in
binary mode, the requirement will be that they be opened with newline=''.
This will thwart any attempts by the io module at newline translation, but
since the file is still opened in text mode its contents will implicitly be
Unicode (or Unicode translated to bytes with a specific encoding).  That
encoding will also be specified in the call to open().

Is this about correct?  Do any test cases need to be updated or added?  I
notice that something called BytesIO is imported from io but not used.  Were
some test cases removed which used to involve that class or is that a 2to3
artifact?

Skip
msg85234 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2009-04-02 18:29
David> I also deleted the unicode discussion (since CSV obviously
    David> handles unicode now) ...

Maybe there should be a simple example showing use of the encoding parameter
to open() to encode Unicode on write and decode to Unicode on read?

Skip
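
Such an example could look roughly like the following sketch. The file name, the sample rows and the helper name `write_then_read` are invented for illustration. The encoding is passed to open() for both writing and reading, together with newline=''.

```python
import csv
import os
import tempfile

# Hypothetical sample rows containing non-ASCII characters.
ROWS = [["name", "city"], ["Zo\u00eb", "Krak\u00f3w"], ["Jos\u00e9", "S\u00e3o Paulo"]]


def write_then_read(rows, encoding):
    """Write rows to a temporary CSV file, then return its raw bytes and parsed rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "people.csv")
        # Text mode with an explicit encoding; newline='' leaves line endings to csv.
        with open(path, "w", encoding=encoding, newline="") as f:
            csv.writer(f).writerows(rows)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding=encoding, newline="") as f:
            return raw, list(csv.reader(f))


raw, decoded = write_then_read(ROWS, "utf-8")
print(raw)
print(decoded)
```

The rows read back are str objects equal to the ones written, and the bytes on disk use the requested encoding with the writer's default '\r\n' row terminator.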
msg85238 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-02 18:41
On Thu, 2 Apr 2009 at 17:55, Skip Montanaro wrote:
> This thread is getting a bit long.  Can someone summarize how the expected
> usage of the csv module is supposed to change?  If I read things correctly,
> instead of requiring (in the general case) that csv files be opened in
> binary mode, the requirement will be that they be opened with newline=''.
> This will thwart any attempts by the io module at newline translation, but
> since the file is still opened in text mode its contents will implicitly be
> Unicode (or Unicode translated to bytes with a specific encoding).  That
> encoding will also be specified in the call to open().

I believe that is an accurate summary.

> Is this about correct?  Do any test cases need to be updated or added?  I
> notice that something called BytesIO is imported from io but not used.  Were
> some test cases removed which used to involve that class or is that a 2to3
> artifact?

I will look into this, and add an encoding example to the docs as you
suggest in another email.

--David
msg85336 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-04-03 21:56
So this is a doc bug? If so, need it still block tomorrow's release?
msg85352 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-04 00:25
Nope.  Sorry I forgot to change the priority.
msg85362 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-04 01:31
> Is this about correct?  Do any test cases need to be updated or added?  I
> notice that something called BytesIO is imported from io but not used.
> Were some test cases removed which used to involve that class or is that a 2to3
> artifact?

The set of test cases is almost exactly parallel between 2.7 and 3.1. 
3.1 adds an additional DictReader test, and refactors one set of tests
to make the code simpler.  Other than that, it is just a matter of
replacing the file opening and closing code that uses binary mode with
'with' statements where the open uses newline=''.  BytesIO is not used
so would appear to be a conversion artifact.
msg85363 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-04 01:33
Oh, yeah, other 3.1 differences are that the unicode test is uncommented
and updated, and a test is added to make sure nulls are handled correctly.
msg85365 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-04-04 01:48
Doc patch applied in r71116.
History
Date User Action Args
2022-04-11 14:56:43 admin set github: 49097
2009-04-05 16:28:33 georg.brandl link issue5455 superseder
2009-04-04 01:48:02 r.david.murray set status: open -> closed
    resolution: fixed
    messages: + msg85365
2009-04-04 01:33:54 r.david.murray set messages: + msg85363
2009-04-04 01:31:38 r.david.murray set messages: + msg85362
2009-04-04 00:25:50 r.david.murray set priority: release blocker -> normal
    messages: + msg85352
2009-04-03 21:56:45 benjamin.peterson set nosy: + benjamin.peterson
    messages: + msg85336
2009-04-02 18:41:47 r.david.murray-old set nosy: + r.david.murray-old
    messages: + msg85238
2009-04-02 18:29:50 skip.montanaro set messages: + msg85234
2009-04-02 17:55:26 skip.montanaro set messages: + msg85229
2009-04-02 03:52:40 r.david.murray set files: + issue4847-doc.patch
    assignee: georg.brandl -> r.david.murray
2009-04-02 03:51:44 r.david.murray set files: - issue4847-doc.patch
2009-04-02 03:23:14 r.david.murray set files: + issue4847-doc.patch
    type: behavior
    messages: + msg85152
    stage: commit review
2009-04-01 23:02:09 r.david.murray set nosy: + r.david.murray
    messages: + msg85108
2009-04-01 16:50:03 gvanrossum set nosy: + gvanrossum
    messages: + msg85048
2009-03-10 10:42:56 skip.montanaro set priority: normal -> release blocker
    messages: + msg83424
2009-03-09 15:22:14 skip.montanaro set messages: + msg83391
2009-03-09 14:59:37 skip.montanaro set messages: + msg83388
2009-03-09 14:40:34 pitrou set messages: + msg83385
2009-03-09 14:18:06 skip.montanaro set messages: + msg83383
2009-03-09 14:14:11 skip.montanaro set messages: + msg83382
2009-03-09 14:08:25 skip.montanaro set messages: + msg83381
2009-03-09 11:40:12 pitrou set messages: + msg83372
2009-03-09 11:32:14 sjmachin set messages: + msg83370
2009-03-09 10:39:33 pitrou set messages: + msg83368
2009-03-09 06:49:09 sjmachin set messages: + msg83359
    versions: + Python 3.1
2009-03-09 06:02:14 sjmachin set nosy: + skip.montanaro
    messages: + msg83357
2009-03-09 05:11:31 jdwhitley set files: + csv.diff
    nosy: + jdwhitley
    messages: + msg83356
2009-02-24 07:25:42 sjmachin set nosy: + sjmachin
    messages: + msg82661
2009-01-06 01:31:13 vstinner set files: + csv_doc.patch
    keywords: + patch
    messages: + msg79224
2009-01-05 17:51:28 jaywalker set messages: + msg79176
2009-01-05 17:47:20 pitrou set priority: normal
    assignee: georg.brandl
    messages: + msg79174
    components: + Documentation, - Library (Lib)
    nosy: + georg.brandl, pitrou
2009-01-05 17:29:53 jaywalker set messages: + msg79172
2009-01-05 17:23:43 vstinner set messages: + msg79171
2009-01-05 17:19:12 jaywalker set messages: + msg79170
2009-01-05 17:05:58 vstinner set nosy: + vstinner
    messages: + msg79166
2009-01-05 17:03:24 jaywalker create