classification
Title: csv module cannot handle embedded \r
Type: behavior
Stage: test needed
Components: Extension Modules, Library (Lib)
Versions: Python 2.6
process
Status: open
Resolution:
Dependencies: 1465014 (CSV regression in 2.5a1: multi-line cells)
Superseder:
Assigned To: andrewmcnamara
Nosy List: ajaksu2, andrewmcnamara, gnbond, goodger, rhettinger, skip.montanaro (6)
Priority: normal
Keywords:

Created on 2004-06-07 04:46 by gnbond, last changed 2009-02-14 13:56 by ajaksu2.

Files
File name: tcsv.py
Uploaded: gnbond, 2004-06-07 04:47
Messages (7)
msg21057 - Author: Gregory Bond (gnbond) Date: 2004-06-07 04:46
CSV module cannot handle the case of embedded \r (i.e.
carriage return) in a field.

As far as I can see, this is hard-coded into the _csv.c
file and cannot be fixed with Dialect changes.
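For readers on a current Python 3, the point about Dialect changes can be checked with a small sketch. It assumes the present-day behaviour described in the csv documentation, where the reader ignores the dialect's lineterminator; the helper name parse_with_lf_dialect is made up for illustration and is not part of the original report.

```python
import csv


def parse_with_lf_dialect(line):
    """Parse one line while asking for an LF-only lineterminator.

    Returns the parsed rows, or the csv.Error raised by the reader.
    (Hypothetical helper, for illustration only.)
    """
    try:
        return list(csv.reader([line], lineterminator="\n"))
    except csv.Error as exc:
        return exc


# A carriage return inside a quoted field is kept as field data.
quoted = parse_with_lf_dialect('a,"b\rc",d\n')

# Outside quotes, the reader still treats \r as a line ending,
# regardless of the lineterminator requested above.
unquoted = parse_with_lf_dialect('a,b\rc,d\n')

print(quoted)
print(repr(unquoted))
```

In other words, on current versions quoting is what protects an embedded \r; the lineterminator setting only influences the writer.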
msg21058 - Author: Raymond Hettinger (rhettinger) Date: 2004-06-07 05:02
Skip, does this coincide with your planned switchover to
universal newlines?
msg21059 - Author: Andrew McNamara (andrewmcnamara) Date: 2004-06-07 05:32
I suspect this restriction (CR appearing within a quoted 
field) is a historical accident and can be safely removed. 
msg21060 - Author: Skip Montanaro (skip.montanaro) Date: 2004-06-07 11:25
It certainly intersects with it somehow.  ;-)  If nothing else, it
will serve as a useful test case.
msg21061 - Author: Andrew McNamara (andrewmcnamara) Date: 2005-01-13 11:34
If you're interested, I've just checked in a change to the CVS head for 
Python 2.5 that may, at least partially, fix this problem (if you try it, let me 
know how it goes).
msg21062 - Author: David Goodger (goodger) Date: 2006-04-05 15:35
I just filed a bug (http://www.python.org/sf/1465014) that
seems to be related to this. Revision 38290 on
Modules/_csv.c includes the addition of this code:

    else if (c == '\n' || c == '\r') {
        self->state = EAT_CRNL;
        break;
    }

(and similar). This seems to be eating (deleting) the line-break
characters, but newlines inside a field used to be significant.

Embedded line breaks are allowed, according to RFC 4180
(http://www.ietf.org/rfc/rfc4180.txt). And according to the
Wikipedia entry
(http://en.wikipedia.org/wiki/Comma-separated_values), "a
line break within an element must be preserved."
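As a rough check of the "must be preserved" requirement on a current Python 3, a write-then-read round trip can be sketched as follows. The sample row is invented for illustration, and the buffers are opened with newline="" as the csv documentation recommends so that no newline translation happens outside the csv module.

```python
import csv
import io

# Hypothetical sample row: the third field contains an embedded CRLF.
row = ["fld1", "fld2", "line one\r\nline two", "fld4"]

# Write the row; newline="" keeps the line endings exactly as csv emits them.
out = io.StringIO(newline="")
csv.writer(out, dialect="excel").writerow(row)
text = out.getvalue()

# Read it back from a fresh buffer, again without newline translation.
parsed = next(csv.reader(io.StringIO(text, newline=""), dialect="excel"))

print(repr(text))
print(parsed)
```

If the embedded line break survives, parsed compares equal to row.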
msg82052 - Author: Daniel Diniz (ajaksu2) Date: 2009-02-14 13:56
IIUC, I get the correct behavior:

trunk-py$ ./python ~/Desktop/tcsv.py
['fld1', 'fld2', 'fld3 ', 'fld4']
['fld1', 'fld2', 'fld3 \r', 'fld4']

trunk-py$ cat ~/Desktop/tcsv.py
#! /usr/local/bin/python

import csv

d = 'fld1,fld2,"fld3 ",fld4\r\n'
d2 = 'fld1,fld2,"fld3 \r'
d3 = '",fld4\r\n'

r = csv.reader([d, d2, d3], dialect="excel")
for f in r:
        print f
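The script above is written for Python 2. A minimal Python 3 port, assuming nothing beyond the standard csv module, might look like this; the only changes are the print call and collecting the rows into a list so they can be inspected.

```python
import csv

# Same three input chunks as tcsv.py: the second record is split across
# two strings, with a bare \r inside the quoted third field.
d = 'fld1,fld2,"fld3 ",fld4\r\n'
d2 = 'fld1,fld2,"fld3 \r'
d3 = '",fld4\r\n'

rows = list(csv.reader([d, d2, d3], dialect="excel"))
for row in rows:
    print(row)
```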
History
Date User Action Args
2009-02-14 13:56:28 ajaksu2 set
    versions: + Python 2.6
    nosy: + ajaksu2
    messages: + msg82052
    dependencies: + CSV regression in 2.5a1: multi-line cells
    components: + Extension Modules
    type: behavior
    stage: test needed
2004-06-07 04:46:56 gnbond create