classification
Title: CSV Null Byte Error
Type: enhancement Stage:
Components: Extension Modules Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: bobbyocean, r.david.murray, serhiy.storchaka, skip.montanaro
Priority: normal Keywords:

Created on 2016-07-21 05:02 by bobbyocean, last changed 2016-07-21 22:12 by skip.montanaro.

Files
File name Uploaded Description Edit
nul.csv skip.montanaro, 2016-07-21 14:12
Opening_CSV.png bobbyocean, 2016-07-21 21:16
Messages (9)
msg270908 - (view) Author: Bobby Ocean (bobbyocean) Date: 2016-07-21 05:02
I think this has been asked before, but it has been a while, and I think it needs to be readdressed.

Why doesn't the CSV module handle NULL bytes?

Let me give some examples:

ascii = [chr(n) for n in range(256)]  # No errors
print(ascii)  # No errors
print(dict(zip(ascii, ascii)))  # No errors

with open('temp', 'w') as f:  # open for writing
    f.writelines(ascii)  # No errors
with open('temp', 'r') as f:
    for line in f:
        print(line)  # No errors

Python handles every ASCII chr: DEL, ESC, RETURN, etc. It displays those characters, uses them as keys or values, etc.

But now try, 

import csv
data = csv.DictReader(ascii) 
for line in data: print(line) #NULL Byte Error

But we can fix this quickly,

ascii.pop(0)
data = csv.DictReader(ascii)
for line in data: print(line) #No Errors

Doesn't it seem odd that we allow the use of every chr as keys, values, prints, etc., but then hold a special exception for the csv module, and that special exception is not for ESC or DEL or any other non-printable chr, but for NULL?
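A smaller reproduction of the report above, as a sketch. The helper name `try_parse` is hypothetical, and whether the NUL line raises depends on the Python version (this report is against 3.6), so the code records the outcome instead of assuming it.

```python
import csv


def try_parse(lines):
    """Parse an iterable of lines with csv.reader.

    Returns (rows, None) on success or (None, error) if csv.Error is raised.
    `try_parse` is a hypothetical helper used only for this illustration.
    """
    try:
        return list(csv.reader(lines)), None
    except csv.Error as exc:
        return None, exc


# A line without NUL parses normally.
clean_rows, clean_error = try_parse(["a,b", "c,d"])

# A line with an embedded NUL either parses or raises csv.Error,
# depending on the interpreter version.
nul_rows, nul_error = try_parse(["a\x00b,c"])

# Removing the NUL first (the same idea as ascii.pop(0)) always parses.
stripped_rows, stripped_error = try_parse(
    line.replace("\x00", "") for line in ["a\x00b,c"]
)
```

On an interpreter that rejects NULs, `nul_error` holds the csv.Error and `nul_rows` is None; otherwise the reverse.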
msg270923 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-21 13:40
By "discussed before" I presume you are referring to issue 1599055.

It is true that there have been changes since then in Python's handling of null bytes.  Perhaps it is indeed time to revisit this.  I'll leave that to the experts...this can be closed as a duplicate of issue 1599055 if I'm wrong about things having possibly changed in the interim.
msg270926 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2016-07-21 14:12
Beyond whether or not the csv module can handle NUL bytes, you might figure out if Excel will. Since the CSV format isn't some sort of "standard", its operational definition has always been what Excel will produce or consume.

I don't have Excel (not a Windows guy), but I do have Gnumeric and LibreOffice. I constructed a simple CSV file by hand which contains several NUL bytes to see what they would do. Gnumeric pops up a dialog and converts them to spaces (and then, oddly enough, doesn't think the file has been modified). LibreOffice didn't complain while loading the file, but when I saved it, it silently deleted the NULs.

I've attached the file should anyone like to experiment with other spreadsheets.
msg270939 - (view) Author: Bobby Ocean (bobbyocean) Date: 2016-07-21 16:14
@ Skip Montanaro, 

Actually, CSV has nothing to do with Excel. Excel is a GUI that processes CSV like many other programs. CSV stands for comma-separated values and is a basic standard for data: https://en.m.wikipedia.org/wiki/Comma-separated_values

As far as I know, many programs handle NULL bytes fine (either as an empty string or a # or something else). In any case, it really should be irrelevant how some particular programs handle a CSV file.
msg270945 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-21 17:11
Bobby: Skip is one of the authors of the csv module, and has been maintaining it since it was added to the standard library.  He knows whereof he speaks: there is no standard for csv (as noted in the article you link), and all csv parsers that want to be interoperable refer back to Microsoft's implementation when dealing with any quirks.  That implementation currently is Excel.

That said, you are right that others have adopted the format, and there is an argument to be made that we don't have to *limit* ourselves to what Microsoft supports.  Although we probably don't want to be emitting stuff that they don't without being explicit about it.
msg270948 - (view) Author: Bobby Ocean (bobbyocean) Date: 2016-07-21 17:27
I am sorry, I must have misread the history part of that article.
msg270951 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2016-07-21 18:11
I wasn't familiar with RFC 4180. (Or, quite possibly, I forgot I used to be familiar with it.) Note that at the bottom of the BNF definition of the file structure is the definition of TEXTDATA:

TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

That pretty explicitly excludes NUL bytes (%x00 I think).
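The TEXTDATA rule quoted above can be checked mechanically. This is only a sketch of that one rule: the function name `is_textdata` is made up for the example, and it does not cover the rest of the RFC 4180 grammar (comma, double quote, CR and LF are handled by other rules there).

```python
def is_textdata(ch: str) -> bool:
    """Return True if ch is in TEXTDATA = %x20-21 / %x23-2B / %x2D-7E.

    `is_textdata` is a hypothetical name for this illustration.
    """
    code = ord(ch)
    return (
        0x20 <= code <= 0x21
        or 0x23 <= code <= 0x2B
        or 0x2D <= code <= 0x7E
    )


# Code points below 128 that TEXTDATA does not cover.
excluded = [n for n in range(128) if not is_textdata(chr(n))]

nul_allowed = is_textdata("\x00")   # False: %x00 is outside every range
comma_allowed = is_textdata(",")    # False: %x2C is skipped by the ranges
letter_allowed = is_textdata("a")   # True: %x61 is inside %x2D-7E
```

The excluded set is the control characters %x00-1F, the double quote %x22, the comma %x2C, and DEL %x7F.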

I'd still like someone to load my nul.csv file into Excel and report back what it does (or doesn't) do with it.
msg270959 - (view) Author: Bobby Ocean (bobbyocean) Date: 2016-07-21 21:16
I attempted to open the file in Excel; it raised no errors.

In any case (regardless of Microsoft concerns), I am glad to see a discussion started and possibly some concern that an update might be useful to the community (it would certainly cut down on the number of Stack Exchange posts about this very topic). I certainly would put my vote to have csv handle the NULL byte, in the same way as Python does natively.

Thanks for your time. Oh, and since you are one of the authors, thanks for writing/working on the csv module, VERY USEFUL. :-)
msg270964 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2016-07-21 22:12
Thanks. The display you showed looks about like what I saw in LibreOffice. If you export it back to another CSV file, does the new file match the original exactly, or does it (like LibreOffice) save a file without NUL bytes?

I don't mind having the discussion, but even though we have traditionally treated CSV files as binary files in Python (at least when I was closely involved in the 2.x days), that was mostly so end-of-line sequences weren't corrupted. As others have pointed out, in 2.x Python String objects stored the data as a normal NUL-terminated pointer-to-char for efficiency when interacting with C libraries. C uses NUL as a string terminator, so we couldn't work with embedded NULs. I haven't looked at the 3.x string stuff (I know Unicode is much more intimately involved). If it still maintains that close working relationship with the typical C strings, supporting NUL bytes will be problematic.
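The NUL-terminator convention described above can be seen from Python with ctypes. This is only an illustration of the general C string convention, not of how the csv module or Python's own string objects are implemented.

```python
import ctypes

data = b"ab\x00cd"

# create_string_buffer copies the bytes into a C char array
# and appends a terminating NUL.
buf = ctypes.create_string_buffer(data)

# .value reads the buffer the way C string functions do:
# it stops at the first NUL.
c_view = buf.value   # b"ab"

# .raw returns every byte in the buffer, embedded NUL included.
full = buf.raw       # b"ab\x00cd\x00"

# A Python str carries its own length, so an embedded NUL is just
# another character.
text = "ab\x00cd"
text_length = len(text)  # 5
```

Anything that hands the data to code relying on the terminator sees only the part before the first NUL.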

In cases where the underlying representation isn't quite what I want, I've been able to get away with a file wrapper which suitably mangles the input before passing it up the chain to the csv module. For example, the __next__ method of your file wrapper could delete NULs or replace them with something suitably innocuous, like "\001", or some other non-printable character you are certain won't be in the input. If you want to preserve NULs, reverse the translation during the write().
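A minimal sketch of the wrapper idea described above. The class names `NulReplacingReader` and `NulRestoringWriter` are invented for this example, and "\x01" is assumed never to occur in the real data.

```python
import csv
import io


class NulReplacingReader:
    """Iterator wrapper that replaces NULs before csv sees each line."""

    def __init__(self, lines, placeholder="\x01"):
        self._lines = iter(lines)
        self._placeholder = placeholder

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying iterator propagates unchanged.
        return next(self._lines).replace("\x00", self._placeholder)


class NulRestoringWriter:
    """File wrapper whose write() reverses the translation."""

    def __init__(self, fileobj, placeholder="\x01"):
        self._fileobj = fileobj
        self._placeholder = placeholder

    def write(self, text):
        return self._fileobj.write(text.replace(self._placeholder, "\x00"))


original = "name,value\r\nfoo,a\x00b\r\n"

# Read: the csv module only ever sees the placeholder.
source = io.StringIO(original, newline="")
rows = list(csv.reader(NulReplacingReader(source)))

# Write: the placeholder is turned back into NUL on the way out.
sink = io.StringIO(newline="")
csv.writer(NulRestoringWriter(sink)).writerows(rows)
round_tripped = sink.getvalue()
```

Here `rows` holds "a\x01b" in place of the NUL field, and `round_tripped` equals `original`. If the placeholder can appear in real input, the round trip is no longer lossless, so pick it with care.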
History
Date User Action Args
2016-07-21 22:12:58 skip.montanaro set messages: + msg270964
2016-07-21 21:16:30 bobbyocean set files: + Opening_CSV.png; messages: + msg270959
2016-07-21 18:11:54 skip.montanaro set messages: + msg270951
2016-07-21 17:27:09 bobbyocean set messages: + msg270948
2016-07-21 17:11:48 r.david.murray set messages: + msg270945
2016-07-21 16:14:52 bobbyocean set messages: + msg270939
2016-07-21 14:12:34 skip.montanaro set files: + nul.csv; messages: + msg270926
2016-07-21 13:40:57 r.david.murray set nosy: + r.david.murray, skip.montanaro, serhiy.storchaka; messages: + msg270923
2016-07-21 09:03:57 SilentGhost set versions: + Python 3.6, - Python 3.5
2016-07-21 05:02:48 bobbyocean create