Issue 23041: csv needs more quoting rules

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67230

classification

Title:	csv needs more quoting rules
Type:	enhancement	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	skip.montanaro	Nosy List:	berker.peksag, erdnaxeli, krypten, msetina, r.david.murray, rhettinger, samwyse, serhiy.storchaka, skip.montanaro, tegdev, yoonghm
Priority:	normal	Keywords:	easy, patch

Created on 2014-12-12 16:36 by samwyse, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
issue23041.patch	krypten, 2014-12-14 05:14	Patch for resolving issue 23041. According to message 232560 and 232563	review
issue23041_test.patch	krypten, 2014-12-14 05:45	Test cases added for testing as behaviour proposed in message 232560	review

Pull Requests
URL	Status	Linked	Edit
PR 29469	open	skip.montanaro, 2021-11-08 13:57

Messages (26)
msg232560 - (view)	Author: Samwyse (samwyse) *	Date: 2014-12-12 16:36
The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow. Current behavior (which will remain unchanged) >>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL) >>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42]) "foo","","42" >>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC) >>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42]) "foo","",42 Proposed behavior >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL) >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42]) "foo",,"42" >>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS) >>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42]) "foo",,42
msg232561 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-12-12 16:52
As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me.
msg232563 - (view)	Author: Samwyse (samwyse) *	Date: 2014-12-12 17:27
David: That's not a problem for me. Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules. typedef enum { QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE, QUOTE_STRINGS, QUOTE_NOTNULL } QuoteStyle; static StyleDesc quote_styles[] = { { QUOTE_MINIMAL, "QUOTE_MINIMAL" }, { QUOTE_ALL, "QUOTE_ALL" }, { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" }, { QUOTE_NONE, "QUOTE_NONE" }, { QUOTE_STRINGS, "QUOTE_STRINGS" }, { QUOTE_NOTNULL, "QUOTE_NOTNULL" }, { 0 } }; switch (dialect->quoting) { case QUOTE_NONNUMERIC: quoted = !PyNumber_Check(field); break; case QUOTE_ALL: quoted = 1; break; case QUOTE_STRINGS: quoted = PyString_Check(field); break; case QUOTE_NOTNULL: quoted = field != Py_None; break; default: quoted = 0; break; } " csv.QUOTE_MINIMAL means only when required, for example, when a\n" " field contains either the quotechar or the delimiter\n" " csv.QUOTE_ALL means that quotes are always placed around fields.\n" " csv.QUOTE_NONNUMERIC means that quotes are always placed around\n" " fields which do not parse as integers or floating point\n" " numbers.\n" " csv.QUOTE_STRINGS means that quotes are always placed around\n" " fields which are strings. Note that the Python value None\n" " is not a string.\n" " csv.QUOTE_NOTNULL means that quotes are only placed around fields\n" " that are not the Python value None.\n"
msg232596 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2014-12-13 06:46
Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?
msg232607 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2014-12-13 13:05
It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?
msg232630 - (view)	Author: Chaitanya agrawal (krypten) *	Date: 2014-12-14 05:14
Used function PyUnicode_Check instead of PyString_Check
msg232676 - (view)	Author: Samwyse (samwyse) *	Date: 2014-12-15 19:40
Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files.
msg232677 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2014-12-15 19:52
If I understand correctly, your software needs to distinguish between # wrote ["foo", "", 42, None] with quote_all in effect "foo","","42","" and # wrote ["foo", None, 42, ""] with quote_nonnull in effect "foo",,"42","" so you in effect want to transmit some type information through a CSV file?
msg232681 - (view)	Author: Samwyse (samwyse) *	Date: 2014-12-15 21:17
Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all. Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know.
msg261141 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2016-03-02 23:06
Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False? I think the better course here is to either: * subclass csv.DictWriter, use dictionaries as your element type, and have its writerow method do the application-specific work. * define a writerow() function which does something similar (essentially wrapping csv.writerow()). If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.
msg261146 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-03-03 05:37
The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program. If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.
msg261147 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2016-03-03 07:09
I was thinking adding a more flexible API like: ... spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None) ... But that would require too much change in the csv module (or at least its implementation wouldn't be trivial). I agree that subclassing DictWriter is a much better way to achieve this.
msg341460 - (view)	Author: (tegdev)	Date: 2019-05-05 17:19
The correct handling of None values belongs to the csv module. There is a use case to migrate a DB2 database to PostgreSQL. DB2 has a command line tool "db2 export ..." which produces csv-files. A row ['Hello', null, 'world'] is exported to "Hello,,"world". I would like to read in these exports with python and put it to PostgreSQL. But with the csv library I can't read it in correctly. The input is converted to: ['Hello', '', 'world'] It should read as: ['Hello', None, 'world'] It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow. And last but not least: if someone writes a list the reading should the identity. Thats not True for the csv libraray. Example: import csv hello_out_lst = ['Hello', None, 'world'] with open('hello.csv', 'w') as ofh: writer = csv.writer(ofh, delimiter=',') writer.writerow(hello_out_lst) with open('hello.csv', 'r') as ifh: reader = csv.reader(ifh, delimiter=',') for row in reader: hello_in_lst = row is_equal = hello_out_lst == hello_in_lst print(f'{hello_out_lst} is equal {hello_in_lst} ? {is_equal}') The result is: ['Hello', None, 'world'] is equal ['Hello', '', 'world'] ? False
msg358461 - (view)	Author: Yoong Hor Meng (yoonghm)	Date: 2019-12-16 02:20
There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful.
msg396621 - (view)	Author: Alexandre Morignot (erdnaxeli)	Date: 2021-06-28 09:23
I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values. Could we reconsider this issue and the proposed solution?
msg396641 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-28 15:57
Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after? >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL) >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42]) "",,"42" ? Can you modify the original two patches to restrict to QUOTE_NONNULL?
msg396642 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-28 15:58
Missed tweaking a couple settings.
msg396643 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-06-28 16:00
Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/ Not, Non, None... Perl would treat them all the same, right? <wink>
msg401607 - (view)	Author: Miha Šetina (msetina)	Date: 2021-09-10 18:28
The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data. The csv.QUOTE_STRINGS would enable a smaller file size. Our case is that we are moving data into Synapse and have to resort to risk prone string replace functionality to provide the MS Synapse COPY a proper support for empty string and NULL
msg401608 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-09-10 18:35
Update version - too late for anything older than 3.11.
msg405951 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-11-08 14:02
Note to @samwyse and @krypten: I updated the patches and created a pull request on GitHub, but I have no way of knowing if at least krypten has signed a CLA for Python. Since you're the author of the original patches, we need to verify that you have.
msg406013 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2021-11-09 12:06
Further question... All the discussion has been on the writer side of the csv module. Is there any reason that using QUOTE_STRINGS or QUOTE_NOTNULL should have an effect when reading? For example, should this line on input "",,1,'a' produce this list ["", None, "1", "a"] with QUOTE_NOTNULL in effect, and ["", "", 1, "a"] or ["", None, 1, "a"] with QUOTE_STRINGS in effect?
msg406015 - (view)	Author: Miha Šetina (msetina)	Date: 2021-11-09 12:33
I would say that it should follow a rule: both cases values of None are output as an empty field so empty filed should map to None with both QUOTE_NOTNULL and QUOTE_STRINGS that would make: ["", None, 1, "a"] for QUOTE_STRINGS in effect. This would also make it possible to have circular usage. Export with one quoting rule should be able to read with the same rule and produce same data.
msg406017 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-11-09 12:51
The quoting style affects not only the CSV formatting, but also the CSV parsing. How QUOTE_NOTNULL and QUOTE_STRINGS will affect parsing?
msg412463 - (view)	Author: Miha Šetina (msetina)	Date: 2022-02-03 20:08
Is this still on track for python 3.11?
msg413867 - (view)	Author: Samwyse (samwyse) *	Date: 2022-02-24 00:07
I just signed the contributor agreement. (Thought I had done that last year but I don’t see any emails. Is there any place to check?) I agree that round-tripping should Bebe possible for any value of quoting. Hopefully this will finally get done before its eighth birthday.

History
Date	User	Action	Args
2022-04-11 14:58:11	admin	set	github: 67230
2022-02-24 00:07:17	samwyse	set	messages: + msg413867
2022-02-03 20:08:22	msetina	set	messages: + msg412463
2021-11-09 12:51:14	serhiy.storchaka	set	messages: + msg406017
2021-11-09 12:33:11	msetina	set	messages: + msg406015
2021-11-09 12:06:46	skip.montanaro	set	messages: + msg406013
2021-11-08 14:02:34	skip.montanaro	set	messages: + msg405951
2021-11-08 13:57:11	skip.montanaro	set	stage: needs patch -> patch review pull_requests: + pull_request27722
2021-09-10 18:35:17	skip.montanaro	set	messages: + msg401608 versions: - Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10
2021-09-10 18:28:11	msetina	set	nosy: + msetina messages: + msg401607 versions: + Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10
2021-06-28 16:00:31	skip.montanaro	set	messages: + msg396643
2021-06-28 15:58:26	skip.montanaro	set	resolution: rejected -> stage: resolved -> needs patch messages: + msg396642 versions: + Python 3.11, - Python 3.8
2021-06-28 15:57:52	skip.montanaro	set	status: closed -> open messages: + msg396641
2021-06-28 09:23:46	erdnaxeli	set	nosy: + erdnaxeli messages: + msg396621
2019-12-16 02:20:00	yoonghm	set	nosy: + yoonghm messages: + msg358461 versions: + Python 3.8, - Python 3.6
2019-05-05 17:19:38	tegdev	set	nosy: + tegdev messages: + msg341460
2016-03-03 07:09:55	berker.peksag	set	status: open -> closed messages: + msg261147 stage: patch review -> resolved
2016-03-03 05:37:28	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg261146
2016-03-02 23:06:14	skip.montanaro	set	resolution: rejected messages: + msg261141
2016-02-28 05:13:01	berker.peksag	set	nosy: + berker.peksag stage: needs patch -> patch review versions: + Python 3.6, - Python 3.5
2014-12-15 21:17:02	samwyse	set	messages: + msg232681
2014-12-15 19:52:44	skip.montanaro	set	messages: + msg232677
2014-12-15 19:40:10	samwyse	set	messages: + msg232676
2014-12-14 05:45:01	krypten	set	files: + issue23041_test.patch
2014-12-14 05:14:55	krypten	set	files: + issue23041.patch nosy: + krypten messages: + msg232630 keywords: + patch
2014-12-13 13:05:03	skip.montanaro	set	messages: + msg232607
2014-12-13 06:46:50	rhettinger	set	assignee: skip.montanaro messages: + msg232596 nosy: + rhettinger, skip.montanaro
2014-12-12 17:27:41	samwyse	set	messages: + msg232563
2014-12-12 16:52:33	r.david.murray	set	versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6 nosy: + r.david.murray messages: + msg232561 keywords: + easy stage: needs patch
2014-12-12 16:36:45	samwyse	create