classification
Title: csv needs more quoting rules
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: skip.montanaro Nosy List: berker.peksag, erdnaxeli, krypten, msetina, r.david.murray, rhettinger, samwyse, serhiy.storchaka, skip.montanaro, tegdev, yoonghm
Priority: normal Keywords: easy, patch

Created on 2014-12-12 16:36 by samwyse, last changed 2021-11-09 12:51 by serhiy.storchaka.

Files
File name Uploaded Description Edit
issue23041.patch krypten, 2014-12-14 05:14 Patch for resolving issue 23041. According to message 232560 and 232563 review
issue23041_test.patch krypten, 2014-12-14 05:45 Test cases added for testing as behaviour proposed in message 232560 review
Pull Requests
URL Status Linked Edit
PR 29469 open skip.montanaro, 2021-11-08 13:57
Messages (24)
msg232560 - (view) Author: Samwyse (samwyse) * Date: 2014-12-12 16:36
The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE.  These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes.  I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS.  The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field.  Examples follow.


Current behavior (which will remain unchanged)

>>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
>>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
"foo","","42"

>>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
>>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
"foo","",42


Proposed behavior

>>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
"foo",,"42"

>>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
>>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
"foo",,42
msg232561 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-12-12 16:52
As an enhancement, this could be added only to 3.5.  The proposal sounds reasonable to me.
msg232563 - (view) Author: Samwyse (samwyse) * Date: 2014-12-12 17:27
David:  That's not a problem for me.

Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv.  I've looked at the code online and below are the changes that I think need to be made.  My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers.  I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing.  This means that completely empty fields will continue to load as zero-length strings, not None.  I won't stand in the way of anyone wanting to "fix" that for these new rules.



typedef enum {
    QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE,
    QUOTE_STRINGS, QUOTE_NOTNULL
} QuoteStyle;



static StyleDesc quote_styles[] = {
    { QUOTE_MINIMAL,    "QUOTE_MINIMAL" },
    { QUOTE_ALL,        "QUOTE_ALL" },
    { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" },
    { QUOTE_NONE,       "QUOTE_NONE" },
    { QUOTE_STRINGS,    "QUOTE_STRINGS" },
    { QUOTE_NOTNULL,    "QUOTE_NOTNULL" },
    { 0 }
};



        switch (dialect->quoting) {
        case QUOTE_NONNUMERIC:
            quoted = !PyNumber_Check(field);
            break;
        case QUOTE_ALL:
            quoted = 1;
            break;
        case QUOTE_STRINGS:
            quoted = PyString_Check(field);
            break;
        case QUOTE_NOTNULL:
            quoted = field != Py_None;
            break;
        default:
            quoted = 0;
            break;
        }



"        csv.QUOTE_MINIMAL means only when required, for example, when a\n"
"            field contains either the quotechar or the delimiter\n"
"        csv.QUOTE_ALL means that quotes are always placed around fields.\n"
"        csv.QUOTE_NONNUMERIC means that quotes are always placed around\n"
"            fields which do not parse as integers or floating point\n"
"            numbers.\n"
"        csv.QUOTE_STRINGS means that quotes are always placed around\n"
"            fields which are strings.  Note that the Python value None\n"
"            is not a string.\n"
"        csv.QUOTE_NOTNULL means that quotes are only placed around fields\n"
"            that are not the Python value None.\n"
msg232596 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-12-13 06:46
Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?
msg232607 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-12-13 13:05
It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?
msg232630 - (view) Author: Chaitanya agrawal (krypten) * Date: 2014-12-14 05:14
Used function PyUnicode_Check instead of PyString_Check
msg232676 - (view) Author: Samwyse (samwyse) * Date: 2014-12-15 19:40
Yes, it's based on a real-world need.  I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules.  I need to create similar files for re-importation.  Right now, I have to post-process the output of my Python program to get it right.  I added in the QUOTE_STRINGS rule for completeness.  I think these two new rules would be useful for anyone wanting to create sparse CSV files.
msg232677 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-12-15 19:52
If I understand correctly, your software needs to distinguish between

# wrote ["foo", "", 42, None] with quote_all in effect
"foo","","42",""

and

# wrote ["foo", None, 42, ""] with quote_nonnull in effect
"foo",,"42",""

so you in effect want to transmit some type information through a CSV file?
msg232681 - (view) Author: Samwyse (samwyse) * Date: 2014-12-15 21:17
Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible.  It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't.  The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings.  I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes.  If we also want to print empty strings without quotes, that wouldn't impact me at all.

Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values.  This may be an empty set, I don't know.
msg261141 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2016-03-02 23:06
Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False?

I think the better course here is to either:

* subclass csv.DictWriter, use dictionaries as your element type, and have its writerow method do the application-specific work.

* define a writerow() function which does something similar (essentially wrapping csv.writerow()).

If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.
msg261146 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-03 05:37
The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program.

If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.
msg261147 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-03-03 07:09
I was thinking adding a more flexible API like:

    ...
    spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None)
    ...

But that would require too much change in the csv module (or at least its implementation wouldn't be trivial).

I agree that subclassing DictWriter is a much better way to achieve this.
msg341460 - (view) Author: (tegdev) Date: 2019-05-05 17:19
The correct handling of None values belongs to the csv module.

There is a use case to migrate a DB2 database to PostgreSQL.
DB2 has a command line tool "db2 export ..." which produces csv-files.
A row 
  ['Hello', null, 'world'] 
is exported to
  "Hello,,"world".

I would like to read in these exports with python and put it to PostgreSQL.

But with the csv library I can't read it in correctly. The input is converted to:
  ['Hello', '', 'world']
It should read as:
  ['Hello', None, 'world']

It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow.

And last but not least: if someone writes a list the reading should the identity.
Thats not True for the csv libraray.

Example:

import csv
hello_out_lst = ['Hello', None, 'world']
with open('hello.csv', 'w') as ofh:
    writer = csv.writer(ofh, delimiter=',')
    writer.writerow(hello_out_lst)

with open('hello.csv', 'r') as ifh:
    reader = csv.reader(ifh, delimiter=',')
    for row in reader:
        hello_in_lst = row

is_equal = hello_out_lst == hello_in_lst
print(f'{hello_out_lst} is equal {hello_in_lst} ?  {is_equal}')

The result is:
['Hello', None, 'world'] is equal ['Hello', '', 'world'] ?  False
msg358461 - (view) Author: Yoong Hor Meng (yoonghm) Date: 2019-12-16 02:20
There is a real requirement for csv to handle an empty field vs a empty string """".  csv.QUOTE_NOTNULL could be useful.
msg396621 - (view) Author: Alexandre Morignot (erdnaxeli) Date: 2021-06-28 09:23
I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values.

Could we reconsider this issue and the proposed solution?
msg396641 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 15:57
Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after?

>>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
"",,"42"

? Can you modify the original two patches to restrict to QUOTE_NONNULL?
msg396642 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 15:58
Missed tweaking a couple settings.
msg396643 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 16:00
Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/

Not, Non, None... Perl would treat them all the same, right? <wink>
msg401607 - (view) Author: Miha Šetina (msetina) Date: 2021-09-10 18:28
The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data.
The csv.QUOTE_STRINGS would enable a smaller file size.
Our case is that we are moving data into Synapse and have to resort to risk prone string replace functionality to provide the MS Synapse COPY a proper support for empty string and NULL
msg401608 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-09-10 18:35
Update version - too late for anything older than 3.11.
msg405951 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-11-08 14:02
Note to @samwyse and @krypten: I updated the patches and created a pull request on GitHub, but I have no way of knowing if at least krypten has signed a CLA for Python. Since you're the author of the original patches, we need to verify that you have.
msg406013 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-11-09 12:06
Further question... All the discussion has been on the writer side of the csv module. Is there any reason that using QUOTE_STRINGS or QUOTE_NOTNULL should have an effect when reading? For example, should this line on input

"",,1,'a'

produce this list

["", None, "1", "a"]

with QUOTE_NOTNULL in effect, and

["", "", 1, "a"]

or

["", None, 1, "a"]

with QUOTE_STRINGS in effect?
msg406015 - (view) Author: Miha Šetina (msetina) Date: 2021-11-09 12:33
I would say that it should follow a rule:
both cases values of None are output as an empty field

so empty filed should map to None with both 
QUOTE_NOTNULL and 
QUOTE_STRINGS

that would make:
["", None, 1, "a"]

for QUOTE_STRINGS in effect.
This would also make it possible to have circular usage. Export with one   quoting rule should be able to read with the same rule and produce same data.
msg406017 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-11-09 12:51
The quoting style affects not only the CSV formatting, but also the CSV parsing. How QUOTE_NOTNULL and QUOTE_STRINGS will affect parsing?
History
Date User Action Args
2021-11-09 12:51:14serhiy.storchakasetmessages: + msg406017
2021-11-09 12:33:11msetinasetmessages: + msg406015
2021-11-09 12:06:46skip.montanarosetmessages: + msg406013
2021-11-08 14:02:34skip.montanarosetmessages: + msg405951
2021-11-08 13:57:11skip.montanarosetstage: needs patch -> patch review
pull_requests: + pull_request27722
2021-09-10 18:35:17skip.montanarosetmessages: + msg401608
versions: - Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10
2021-09-10 18:28:11msetinasetnosy: + msetina

messages: + msg401607
versions: + Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10
2021-06-28 16:00:31skip.montanarosetmessages: + msg396643
2021-06-28 15:58:26skip.montanarosetresolution: rejected ->
stage: resolved -> needs patch
messages: + msg396642
versions: + Python 3.11, - Python 3.8
2021-06-28 15:57:52skip.montanarosetstatus: closed -> open

messages: + msg396641
2021-06-28 09:23:46erdnaxelisetnosy: + erdnaxeli
messages: + msg396621
2019-12-16 02:20:00yoonghmsetnosy: + yoonghm

messages: + msg358461
versions: + Python 3.8, - Python 3.6
2019-05-05 17:19:38tegdevsetnosy: + tegdev
messages: + msg341460
2016-03-03 07:09:55berker.peksagsetstatus: open -> closed

messages: + msg261147
stage: patch review -> resolved
2016-03-03 05:37:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg261146
2016-03-02 23:06:14skip.montanarosetresolution: rejected
messages: + msg261141
2016-02-28 05:13:01berker.peksagsetnosy: + berker.peksag
stage: needs patch -> patch review

versions: + Python 3.6, - Python 3.5
2014-12-15 21:17:02samwysesetmessages: + msg232681
2014-12-15 19:52:44skip.montanarosetmessages: + msg232677
2014-12-15 19:40:10samwysesetmessages: + msg232676
2014-12-14 05:45:01kryptensetfiles: + issue23041_test.patch
2014-12-14 05:14:55kryptensetfiles: + issue23041.patch

nosy: + krypten
messages: + msg232630

keywords: + patch
2014-12-13 13:05:03skip.montanarosetmessages: + msg232607
2014-12-13 06:46:50rhettingersetassignee: skip.montanaro

messages: + msg232596
nosy: + rhettinger, skip.montanaro
2014-12-12 17:27:41samwysesetmessages: + msg232563
2014-12-12 16:52:33r.david.murraysetversions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
nosy: + r.david.murray

messages: + msg232561

keywords: + easy
stage: needs patch
2014-12-12 16:36:45samwysecreate