classification
Title: csv needs more quoting rules
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: skip.montanaro Nosy List: berker.peksag, erdnaxeli, krypten, r.david.murray, rhettinger, samwyse, serhiy.storchaka, skip.montanaro, tegdev, yoonghm
Priority: normal Keywords: easy, patch

Created on 2014-12-12 16:36 by samwyse, last changed 2021-06-28 16:00 by skip.montanaro.

Files
File name Uploaded Description Edit
issue23041.patch krypten, 2014-12-14 05:14 Patch for resolving issue 23041. According to message 232560 and 232563 review
issue23041_test.patch krypten, 2014-12-14 05:45 Test cases added for testing as behaviour proposed in message 232560 review
Messages (18)
msg232560 - (view) Author: Samwyse (samwyse) Date: 2014-12-12 16:36
The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE.  These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes.  I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS.  The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field.  Examples follow.


Current behavior (which will remain unchanged)

>>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
>>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
"foo","","42"

>>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
>>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
"foo","",42


Proposed behavior

>>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
"foo",,"42"

>>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
>>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
"foo",,42
msg232561 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-12-12 16:52
As an enhancement, this could be added only to 3.5.  The proposal sounds reasonable to me.
msg232563 - (view) Author: Samwyse (samwyse) Date: 2014-12-12 17:27
David:  That's not a problem for me.

Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv.  I've looked at the code online and below are the changes that I think need to be made.  My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers.  I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing.  This means that completely empty fields will continue to load as zero-length strings, not None.  I won't stand in the way of anyone wanting to "fix" that for these new rules.



typedef enum {
    QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE,
    QUOTE_STRINGS, QUOTE_NOTNULL
} QuoteStyle;



static StyleDesc quote_styles[] = {
    { QUOTE_MINIMAL,    "QUOTE_MINIMAL" },
    { QUOTE_ALL,        "QUOTE_ALL" },
    { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" },
    { QUOTE_NONE,       "QUOTE_NONE" },
    { QUOTE_STRINGS,    "QUOTE_STRINGS" },
    { QUOTE_NOTNULL,    "QUOTE_NOTNULL" },
    { 0 }
};



        switch (dialect->quoting) {
        case QUOTE_NONNUMERIC:
            quoted = !PyNumber_Check(field);
            break;
        case QUOTE_ALL:
            quoted = 1;
            break;
        case QUOTE_STRINGS:
            quoted = PyString_Check(field);
            break;
        case QUOTE_NOTNULL:
            quoted = field != Py_None;
            break;
        default:
            quoted = 0;
            break;
        }



"        csv.QUOTE_MINIMAL means only when required, for example, when a\n"
"            field contains either the quotechar or the delimiter\n"
"        csv.QUOTE_ALL means that quotes are always placed around fields.\n"
"        csv.QUOTE_NONNUMERIC means that quotes are always placed around\n"
"            fields which do not parse as integers or floating point\n"
"            numbers.\n"
"        csv.QUOTE_STRINGS means that quotes are always placed around\n"
"            fields which are strings.  Note that the Python value None\n"
"            is not a string.\n"
"        csv.QUOTE_NOTNULL means that quotes are only placed around fields\n"
"            that are not the Python value None.\n"
msg232596 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-12-13 06:46
Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?
msg232607 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-12-13 13:05
It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?
msg232630 - (view) Author: Chaitanya agrawal (krypten) * Date: 2014-12-14 05:14
Used function PyUnicode_Check instead of PyString_Check
msg232676 - (view) Author: Samwyse (samwyse) Date: 2014-12-15 19:40
Yes, it's based on a real-world need.  I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules.  I need to create similar files for re-importation.  Right now, I have to post-process the output of my Python program to get it right.  I added in the QUOTE_STRINGS rule for completeness.  I think these two new rules would be useful for anyone wanting to create sparse CSV files.
msg232677 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2014-12-15 19:52
If I understand correctly, your software needs to distinguish between

# wrote ["foo", "", 42, None] with quote_all in effect
"foo","","42",""

and

# wrote ["foo", None, 42, ""] with quote_nonnull in effect
"foo",,"42",""

so you in effect want to transmit some type information through a CSV file?
msg232681 - (view) Author: Samwyse (samwyse) Date: 2014-12-15 21:17
Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible.  It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't.  The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings.  I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes.  If we also want to print empty strings without quotes, that wouldn't impact me at all.

Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values.  This may be an empty set, I don't know.
msg261141 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2016-03-02 23:06
Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False?

I think the better course here is to either:

* subclass csv.DictWriter, use dictionaries as your element type, and have its writerow method do the application-specific work.

* define a writerow() function which does something similar (essentially wrapping csv.writerow()).

If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.
msg261146 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-03 05:37
The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program.

If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.
msg261147 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-03-03 07:09
I was thinking adding a more flexible API like:

    ...
    spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None)
    ...

But that would require too much change in the csv module (or at least its implementation wouldn't be trivial).

I agree that subclassing DictWriter is a much better way to achieve this.
msg341460 - (view) Author: (tegdev) Date: 2019-05-05 17:19
The correct handling of None values belongs to the csv module.

There is a use case to migrate a DB2 database to PostgreSQL.
DB2 has a command line tool "db2 export ..." which produces csv-files.
A row 
  ['Hello', null, 'world'] 
is exported to
  "Hello,,"world".

I would like to read in these exports with python and put it to PostgreSQL.

But with the csv library I can't read it in correctly. The input is converted to:
  ['Hello', '', 'world']
It should read as:
  ['Hello', None, 'world']

It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow.

And last but not least: if someone writes a list the reading should the identity.
Thats not True for the csv libraray.

Example:

import csv
hello_out_lst = ['Hello', None, 'world']
with open('hello.csv', 'w') as ofh:
    writer = csv.writer(ofh, delimiter=',')
    writer.writerow(hello_out_lst)

with open('hello.csv', 'r') as ifh:
    reader = csv.reader(ifh, delimiter=',')
    for row in reader:
        hello_in_lst = row

is_equal = hello_out_lst == hello_in_lst
print(f'{hello_out_lst} is equal {hello_in_lst} ?  {is_equal}')

The result is:
['Hello', None, 'world'] is equal ['Hello', '', 'world'] ?  False
msg358461 - (view) Author: Yoong Hor Meng (yoonghm) Date: 2019-12-16 02:20
There is a real requirement for csv to handle an empty field vs a empty string """".  csv.QUOTE_NOTNULL could be useful.
msg396621 - (view) Author: Alexandre Morignot (erdnaxeli) Date: 2021-06-28 09:23
I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values.

Could we reconsider this issue and the proposed solution?
msg396641 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 15:57
Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after?

>>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
"",,"42"

? Can you modify the original two patches to restrict to QUOTE_NONNULL?
msg396642 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 15:58
Missed tweaking a couple settings.
msg396643 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2021-06-28 16:00
Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/

Not, Non, None... Perl would treat them all the same, right? <wink>
History
Date User Action Args
2021-06-28 16:00:31skip.montanarosetmessages: + msg396643
2021-06-28 15:58:26skip.montanarosetresolution: rejected ->
stage: resolved -> needs patch
messages: + msg396642
versions: + Python 3.11, - Python 3.8
2021-06-28 15:57:52skip.montanarosetstatus: closed -> open

messages: + msg396641
2021-06-28 09:23:46erdnaxelisetnosy: + erdnaxeli
messages: + msg396621
2019-12-16 02:20:00yoonghmsetnosy: + yoonghm

messages: + msg358461
versions: + Python 3.8, - Python 3.6
2019-05-05 17:19:38tegdevsetnosy: + tegdev
messages: + msg341460
2016-03-03 07:09:55berker.peksagsetstatus: open -> closed

messages: + msg261147
stage: patch review -> resolved
2016-03-03 05:37:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg261146
2016-03-02 23:06:14skip.montanarosetresolution: rejected
messages: + msg261141
2016-02-28 05:13:01berker.peksagsetnosy: + berker.peksag
stage: needs patch -> patch review

versions: + Python 3.6, - Python 3.5
2014-12-15 21:17:02samwysesetmessages: + msg232681
2014-12-15 19:52:44skip.montanarosetmessages: + msg232677
2014-12-15 19:40:10samwysesetmessages: + msg232676
2014-12-14 05:45:01kryptensetfiles: + issue23041_test.patch
2014-12-14 05:14:55kryptensetfiles: + issue23041.patch

nosy: + krypten
messages: + msg232630

keywords: + patch
2014-12-13 13:05:03skip.montanarosetmessages: + msg232607
2014-12-13 06:46:50rhettingersetassignee: skip.montanaro

messages: + msg232596
nosy: + rhettinger, skip.montanaro
2014-12-12 17:27:41samwysesetmessages: + msg232563
2014-12-12 16:52:33r.david.murraysetversions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6
nosy: + r.david.murray

messages: + msg232561

keywords: + easy
stage: needs patch
2014-12-12 16:36:45samwysecreate