Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csv writer needs more quoting rules #67230

Closed
samwyse mannequin opened this issue Dec 12, 2014 · 60 comments
Closed

csv writer needs more quoting rules #67230

samwyse mannequin opened this issue Dec 12, 2014 · 60 comments
Assignees
Labels
3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@samwyse
Copy link
Mannequin

samwyse mannequin commented Dec 12, 2014

BPO 23041
Nosy @smontanaro, @rhettinger, @bitdancer, @berkerpeksag, @serhiy-storchaka, @yoonghm, @erdnaxeli, @msetina
PRs
  • gh-67230: add quoting rules to csv module #29469
  • Files
  • issue23041.patch: Patch for resolving issue 23041. According to message 232560 and 232563
  • issue23041_test.patch: Test cases added for testing as behaviour proposed in message 232560
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/smontanaro'
    closed_at = None
    created_at = <Date 2014-12-12.16:36:45.576>
    labels = ['easy', 'type-feature', 'library', '3.11']
    title = 'csv needs more quoting rules'
    updated_at = <Date 2022-02-24.00:07:17.586>
    user = 'https://bugs.python.org/samwyse'

    bugs.python.org fields:

    activity = <Date 2022-02-24.00:07:17.586>
    actor = 'samwyse'
    assignee = 'skip.montanaro'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2014-12-12.16:36:45.576>
    creator = 'samwyse'
    dependencies = []
    files = ['37444', '37445']
    hgrepos = []
    issue_num = 23041
    keywords = ['patch', 'easy']
    message_count = 26.0
    messages = ['232560', '232561', '232563', '232596', '232607', '232630', '232676', '232677', '232681', '261141', '261146', '261147', '341460', '358461', '396621', '396641', '396642', '396643', '401607', '401608', '405951', '406013', '406015', '406017', '412463', '413867']
    nosy_count = 11.0
    nosy_names = ['skip.montanaro', 'rhettinger', 'samwyse', 'r.david.murray', 'berker.peksag', 'serhiy.storchaka', 'krypten', 'yoonghm', 'tegdev', 'erdnaxeli', 'msetina']
    pr_nums = ['29469']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23041'
    versions = ['Python 3.11']

    Linked PRs

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 12, 2014

    The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow.

    Current behavior (which will remain unchanged)

    >>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
    >>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
    "foo","","42"
    
    >>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
    >>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
    "foo","",42

    Proposed behavior

    >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
    >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
    "foo",,"42"
    
    >>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
    >>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
    "foo",,42

    @samwyse samwyse mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Dec 12, 2014
    @bitdancer
    Copy link
    Member

    As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me.

    @bitdancer bitdancer added the easy label Dec 12, 2014
    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 12, 2014

    David: That's not a problem for me.

    Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules.

    typedef enum {
    QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC, QUOTE_NONE,
    QUOTE_STRINGS, QUOTE_NOTNULL
    } QuoteStyle;

    static StyleDesc quote_styles[] = {
    { QUOTE_MINIMAL, "QUOTE_MINIMAL" },
    { QUOTE_ALL, "QUOTE_ALL" },
    { QUOTE_NONNUMERIC, "QUOTE_NONNUMERIC" },
    { QUOTE_NONE, "QUOTE_NONE" },
    { QUOTE_STRINGS, "QUOTE_STRINGS" },
    { QUOTE_NOTNULL, "QUOTE_NOTNULL" },
    { 0 }
    };

            switch (dialect->quoting) {
            case QUOTE_NONNUMERIC:
                quoted = !PyNumber_Check(field);
                break;
            case QUOTE_ALL:
                quoted = 1;
                break;
            case QUOTE_STRINGS:
                quoted = PyString_Check(field);
                break;
            case QUOTE_NOTNULL:
                quoted = field != Py_None;
                break;
            default:
                quoted = 0;
                break;
            }

    " csv.QUOTE_MINIMAL means only when required, for example, when a\n"
    " field contains either the quotechar or the delimiter\n"
    " csv.QUOTE_ALL means that quotes are always placed around fields.\n"
    " csv.QUOTE_NONNUMERIC means that quotes are always placed around\n"
    " fields which do not parse as integers or floating point\n"
    " numbers.\n"
    " csv.QUOTE_STRINGS means that quotes are always placed around\n"
    " fields which are strings. Note that the Python value None\n"
    " is not a string.\n"
    " csv.QUOTE_NOTNULL means that quotes are only placed around fields\n"
    " that are not the Python value None.\n"

    @rhettinger
    Copy link
    Contributor

    Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module?

    @smontanaro
    Copy link
    Contributor

    It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient?

    @krypten
    Copy link
    Mannequin

    krypten mannequin commented Dec 14, 2014

    Used function PyUnicode_Check instead of PyString_Check

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 15, 2014

    Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files.

    @smontanaro
    Copy link
    Contributor

    If I understand correctly, your software needs to distinguish between

    # wrote ["foo", "", 42, None] with quote_all in effect
    "foo","","42",""

    and

    # wrote ["foo", None, 42, ""] with quote_nonnull in effect
    "foo",,"42",""

    so you in effect want to transmit some type information through a CSV file?

    @samwyse
    Copy link
    Mannequin Author

    samwyse mannequin commented Dec 15, 2014

    Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all.

    Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know.

    @smontanaro
    Copy link
    Contributor

    Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False?

    I think the better course here is to either:

    • subclass csv.DictWriter, use dictionaries as your element type, and have its writerow method do the application-specific work.

    • define a writerow() function which does something similar (essentially wrapping csv.writerow()).

    If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself.

    @serhiy-storchaka
    Copy link
    Member

    The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program.

    If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this.

    @berkerpeksag
    Copy link
    Member

    I was thinking adding a more flexible API like:

    ...
    spamwriter = csv.writer(csvfile, quoting_callable=lambda field: field is not None)
    ...
    

    But that would require too much change in the csv module (or at least its implementation wouldn't be trivial).

    I agree that subclassing DictWriter is a much better way to achieve this.

    @tegdev
    Copy link
    Mannequin

    tegdev mannequin commented May 5, 2019

    The correct handling of None values belongs to the csv module.

    There is a use case to migrate a DB2 database to PostgreSQL.
    DB2 has a command line tool "db2 export ..." which produces csv-files.
    A row
    ['Hello', null, 'world']
    is exported to
    "Hello,,"world".

    I would like to read in these exports with python and put it to PostgreSQL.

    But with the csv library I can't read it in correctly. The input is converted to:
    ['Hello', '', 'world']
    It should read as:
    ['Hello', None, 'world']

    It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow.

    And last but not least: if someone writes a list the reading should the identity.
    Thats not True for the csv libraray.

    Example:

    import csv
    hello_out_lst = ['Hello', None, 'world']
    with open('hello.csv', 'w') as ofh:
        writer = csv.writer(ofh, delimiter=',')
        writer.writerow(hello_out_lst)
    
    with open('hello.csv', 'r') as ifh:
        reader = csv.reader(ifh, delimiter=',')
        for row in reader:
            hello_in_lst = row
    
    is_equal = hello_out_lst == hello_in_lst
    print(f'{hello_out_lst} is equal {hello_in_lst} ?  {is_equal}')

    The result is:
    ['Hello', None, 'world'] is equal ['Hello', '', 'world'] ? False

    @yoonghm
    Copy link
    Mannequin

    yoonghm mannequin commented Dec 16, 2019

    There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful.

    @yoonghm yoonghm mannequin added the 3.8 only security fixes label Dec 16, 2019
    @erdnaxeli
    Copy link
    Mannequin

    erdnaxeli mannequin commented Jun 28, 2021

    I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values.

    Could we reconsider this issue and the proposed solution?

    @smontanaro
    Copy link
    Contributor

    Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after?

    >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
    >>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
    "",,"42"

    ? Can you modify the original two patches to restrict to QUOTE_NONNULL?

    @smontanaro smontanaro reopened this Jun 28, 2021
    @smontanaro
    Copy link
    Contributor

    Missed tweaking a couple settings.

    @smontanaro smontanaro added 3.11 only security fixes and removed 3.8 only security fixes labels Jun 28, 2021
    @smontanaro
    Copy link
    Contributor

    Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/

    Not, Non, None... Perl would treat them all the same, right? <wink>

    @msetina
    Copy link
    Mannequin

    msetina mannequin commented Sep 10, 2021

    The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data.
    The csv.QUOTE_STRINGS would enable a smaller file size.
    Our case is that we are moving data into Synapse and have to resort to risk prone string replace functionality to provide the MS Synapse COPY a proper support for empty string and NULL

    @msetina msetina mannequin added 3.7 (EOL) end of life 3.8 only security fixes labels Sep 10, 2021
    @smontanaro
    Copy link
    Contributor

    Python's csv module always returns strings. The Pandas CSV file read does infer types though.

    % cat names.csv
    first_name,last_name,age
    Eric,Idle,47
    John,Cleese,99
    % python
    Python 3.12.0a7+ (heads/main-dirty:330a942b63, Apr 14 2023, 06:59:05) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import csv
    >>> for row in csv.reader(open("names.csv")):
    ...   print(row)
    ... 
    ['first_name', 'last_name', 'age']
    ['Eric', 'Idle', '47']
    ['John', 'Cleese', '99']
    >>> ^D
    

    @msetina
    Copy link

    msetina commented Apr 16, 2023

    If I can add my pinch of salt. Here we colide data and programming world. In data world None is a value of a type. In Python None is a separate type.
    Using a paradigm from data world Oracle will equate empty string in strings with None. MS would not.
    On the reader side if you have a CSV to read and you want to preserve NULL values normally in non quoted there would be a special character combination (^N or something) to represent a NULL, so reader would be able to parse it.
    Then came Microsoft where someone was not checking with others and used quoted values which are more expensive and thought that it was a good idea to use empty field for None. And for this a reader needs to infere None.
    In Python using non quoted values in CSV is good, but there is an if, when users encounter MS produced CSV and want None values preserved. Maybe MS is not the only one, but they stepped on my toe.

    @smontanaro
    Copy link
    Contributor

    I certainly understand the desire to interoperate with the MS CSV ecosystem, as that has traditionally been the de facto "spec" for what it means to read and write CSV files. Still, I think the topic would be too involved to deal with properly between now and May 1. I propose limiting the change for now to just the writer, fixing any doc bits which suggest there are any changes to the reader, then going through the usual PEP process to investigate making changes to the semantics of the reader.

    aisk pushed a commit to aisk/cpython that referenced this issue Apr 18, 2023
    Add two quoting styles for csv dialects.
    They will help to work with certain databases in particular.
    
    Automerge-Triggered-By: GH:merwok
    @smontanaro
    Copy link
    Contributor

    Would someone like to work on a PEP regarding type conversion in the reader?

    carljm added a commit to carljm/cpython that referenced this issue Apr 20, 2023
    * main: (24 commits)
      pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561)
      pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545)
      pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634)
      pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633)
      pythongh-102856: Initial implementation of PEP 701 (python#102855)
      pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589)
      pythongh-83004: Harden msvcrt further (python#103420)
      pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626)
      pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314)
      pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586)
      pythongh-103583: Always pass multibyte codec structs as const (python#103588)
      pythongh-103617: Fix compiler warning in _iomodule.c (python#103618)
      pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600)
      pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576)
      pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613)
      [Doc] Fix a typo in optparse.rst (python#103504)
      pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531)
      pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039)
      pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569)
      pythongh-67230: update whatsnew note for csv changes (python#103598)
      ...
    @hugovk hugovk removed the easy label Sep 8, 2023
    @erlend-aasland erlend-aasland added 3.13 new features, bugs and security fixes and removed 3.12 bugs and security fixes labels Jan 5, 2024
    @erlend-aasland
    Copy link
    Contributor

    I'll try to summarise the state of this issue:

    The proposed feature was to add more quoting rules to the csv writer. This has been implemented (with docs and tests)1. @serhiy-storchaka requested in #67230 (comment) that the issue be held open:

    Wait, #29469 only changes writer, not reader. There are no tests for reader with new quoting styles, so we cannot be sure that nothing is broken. [...]

    @samwyse remarked in #67230 (comment):

    [..] I can't really see an easy way [for csv reader] to distinguish between «,,» and «,"",». So unless someone more knowledgeable than I thinks that it's simple, I propose that we punt this down the road, making the two new quoting rules illegal for reader objects for the time being. [...]

    @smontanaro #67230 (comment):

    Thinking more about the problem, I think it's wrong to do anything to the reader. In its entire history, the reader has never done any sort of type inference. That would be exactly what distinguishing an empty field from a quoted empty field would be, a (small) bit of type inference. [...]

    The better course at this point would be to adjust the docs to state that the two new quoting rules have no effect in the reader. [...]

    @smontanaro #67230 (comment):

    Would someone like to work on a PEP regarding type conversion in the reader?

    I suggest to close this as completed (the csv writer feature was implemented) and to open up a new issue regarding what to do with the csv reader.

    Footnotes

    1. we should amend the issue title to reflect that the feature request was for the csv writer only

    @erlend-aasland erlend-aasland changed the title csv needs more quoting rules csv writer needs more quoting rules Jan 5, 2024
    @serhiy-storchaka
    Copy link
    Member

    the reader has never done any sort of type inference.

    But it does.

    >>> next(csv.reader(['123,"123"']))
    ['123', '123']
    >>> next(csv.reader(['123,"123"'], quoting=csv.QUOTE_NONNUMERIC))
    [123.0, '123']
    

    If quoting style is QUOTE_NONNUMERIC, it returns non-quoted field as numeric. I expect the same for None.

    @erlend-aasland
    Copy link
    Contributor

    As I already wrote, I suggest to summarise the needed reader changes in a new issue.

    @smontanaro
    Copy link
    Contributor

    the reader has never done any sort of type inference.

    But it does.

    >>> next(csv.reader(['123,"123"']))
    ['123', '123']
    >>> next(csv.reader(['123,"123"'], quoting=csv.QUOTE_NONNUMERIC))
    [123.0, '123']
    

    If quoting style is QUOTE_NONNUMERIC, it returns non-quoted field as numeric. I expect the same for None.

    That's not really inferring the type in my book. It's effectively being told by the programmer that unquoted numeric strings are to be returned as numbers. If it was inferring the type, your first example would also spit out a float as the first value.

    @serhiy-storchaka
    Copy link
    Member

    Right, and I suggest that QUOTE_NOTNULL is told by the programmer that unquoted empty strings are to be returned as None.

    @erlend-aasland
    Copy link
    Contributor

    Right, and I suggest that QUOTE_NOTNULL is told by the programmer that unquoted empty strings are to be returned as None.

    Can you please suggest this csv reader change in a new issue?

    @serhiy-storchaka
    Copy link
    Member

    It is perhaps too late to make this change in 3.12, so we can open a new issue.

    @serhiy-storchaka
    Copy link
    Member

    Opened #113732.

    terryjreedy pushed a commit that referenced this issue Feb 1, 2024
    …114816)
    
    As @GPHemsley pointed out, #29469 omitted `versionadded` notes for the 2 new items.
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Feb 1, 2024
    …RINGS (pythonGH-114816)
    
    As @GPHemsley pointed out, pythonGH-29469 omitted `versionadded` notes for the 2 new items.
    (cherry picked from commit 586057e)
    
    Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
    terryjreedy pushed a commit that referenced this issue Feb 1, 2024
    …TRINGS (GH-114816) (#114840)
    
    As @GPHemsley pointed out, GH-29469 omitted `versionadded` notes for the 2 new items.
    (cherry picked from commit 586057e)
    
    Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
    aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024
    …RINGS (python#114816)
    
    As @GPHemsley pointed out, python#29469 omitted `versionadded` notes for the 2 new items.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.13 new features, bugs and security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    Status: Done
    Development

    No branches or pull requests