Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV Null Byte Error #71767

Closed
bobbyocean mannequin opened this issue Jul 21, 2016 · 19 comments
Closed

CSV Null Byte Error #71767

bobbyocean mannequin opened this issue Jul 21, 2016 · 19 comments
Labels
3.11 only security fixes extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@bobbyocean
Copy link
Mannequin

bobbyocean mannequin commented Jul 21, 2016

BPO 27580
Nosy @smontanaro, @bitdancer, @encukou, @serhiy-storchaka, @remilapeyre, @danieldjewell
PRs
  • bpo-27580: Add support of null characters in csv #28808
  • Files
  • nul.csv
  • Opening_CSV.png
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-10-09.16:18:43.515>
    created_at = <Date 2016-07-21.05:02:48.043>
    labels = ['extension-modules', 'type-bug', '3.11']
    title = 'CSV Null Byte Error'
    updated_at = <Date 2021-11-26.12:23:14.907>
    user = 'https://bugs.python.org/bobbyocean'

    bugs.python.org fields:

    activity = <Date 2021-11-26.12:23:14.907>
    actor = 'petr.viktorin'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-10-09.16:18:43.515>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules']
    creation = <Date 2016-07-21.05:02:48.043>
    creator = 'bobbyocean'
    dependencies = []
    files = ['43810', '43822']
    hgrepos = []
    issue_num = 27580
    keywords = ['patch']
    message_count = 19.0
    messages = ['270908', '270923', '270926', '270939', '270945', '270948', '270951', '270959', '270964', '370359', '370370', '370373', '370376', '370379', '370384', '403431', '403541', '407044', '407045']
    nosy_count = 7.0
    nosy_names = ['skip.montanaro', 'r.david.murray', 'petr.viktorin', 'serhiy.storchaka', 'bobbyocean', 'remi.lapeyre', 'danieljewell']
    pr_nums = ['28808']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27580'
    versions = ['Python 3.11']

    @bobbyocean
    Copy link
    Mannequin Author

    bobbyocean mannequin commented Jul 21, 2016

    I think this has been asked before, but it has been awhile and I think needs to be re-addressed.

    Why doesn't the CSV module handle NULL bytes?

    Let me do some examples,

    ascii = [chr(n) for n in range(256)] #No errors
    print(ascii) #No errors
    print(dict(zip(ascii,ascii))) #No errors

    open('temp','r').writelines(ascii) #No errors
    f=open('temp','r')
    for line in f: print(line) #No errors
    f.close()

    Python hsndles every ascii chr, DEL, ESC, RETURN, etc. It displays those characters, uses them as keys or values, etc.

    But now try,

    import csv
    data = csv.DictReader(ascii) 
    for line in data: print(line) #NULL Byte Error

    But we can quick fix this easily,

    ascii.pop(0)
    data = csv.DictReader(ascii)
    for line in data: print(line) #No Errors

    Doesn't it seem odd that we allow the use of every chr as keys, vslues, prints, etc. but then we hold a special exception for using the csv module and yhat special exception is not the ESC or DEL or any other non-printable chr, the exception is for the NULL?

    @bobbyocean bobbyocean mannequin added type-feature A feature request or enhancement extension-modules C modules in the Modules dir labels Jul 21, 2016
    @bitdancer
    Copy link
    Member

    By "discussed before" I presume you are referring to bpo-1599055.

    It is true that have been changes since then in Python's handling of null bytes. Perhaps it is indeed time to revisit this. I'll leave that to the experts...this can be closed as a duplicate of bpo-1599055 if I'm wrong about things having possibly changed in the interim.

    @smontanaro
    Copy link
    Contributor

    Beyond whether or not the csv module can handle NUL bytes, you might figure out if Excel will. Since the CSV format isn't some sort of "standard", its operational definition has always been what Excel will produce or consume.

    I don't have Excel (not a Windows guy), but I do have Gnumeric and LibreOffice. I constructed a simple CSV file by hand which contains several NUL bytes to see what they would do. Gnumeric pops up a dialog and converts them to spaces (and then, oddly enough, doesn't think the file has been modified). LibreOffice didn't complain while loading the file, but when I saved it, it silently deleted the NULs.

    I've attached the file should anyone like to experiment with other spreadsheets.

    @bobbyocean
    Copy link
    Mannequin Author

    bobbyocean mannequin commented Jul 21, 2016

    @ Skip Montanaro,

    Actually, CSV has nothing to do with excel. Excel is a gui that processes CSV like many other programs. CSV stands for comma seperated values and is basic standard for data.https://en.m.wikipedia.org/wiki/Comma-separated_values

    As far as I know, many programs handle NULL bytes fine (either as a empty atring or a # or something else). In any case, it really should be irrelevent how some particular programs handle a CSV file.

    @bitdancer
    Copy link
    Member

    Bobby: Skip is one of the authors of the csv module, and has been maintaining it since it was added to the standard library. He knows whereof he speaks: there is no standard for csv (as noted in the article you link), and all csv parsers that want to be interoperable refer back to Microsoft's implementation when dealing with any quirks. That implementation currently is Excel.

    That said, your are right that others have adopted the format, and there is an argument to be made that we don't have to *limit* ourselves to what Microsoft supports. Although we probably don't want to be emitting stuff that they don't without being explicit about it.

    @bobbyocean
    Copy link
    Mannequin Author

    bobbyocean mannequin commented Jul 21, 2016

    I am sorry I must have mis-read the history part of that article.

    @smontanaro
    Copy link
    Contributor

    I wasn't familiar with RFC 4180. (Or, quite possibly, I forgot I used to be familiar with it.) Note that at the bottom of the BNF definition of the file structure is the definition of TEXTDATA:

    TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

    That pretty explicitly excludes NUL bytes (%x00 I think).

    I'd still like someone load my nul.csv file into Excel and report back what it does (or doesn't) do with it.

    @bobbyocean
    Copy link
    Mannequin Author

    bobbyocean mannequin commented Jul 21, 2016

    I attempted to open the file in excel; raised no errors.

    In any case (regardless of Microsoft-concerns), I am glad to see a discussion started and possibly some concern that an update might be useful to the community (it would certainly cut down on the number of stack-exchange posts about this very topic). I certainly would put my vote to have csv handle the NULL byte, in the same way as python does natively.

    Thanks for you time. Oh, and since you are one of the author's, thanks for writing/working on the csv module, VERY USEFUL. :-)

    @smontanaro
    Copy link
    Contributor

    Thanks. The display you showed looks about like I saw in LibreOffice. If you export it back to another CSV file, does the new file match the original exactly, or does (like LibreOffice) it save a file without NUL bytes?

    I don't mind having the discussion, but even though we have traditionally treated CSV files as binary files in Python (at least when I was closely involved in the 2.x days), that was mostly so end-of-line sequences weren't corrupted. As others have pointed out, in 2.x Python String objects stored the data as a normal NUL-terminated pointer-to-char for efficiency when interacting with C libraries. C uses NUL as a string terminator, so we couldn't work with embedded NULs. I haven't looked at the 3.x string stuff (I know Unicode is much more intimately involved). If it still maintains that close working relationship with the typical C strings, supporting NUL bytes will be problematic.

    In cases where the underlying representation isn't quite what I want, I've been able to get away with a file wrapper which suitably mangles the input before passing it up the chain to the csv module. For example, the __next__ method of your file wrapper could delete NULs or replace them with something suitably innocuous, like "\001", or some other non-printable character you are certain won't be in the input. If you want to preserve NULs, reverse the translation during the write().

    @danieldjewell
    Copy link
    Mannequin

    danieldjewell mannequin commented May 30, 2020

    Forgive my frustration, but @Skip I really don't see how the definition of CSV relating to Excel (or Gnumeric or LibreOffice) has any relevance as to whether or not the module (and perhaps Python more generally) supports chr(0x00) as a delimiter. (Neither you nor I get to decide how someone else might write output data...)

    While the module is called CSV, it's really not just *Comma* Separated Values - rather it's a rough approximation of a database table with an optional header row where rows/records are separated by <record separator> and fields are separated by <field separator>. Sometimes the record separator is chr(0x2c) (e.g. a comma) sometimes it's chr(0x09) (e.g. a tab - or in ASCII parlance "Horizontal Tab/HT") ... or maybe even the actual ASCII "Record Separator" character (e.g. chr(0x1e)) ... or maybe NUL chr(0x00).

    (1) The module should be 100% agnostic about the separator - the current (3.8.3) error text when trying to use csv.reader(..., delimiter=chr(0x00)) is "TypeError: "delimiter" must be a 1-character string" ... well, chr(0x00) *is* a 1-character string. It's not a 1-character *printable* string... But then again neither is chr(0x1e) (ASCII "RS" Record Separator) .. and csv.reader(..., delimiter=chr(0x1e)) appears to work (I haven't tried actual data yet).

    (1a) The use of chr(0x00) or '\0' is used quite often in the *NIX world as a convenient record separator that doesn't have escaping problems because by it's very nature it's non-printable. e.g. "find . -iname "*something*" -print0 | xargs -0 <program>" ...

    As to the difficulty in handling 0x00 characters, I dunno ... it appears that GNU find, xargs, gawk... same with FreeBSD. FreeBSD writes the output for "-print0" like this: https://github.com/freebsd/freebsd/blob/508f3673dec94b03f89b9ce9569390d6d9b86a89/usr.bin/find/function.c#L1383 ... and bsd xargs handles it too. I haven't looked at the CPython source to see what's going on - it might be tricky to modify the code to support this... (but then again, IMHO, this sort of thing should have been a consideration in the first place....)

    I suppose in many ways, the very existence of this specific issue at all is just one example of what seems to be a larger issue with Python's overall development: It's a great language for *many* things and in many ways. But I've run into so many little fringe "gotchas" where something doesn't work or is limited in some way because, seemingly, functionality is designed around/defined by a practical-example-use-case and not what is or might be *possible* (e.g. the CSV-as-only-a-spreadsheet-interface example -- and I really *don't* mean that as a personal attack @Skip - I am very appreciative of the time and effort you and everyone has poured into the project...) Is it possible to write a NUL (0x00) character to a file? Through a *NIX pipe? You bet.

    (I got a little rant-y .. sorry... I'm sure there's a _lot_ more going on underneath the covers and there are a lot of factors - not limited to just the csv module - as you mentioned. I just really feel like something is "off". Maybe it's my brain - ha. :))

    @danieldjewell danieldjewell mannequin added 3.7 (EOL) end of life 3.8 only security fixes type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels May 30, 2020
    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented May 30, 2020

    I feel like the CSV module could support using NULL as separator and could send a PR with some test to add this but Daniel could you give more information regarding why you need explicit CSV support for this?

    The CSV parser is needed to read and write CSV files to can be produced by other programs and has been extended to support the various dialects, most of them produced by spreadsheet programs.

    The use of chr(0x00) or '\0' is used quite often in the *NIX world as a convenient record separator that doesn't have escaping problems because by it's very nature it's non-printable. e.g. "find . -iname "*something*" -print0 | xargs -0 <program>" ...

    As to the difficulty in handling 0x00 characters, I dunno ... it appears that GNU find, xargs, gawk...

    Is it possible to write a NUL (0x00) character to a file? Through a *NIX pipe? You bet.

    Yes and all of those are supported natively by Python already:

    find . -type f -depth 1 -print0 | python -c "from pprint import pp; pp(input().split('\x00'))"
    

    Multilines input can be supported as easily doing:

        data = [
            line.split('\x00')
            for line in input()
        ]

    Is this not enough?

    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented May 30, 2020

    Writing to those files is obviously as easy, since like you said "because by it's very nature it's non-printable" and you will probably not find it in your data:

    with open('file', 'w') as f:
        f.write('\x00'.join(data))

    It will break if there is a NULL byte in your data, and CSV would quote the element properly instead, but so would "find . -iname "*something*" -print0 | xargs -0 <program>" if one of the file had a NULL byte in their name.

    I don't think Python is being unreasonable here, especially considering it has the same drawbacks as the other Unix utilities.
    If those solutions to read and write NULL separated files are not enough for your use-case, please give more information so that it can be supported.

    @smontanaro
    Copy link
    Contributor

    I'm sorry, but why is this issue coming up again after nearly four years? Especially without a patch? (I apologize. I've gotten a bit more grumpy as I've aged.) Let me summarize a bit of history.

    Back in the early 2000s, Dave Cole at Object Craft in Australia implemented a C-based CSV module for Python. I don't know exactly what version was his initial target, but I have a vague memory that it was Python 1.5-ish. A few years later that was cleaned up and adapted for inclusion into the Python standard library, being added to Python 2.3. I helped with some of the Python stuff and steered it into the core. Much later (Python 3.x - with the great Unicode unification), it acquired Unicode support. At the time of creation, the only "standard" for CSV files was what Excel could read and write. One of the initial requirements as I recall was that CSV files generated by the module had to be able to survive a round trip to Excel and back. As I indicated in at least one of my previous messages to this thread, knowing how Excel handles CSV files with NUL bytes would be (at least) interesting. I still think so. Can anyone test that?

    I'm not trying to suggest that gracefully handling NUL bytes wouldn't be useful in certain contexts (I also use find with -print0 routinely to preserve filenames with spaces), and if the CSV module was being written from scratch today, perhaps NUL support would be included. I am happy with the current CSV module as it exists today. I still use it routinely. Adding NUL support wouldn't scratch any itch I have. If you want NUL byte support, either as a delimiter or as a character in a cell, I think you're going to have to submit a patch. Otherwise I suggest this be closed rather than languishing for another four years.

    @smontanaro
    Copy link
    Contributor

    Looks like my comment removed Remi from the nosy list. Restoring that...

    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented May 30, 2020

    Hi Skip, I'm on a Mac so I tried Numbers, the spreadsheet tool that comes with MacOS:

    Opening and exporting to CSV again works:

    ➜ ~ cat -v ~/Documents/nul.csv
    abc;def;1234.5;^@^M
    a-c;d^@f;1234.5;^@^M
    ABC;DEF;1.5;^@

    When exporting to TSV, it makes the correct changes:

    ➜ ~ cat -v ~/Documents/nul.tsv
    abc def 1234.5 ^M
    a-c d^@f 1234.5 ^@^M
    ABC DEF 1.5 ^@

    @serhiy-storchaka
    Copy link
    Member

    It looks to me that rejecting the null character is just an implementation artifact. 0 is used both as signalling value for "not set" character parameter and as the end-of-line character in the state automata. We can use different values outside of the range of Unicode characters (0-0x10ffff). It will automatically enable support of null characters.

    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels Oct 7, 2021
    @serhiy-storchaka
    Copy link
    Member

    New changeset b454e8e by Serhiy Storchaka in branch 'main':
    bpo-27580: Add support of null characters in the csv module. (GH-28808)
    b454e8e

    @encukou
    Copy link
    Member

    encukou commented Nov 26, 2021

    The NUL check was around for a long time, available to be used (XKCD-1172 style) as a simple check against reading binary files. The change did break tests of distutils.
    Maybe it deserves a What's New entry?

    @encukou
    Copy link
    Member

    encukou commented Nov 26, 2021

    *docutils, not distutils

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    feanil added a commit to openedx/edx-platform that referenced this issue Mar 15, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    feanil added a commit to openedx/edx-platform that referenced this issue Apr 3, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    feanil added a commit to openedx/edx-platform that referenced this issue Apr 9, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    feanil added a commit to openedx/edx-platform that referenced this issue Apr 11, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    feanil added a commit to openedx/edx-platform that referenced this issue Apr 12, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    feanil added a commit to openedx/edx-platform that referenced this issue Apr 18, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    KyryloKireiev pushed a commit to raccoongang/edx-platform that referenced this issue Apr 24, 2024
    In Python 3.11 CSV files are allowed to have null characters so the test
    data is no longer a valid. We update it to not have a valid unicode
    character to still test this code path correctly.
    
    I'm not concerned about the fact that the files with null will get past
    this test beacause there are other checks on the content of the file
    that catch if it doesn't have enough or the right fields so this should
    be a safe change to make to the tests.
    
    Relevant Change in Python: python/cpython#71767
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants