classification
Title: csv reader chokes on bad quoting in large files
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Mariatta, keef604, peter.otten, r.david.murray, rhettinger, terry.reedy
Priority: normal Keywords:

Created on 2017-04-10 21:50 by keef604, last changed 2017-04-15 13:24 by keef604. This issue is now closed.

Messages (12)
msg291453 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-10 21:50
If a csv file has a quote character at the beginning of a field but no closing quote, the csv module will keep reading the file until the very end in an attempt to close out the field.  It's true this situation occurs only when the quoting in a csv file is incorrect, but it would be extremely helpful if the csv reader could be told to stop reading each row of fields when it encounters a newline character, even if it is within a quoted field at the time.  At the moment, with large files, the csv reader will typically error out in this situation once it reads the maximum size of a string.  Furthermore, this is not an easy situation to trap with custom code.

Here's an example of the what I'm talking about.  For a csv file with the following content:
a,b,c
d,"e,f
g,h,i

This code:

    import csv
    with open('file.txt') as f:
        reader = csv.reader(f)
        for row in reader:
            print(row)

returns:
['a', 'b', 'c']
['d', 'e,f\ng,h,i\n']

Note that the whole of the file after "e", including delimiters and newlines, has been added to the second field on the second line. This is correct csv behavior but is very unhelpful to me in this situation.

On the grounds that most csv files do not have multiline values within them, perhaps a new dialect attribute called "multiline" could be added to the csv module, that defaults to True for backwards compatibility.  It would indicate whether the csv file has any field values within it that span more than one line.  If multiline is False, then the "parse_process_char" function in "_csv" would always close out a row of fields when it encounters a newline character.  It might be best if this multiline attribute were taken into account only when "strict" is False.

Right now, I do get badly-formatted files like this, and I cannot ask the source for a new file.  I have to manually correct the file using a mixture of custom scripts and vi before the csv module will read it. It would be very helpful if csv would handle this directly.
msg291454 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-10 22:06
Perhaps I should add what I would prefer the csv reader to return in my example above.  That would be:

['a', 'b', 'c']
['d', 'e,f']
['g', 'h', 'i']

Yes, the second line is still mangled but at least the csv reader would carry on and read the third line correctly.
msg291455 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-04-10 22:15
In my experience CSV files with fields with embedded newlines are pretty common.  I don't really think we want to support invalid CSV files.
msg291456 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-10 23:09
The csv reader already handles a certain amount of bad formatting.  For example, using default behavior, the following file:

a,b,c
d,"e"X,f
g,h,i

is read as:
['a', 'b', 'c']
['d', 'eX', 'f']
['g', 'h', 'i']

It seems reasonable that csv should be able to handle delimited files that are not perfectly formatted.  After all, even the CSV "standard" isn't really a standard.  When dealing with large (10GB+) files, it's a pain if csv cannot read the file because of just one misplaced quote character.  Besides, data files are only going to get bigger.

Also, I have to say, I've been dealing with large ETL jobs for over 15 years now and I'm struggling to think of a time when I've ever seen a multiline CSV file.  Of course, we've all have different experiences.
msg291457 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-04-11 00:33
Well, ETL is semi-standardized.  Try dealing with csv files exported from excel spreadsheets written by non-programmers :)

"e"X is not a quoting the csv module will produce, but I don't think it is a csv error.  insofar as csv has a standard, it is a defacto standard based on what Microsoft tools do.  What does excel, for example, do with either of those examples?
msg291458 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-11 01:11
As you say, David, however much we would like the world to stick to a given CSV standard, the reality is that people don't, which is all the more reason for making the csv reader flexible and forgiving.

The csv module can and should be used for more than just "comma-separated-values" files.  I use it for all sorts of different delimited files, and it works very well.  Pandas uses it, as I'm sure do many other packages.  It's such a good module, it would be a pity to restrict its scope to just Excel-related scenarios.  Parsing delimited files is undoubtedly complex, and painfully slow if done with pure Python, so the more that can be done in C the better.

I'm no C programmer, but my guesstimate is that the coding changes I'm proposing are relatively modest.  In the IN_QUOTED_FIELD section (https://github.com/python/cpython/blob/master/Modules/_csv.c#L690), it would mean checking for newline characters if the new "multiline" attribute is False (and probably "strict" is False too).  Of course there is more to this change than just that, but I'm guessing not that much more.
msg291460 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-11 04:37
> In my experience CSV files with fields with embedded newlines 
> are pretty common.  I don't really think we want to support
> invalid CSV files.

I concur with David on both points.   Also, whether common or not, we don't want to break existing code that already works.

I recommend marking this as rejected and closing.
msg291462 - (view) Author: Peter Otten (peter.otten) * Date: 2017-04-11 07:39
While I don't think that the csv module should second-guess broken input you might consider "fixing" your data on the fly:

def close_quote(line):
    if line.count('"') % 2:
        line = line.rstrip("\n") + '"\n'
    return line

with open("data.csv") as f:
    for row in csv.reader(map(close_quote, f)):
        print(row)

That should give the desired output.
msg291495 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-11 13:52
The csv reader already supports bad CSV - that's what I believe "strict" is for - but only in one specific scenario.  My request is to make that "strict" attribute a bit more useful.

Thank you for your suggestion, Peter.  I have toyed with the idea of looking for an even number of double quotes in each line, but thank you for your neat way of encapsulating it.  (I already have to strip null bytes out of the input data because they break csv, see issue #27580).
msg291498 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-11 14:53
I should have said, Peter, an odd number of quotes does not necessarily mean the quoting is bad.  For example, a line of:
a,b",c
will parse fine as ['a', 'b"', 'c'].  Figuring out bad quoting is not easy, but if we know that there are no multiline fields in the file, then at least the parsing can stop at the end of the line.
msg291700 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-04-15 05:05
Keith, while I sympathize with the request, I am going to agree with the others and close this.  As I see it, csv exists to solve a particular problem.  We want a printable file of records and text fields with visible field and record separators, but we may want to use those separators within fields.  So we add a quote character as an auxiliary delimiter and ignore separators within quotes.  Files with mismatched quotes are broken.

You are willing to append a quote, even though that may be the wrong thing to do.  Alternatives include: skip the line, perhaps after logging it; delete the last quote to make the count even; provide a more sophisticated auto-editing function; or display a GUI entry box initialized to the bad line and request the user to edit until quotes are matched.  I don't think we should pick any of these.
msg291714 - (view) Author: Keith Erskine (keef604) * Date: 2017-04-15 13:24
OK Terry.  Thank you everybody for your thoughts and suggestions.
History
Date User Action Args
2017-04-15 13:24:14keef604setmessages: + msg291714
2017-04-15 05:05:51terry.reedysetstatus: open -> closed

nosy: + terry.reedy
messages: + msg291700

resolution: rejected
stage: resolved
2017-04-11 14:53:10keef604setmessages: + msg291498
2017-04-11 13:52:16keef604setmessages: + msg291495
2017-04-11 07:39:06peter.ottensetnosy: + peter.otten
messages: + msg291462
2017-04-11 04:37:33rhettingersetnosy: + rhettinger
messages: + msg291460
2017-04-11 01:11:59keef604setmessages: + msg291458
2017-04-11 00:33:19r.david.murraysetmessages: + msg291457
2017-04-10 23:09:11keef604setmessages: + msg291456
2017-04-10 22:15:01r.david.murraysetnosy: + r.david.murray
messages: + msg291455
2017-04-10 22:06:54keef604setmessages: + msg291454
2017-04-10 21:53:37Mariattasetnosy: + Mariatta
components: + Library (Lib)
2017-04-10 21:50:25keef604create