classification
Title: csv reader losing rows with big files and tab delimiter
Stage: resolved
Components: Library (Lib)
Versions: Python 3.5

process
Status: closed
Resolution: not a bug
Nosy List: SilentGhost, datapythonista, mrabarnett, serhiy.storchaka
Priority: normal

Created on 2016-11-08 17:21 by datapythonista, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
allCountries_sections.zip (uploaded by mrabarnett, 2016-11-08 21:07)
Messages (10)
msg280323 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-08 17:21
I'm using the csv module from the Python standard library to read a 1.4 GB file with 11,157,064 rows. The file is the Geonames dataset for all countries, which can be freely downloaded [1].

I'm using this code to read it:

    import csv

    with open('allCountries.txt', 'r') as fd:
        reader = csv.reader(fd, delimiter='\t')
        for i, row in enumerate(reader):
            pass

    print(i + 1)  # prints 10381963
    print(reader.line_num)  # prints 11157064

For some reason, around 7% of the rows in the file are skipped. The rows don't have anything special (most of them are all ASCII characters, even though the file is in UTF-8).

If I create a new file with all the skipped rows and read it again in the same way, around 30% of the rows are skipped. So many of them weren't returned by the iterator when they were part of the bigger file, but now they are.

Note that the line_num attribute has the right number. Also note that if I remove the delimiter parameter (tab) from the reader, so it uses the default comma, the iteration over the reader doesn't skip any row.

I checked what I think is the relevant part of the code [2], but I couldn't see anything that could cause this bug.


1. http://download.geonames.org/export/dump/allCountries.zip
2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787
msg280324 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-08 17:28
Could you perhaps make the smaller file available somewhere?
msg280325 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-08 17:39
> If I create a new file with all the skipped rows and read it again in the same way, around 30% of the rows are skipped.

Could you please provide this smaller file? Or better yet, do a few iterations of keeping only the skipped lines until the file shrinks to just a few lines.
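A rough sketch of one such reduction step (a hypothetical helper, not something from the thread; it rewrites the file keeping only the physical lines that the reader folded into a previous row instead of returning as rows):

    import csv

    def keep_swallowed(path, delimiter='\t'):
        with open(path) as fd:
            lines = fd.readlines()
        swallowed = []
        with open(path) as fd:
            reader = csv.reader(fd, delimiter=delimiter)
            consumed = 0
            for _ in reader:
                # a clean row consumes exactly one physical line; the rest
                # were pulled in by an unclosed quote
                swallowed.extend(lines[consumed + 1:reader.line_num])
                consumed = reader.line_num
        with open(path, 'w') as fd:
            fd.writelines(swallowed)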
msg280326 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-08 17:42
What is the average number of columns in the file?
msg280349 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2016-11-08 21:07
I split the file into sections, each containing no more than 1000 lines, and tried reading each section. Attached is a zip file of those that didn't return the expected number of rows.

The problem appears to be due to unclosed quotes, which cause the following lines to be consumed as part of the field.

It looks a little strange, so I wonder if the file got corrupted somewhere.
msg280351 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-08 21:21
So using quoting=csv.QUOTE_NONE should solve the immediate problem of "losing" lines. I'm not sure the csv module was ever meant to deal with corrupted files.
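Applied to the loop from msg280323, that would look something like this (a minimal sketch, assuming the same allCountries.txt; QUOTE_NONE makes the reader treat '"' as ordinary data, so no line can be swallowed by an open quote):

    import csv

    with open('allCountries.txt', 'r') as fd:
        reader = csv.reader(fd, delimiter='\t', quoting=csv.QUOTE_NONE)
        # every physical line now yields exactly one row
        count = sum(1 for row in reader)

    print(count)            # expected to match line_num below
    print(reader.line_num)  # 11157064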
msg280352 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-08 21:37
Sorry, my fault. It looks like the quotes in the file were the problem. As mentioned, adding the quoting parameter fixes it.

I'd assume that if quotes are not paired, csv should raise an exception. And I don't think that all the different chunks of the file I tested always had an even number of quotes.

Also, I don't understand why using the default delimiter worked well, while with the tab delimiter the problem happened.

I'll take a look in more detail, but I'm closing this issue.

Thank you all a lot for the help!
msg280384 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-09 09:11
I was able to research the problem a bit more. This is minimal code that reproduces what happened:

    from io import StringIO
    import csv

    csv_file = StringIO('''1\t"A
    2\tB''')

    reader = csv.reader(csv_file, delimiter='\t')
    for i, row in enumerate(reader):
        pass

    print(reader.line_num)  # 2: two physical lines were read
    print(i + 1)            # 1: but only one row was returned

The reason the right number of rows is returned with the default delimiter is that a quote has to appear immediately after a delimiter (or at the start of a record) to be considered the opening of a quoted field. With the default comma, the quote in this data follows a tab in the middle of a field, so it is treated as a literal character.
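A small demo of that rule, reusing the data from the snippet above:

    from io import StringIO
    import csv

    csv_file = StringIO('1\t"A\n2\tB')

    # default delimiter ',': the '"' is mid-field, hence literal
    rows = list(csv.reader(csv_file))
    print(len(rows))  # 2
    print(rows)       # [['1\t"A'], ['2\tB']]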

If the file contains an opening quote and EOF is reached without the closing quote, the reader considers all the text up to EOF to be part of that field.

This would work as expected in a line like:

    1,"well quoted text","this one has a missing quote

But it'd fail silently with unexpected results in all other cases. I'd expect csv to raise an exception rather than the current behavior.

Do you agree? Should I create another issue to address this?
msg280385 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-09 09:28
No, the module works exactly as advertised. The default value of the quoting parameter might not be suitable for this file, but it suits the majority of files out there. Fields in csv can contain line feeds, so the line in your example does not have a "missing" quote; it's that you're using the wrong (the default) quoting parameter.
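To illustrate that point (a small sketch, not from the thread): the csv writer itself produces quoted fields that span physical lines, and the reader folds them back into one row:

    import csv
    import io

    buf = io.StringIO()
    csv.writer(buf).writerow(['1', 'two\nlines'])  # field with a line feed
    print(repr(buf.getvalue()))  # '1,"two\nlines"\r\n': quoted, two physical lines

    buf.seek(0)
    print(next(csv.reader(buf)))  # ['1', 'two\nlines']: one logical row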
msg280386 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-09 09:38
I agree that in my case I was using the wrong quoting parameter, and if I specify that my file has no quotes, it works as expected.

But I still think that in a different case, when a file does have quotes but they are not paired, it'd be better to raise an exception than to silently ignore the error and assume there is just a missing quote at the end.

From the Zen of Python: "Errors should never pass silently", and I think it's clear that there is an error in the file.
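In the meantime, a workable stopgap (a hedged sketch, not an existing csv feature) is the same check that exposed the problem here: compare physical lines consumed with logical rows returned. Note that this would also flag legitimate quoted fields containing line feeds, so it only fits files where one line is meant to be one row:

    import csv

    def read_checked(fd, **fmtparams):
        # hypothetical helper: fail loudly when rows get silently merged
        reader = csv.reader(fd, **fmtparams)
        rows = list(reader)
        if reader.line_num != len(rows):
            raise ValueError(
                '%d physical lines produced %d rows; an unclosed quote '
                'may have swallowed lines' % (reader.line_num, len(rows)))
        return rows
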
History
Date                 User              Action  Args
2022-04-11 14:58:39  admin             set     github: 72828
2016-11-09 09:38:23  datapythonista    set     messages: + msg280386
2016-11-09 09:28:22  SilentGhost       set     messages: + msg280385
2016-11-09 09:11:41  datapythonista    set     messages: + msg280384
2016-11-09 04:52:35  SilentGhost       set     stage: resolved
2016-11-08 21:37:16  datapythonista    set     status: open -> closed
                                               resolution: not a bug
                                               messages: + msg280352
                                               title: csv reader loosing rows with big files and tab delimiter -> csv reader losing rows with big files and tab delimiter
2016-11-08 21:21:24  SilentGhost       set     messages: + msg280351
2016-11-08 21:07:51  mrabarnett        set     files: + allCountries_sections.zip
                                               nosy: + mrabarnett
                                               messages: + msg280349
2016-11-08 17:42:25  serhiy.storchaka  set     messages: + msg280326
2016-11-08 17:39:33  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg280325
2016-11-08 17:28:22  SilentGhost       set     nosy: + SilentGhost
                                               messages: + msg280324
2016-11-08 17:21:21  datapythonista    create