Message 280323 - Python tracker

Author	datapythonista
Recipients	datapythonista
Date	2016-11-08.17:21:21
Message-id	<1478625681.85.0.319156602721.issue28642@psf.upfronthosting.co.za>

Content
I'm using the csv module from the Python standard library to read a 1.4 GB file with 11,157,064 rows. The file is the Geonames dataset for all countries, which can be freely downloaded [1].

I'm using this code to read it:

    import csv

    with open('allCountries.txt', 'r') as fd:
        reader = csv.reader(fd, delimiter='\t')
        for i, row in enumerate(reader):
            pass

    print(i + 1)  # prints 10381963
    print(reader.line_num)  # prints 11157064

For some reason, around 7% of the rows in the file are skipped. The rows don't have anything special (most of them contain only ASCII characters, even though the file is in UTF-8).

If I create a new file with all the skipped rows and read it again in the same way, around 30% of the rows are skipped. So many of them weren't returned by the iterator when they were part of the bigger file, but now they are.
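
One way to find which lines are affected is to watch how far line_num advances for each returned row. The following is only a minimal sketch: the helper name records_spanning_lines and the sample data are made up for illustration and are not taken from allCountries.txt, and it assumes the "skipped" lines end up merged into other rows rather than being dropped.

```python
import csv
import io


def records_spanning_lines(lines, **reader_kwargs):
    """Yield (first_line, last_line, row) for rows built from more than one line.

    Hypothetical helper: it relies only on csv.reader and its line_num attribute.
    """
    reader = csv.reader(lines, **reader_kwargs)
    previous = 0  # value of line_num after the previous row
    for row in reader:
        if reader.line_num - previous > 1:
            # The reader consumed several physical lines to build this row.
            yield previous + 1, reader.line_num, row
        previous = reader.line_num


# Made-up sample: the field on line 2 opens a double quote that is
# only closed on line 3.
sample = io.StringIO('1\tx\n2\t"y\n3\tz"\n4\tw\n', newline='')
found = list(records_spanning_lines(sample, delimiter='\t'))
print(found)
```

With the real file, passing the open file object instead of the StringIO would list the line ranges that were merged into a single row, if any.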

Note that the line_num attribute has the right number. Also note that if I remove the delimiter parameter (tab) from the reader, so that it uses the default comma, iterating over the reader doesn't skip any rows.
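
For reference, the csv documentation describes line_num as the number of lines read from the source iterator, which is not necessarily the number of records returned, because a record can span several lines. The sketch below shows that difference on made-up data (the helper count_rows and the sample string are hypothetical). Whether unbalanced double quotes are what actually happens in allCountries.txt is an assumption that has not been verified here.

```python
import csv
import io

# Made-up sample with 4 physical lines. The first field on line 2 opens
# a double quote that is only closed on line 3.
sample = 'a\tb\n"c\td\ne"\tf\ng\th\n'


def count_rows(text, **reader_kwargs):
    """Return (rows returned by the reader, reader.line_num) for tab-separated text."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter='\t', **reader_kwargs)
    rows = sum(1 for _ in reader)
    return rows, reader.line_num


# Default dialect: '"' is the quote character, so lines 2 and 3 become one row.
default_counts = count_rows(sample)

# With quoting disabled, every physical line is returned as its own row.
no_quote_counts = count_rows(sample, quoting=csv.QUOTE_NONE)

print(default_counts)
print(no_quote_counts)
```

If the two counts also differ for the real file, comparing a run with quoting=csv.QUOTE_NONE against the default would be a quick way to check this assumption.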

I checked what I think is the relevant part of the code [2], but I couldn't see anything that could cause this bug.


1. http://download.geonames.org/export/dump/allCountries.zip
2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787

History
Date	User	Action	Args
2016-11-08 17:21:21	datapythonista	set	recipients: + datapythonista
2016-11-08 17:21:21	datapythonista	set	messageid: <1478625681.85.0.319156602721.issue28642@psf.upfronthosting.co.za>
2016-11-08 17:21:21	datapythonista	link	issue28642 messages
2016-11-08 17:21:21	datapythonista	create