classification
Title: csv reader losing rows with big files and tab delimiter
Stage: resolved
Components: Library (Lib)
Versions: Python 3.5

process
Status: closed
Resolution: not a bug
Nosy List: SilentGhost, datapythonista, mrabarnett, serhiy.storchaka
Priority: normal

Created on 2016-11-08 17:21 by datapythonista, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
allCountries_sections.zip (uploaded by mrabarnett, 2016-11-08 21:07)
Messages (10)
msg280323 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-08 17:21
I'm using the csv module from the Python standard library to read a 1.4 GB file with 11,157,064 rows. The file is the Geonames dataset for all countries, which can be freely downloaded [1].

I'm using this code to read it:

    import csv

    with open('allCountries.txt', 'r') as fd:
        reader = csv.reader(fd, delimiter='\t')
        for i, row in enumerate(reader):
            pass

    print(i + 1)  # prints 10381963
    print(reader.line_num)  # prints 11157064

For some reason, around 7% of the rows in the file are skipped. The rows don't have anything special (most of them are all ASCII characters, even though the file is in UTF-8).

If I create a new file with all the skipped rows and read it again in the same way, around 30% of the rows are skipped. So many of them weren't returned by the iterator when they were part of the bigger file, but now they are.

Note that the line_num attribute has the right number. Also note that if I remove the delimiter parameter (tab) from the reader, so it uses the default comma, the iteration over the reader doesn't skip any row.

I checked what I think is the relevant part of the code [2], but I couldn't see anything that could cause this bug.


1. http://download.geonames.org/export/dump/allCountries.zip
2. https://hg.python.org/cpython/file/tip/Modules/_csv.c#l787
msg280324 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-08 17:28
Could you perhaps make the smaller file available somewhere?
msg280325 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-08 17:39
> If I create a new file with all the skipped rows and read it again in the same way, around 30% of the rows are skipped.

Could you please provide this smaller file? Or better yet, do a few iterations of keeping only the skipped lines until the file shrinks to just a few lines.
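A rough sketch of one such reduction step (a hypothetical helper, not something from the thread; it rewrites the file keeping only the physical lines that the reader folded into a previous row instead of returning as rows):

    import csv

    def keep_swallowed(path, delimiter='\t'):
        with open(path) as fd:
            lines = fd.readlines()
        swallowed = []
        with open(path) as fd:
            reader = csv.reader(fd, delimiter=delimiter)
            consumed = 0
            for _ in reader:
                # a clean row consumes exactly one physical line; the rest
                # were pulled in by an unclosed quote
                swallowed.extend(lines[consumed + 1:reader.line_num])
                consumed = reader.line_num
        with open(path, 'w') as fd:
            fd.writelines(swallowed)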
msg280326 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-08 17:42
What is the average number of columns in the file?
msg280349 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2016-11-08 21:07
I split the file into sections, each containing no more than 1000 lines, and tried reading each section. Attached is a zip file of those that didn't return the expected number of rows.

The problem appears to be due to unclosed quotes, which cause the following lines to be consumed as part of the field.

It looks a little strange, so I wonder if the file got corrupted somewhere.
msg280351 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-08 21:21
So using quoting=csv.QUOTE_NONE should solve the immediate problem of "losing" lines. I'm not sure the csv module was ever meant to deal with corrupted files.
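Applied to the loop from msg280323, that would look something like this (a minimal sketch, assuming the same allCountries.txt; QUOTE_NONE makes the reader treat '"' as ordinary data, so no line can be swallowed by an open quote):

    import csv

    with open('allCountries.txt', 'r') as fd:
        reader = csv.reader(fd, delimiter='\t', quoting=csv.QUOTE_NONE)
        # every physical line now yields exactly one row
        count = sum(1 for row in reader)

    print(count)            # expected to match line_num below
    print(reader.line_num)  # 11157064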
msg280352 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-08 21:37
Sorry, my fault. It looks like the quotes in the file were the problem. As mentioned, adding the quoting parameter fixes it.

I'd assume that if quotes are not paired, csv should raise an exception. And I don't think that all the different chunks of the file I tested always had an even number of quotes.

Also, I don't understand why using the default delimiter worked well, while with the tab delimiter the problem happened.

I'll take a look in more detail, but I'm closing this issue.

Thank you all a lot for the help!
msg280384 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-09 09:11
I was able to research the problem a bit more. This is minimal code that reproduces what happened:

    from io import StringIO
    import csv

    csv_file = StringIO('''1\t"A
    2\tB''')

    reader = csv.reader(csv_file, delimiter='\t')
    for i, row in enumerate(reader):
        pass

    print(reader.line_num)  # 2: two physical lines were read
    print(i + 1)            # 1: but only one row was returned

The reason the right number of rows is returned with the default delimiter is that a quote has to appear immediately after a delimiter (or at the start of a record) to be considered the opening of a quoted field. With the default comma, the quote in this data follows a tab in the middle of a field, so it is treated as a literal character.
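A small demo of that rule, reusing the data from the snippet above:

    from io import StringIO
    import csv

    csv_file = StringIO('1\t"A\n2\tB')

    # default delimiter ',': the '"' is mid-field, hence literal
    rows = list(csv.reader(csv_file))
    print(len(rows))  # 2
    print(rows)       # [['1\t"A'], ['2\tB']]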

If the file contains an opening quote and EOF is reached without the closing quote, the reader considers all the text up to EOF to be part of that field.

This would work as expected in a line like:

    1,"well quoted text","this one has a missing quote

But it'd fail silently with unexpected results in all other cases. I'd expect csv to raise an exception rather than the current behavior.

Do you agree? Should I create another issue to address this?
msg280385 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-11-09 09:28
No, the module works exactly as advertised. The default value of the quoting parameter might not be suitable for this file, but it suits the majority of files out there. Fields in csv can contain line feeds, so the line in your example does not have a "missing" quote; it's that you're using the wrong (the default) quoting parameter.
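To illustrate that point (a small sketch, not from the thread): the csv writer itself produces quoted fields that span physical lines, and the reader folds them back into one row:

    import csv
    import io

    buf = io.StringIO()
    csv.writer(buf).writerow(['1', 'two\nlines'])  # field with a line feed
    print(repr(buf.getvalue()))  # '1,"two\nlines"\r\n': quoted, two physical lines

    buf.seek(0)
    print(next(csv.reader(buf)))  # ['1', 'two\nlines']: one logical row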
msg280386 - (view) Author: Marc Garcia (datapythonista) Date: 2016-11-09 09:38
I agree that in my case I was using the wrong quoting parameter, and if I specify that my file has no quotes, it works as expected.

But I still think that in a different case, when a file does have quotes but they are not paired, it'd be better to raise an exception than to silently ignore the error and assume there is just a missing quote at the end.

From the Zen of Python: "Errors should never pass silently", and I think it's clear that there is an error in the file.
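In the meantime, a workable stopgap (a hedged sketch, not an existing csv feature) is the same check that exposed the problem here: compare physical lines consumed with logical rows returned. Note that this would also flag legitimate quoted fields containing line feeds, so it only fits files where one line is meant to be one row:

    import csv

    def read_checked(fd, **fmtparams):
        # hypothetical helper: fail loudly when rows get silently merged
        reader = csv.reader(fd, **fmtparams)
        rows = list(reader)
        if reader.line_num != len(rows):
            raise ValueError(
                '%d physical lines produced %d rows; an unclosed quote '
                'may have swallowed lines' % (reader.line_num, len(rows)))
        return rows
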
History
Date                 User              Action  Args
2022-04-11 14:58:39  admin             set     github: 72828
2016-11-09 09:38:23  datapythonista    set     messages: + msg280386
2016-11-09 09:28:22  SilentGhost       set     messages: + msg280385
2016-11-09 09:11:41  datapythonista    set     messages: + msg280384
2016-11-09 04:52:35  SilentGhost       set     stage: resolved
2016-11-08 21:37:16  datapythonista    set     status: open -> closed
                                               resolution: not a bug
                                               messages: + msg280352
                                               title: csv reader loosing rows with big files and tab delimiter -> csv reader losing rows with big files and tab delimiter
2016-11-08 21:21:24  SilentGhost       set     messages: + msg280351
2016-11-08 21:07:51  mrabarnett        set     files: + allCountries_sections.zip
                                               nosy: + mrabarnett
                                               messages: + msg280349
2016-11-08 17:42:25  serhiy.storchaka  set     messages: + msg280326
2016-11-08 17:39:33  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg280325
2016-11-08 17:28:22  SilentGhost       set     nosy: + SilentGhost
                                               messages: + msg280324
2016-11-08 17:21:21  datapythonista    create