classification
Title: Newline skipped in "for line in file" for huge file
Type: behavior
Stage: test needed
Components: Library (Lib), Windows
Versions: Python 3.2, Python 3.1, Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, amaury.forgeotdarc, benjamin.peterson, brian.curtin, mrabarnett, nnorwitz, pitrou, rhettinger, runedevik, tim.golden
Priority: normal
Keywords:

Created on 2007-06-28 11:23 by runedevik, last changed 2010-09-17 20:11 by amaury.forgeotdarc. This issue is now closed.

Messages (7)
msg32416 - Author: Rune Devik (runedevik) Date: 2007-06-28 11:23
Creating new ticket for the bug described here since it was closed (and I was not able to reopen it): http://sourceforge.net/tracker/index.php?func=detail&aid=1636950&group_id=5470&atid=105470

The problem is that when you open a huge file on Windows in "r" mode, it will sometimes merge two lines. As I said in the ticket above (probably overlooked, since I was updating a closed ticket):

Hi

I have the same problem with a huge file (8GB) containing long lines. Sometimes two lines are merged into one, and when I rerun the test script that reads the file, it is always the same lines that are merged. The merging also seems to happen more frequently towards the end of the file. I tried to reproduce it with a smaller data set (the 10 lines before the two lines that get merged, the two merged lines themselves, and the 10 lines after them), but I was not able to reproduce it on this smaller data set. However, if you open the huge file in "rb" mode instead of "r" mode, everything works as it should and no lines are merged at all! If I copy the file over to Linux and rerun the test script, no lines are merged (regardless of whether the mode is "r" or "rb"), so this is Windows-specific and might have something to do with the adding of \r\n if only \n is found when you open the file in "r" mode, maybe? I have reproduced it with both Python 2.3.5 and 2.5c1, on both Windows XP and Windows 2003.

More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
  - total number of lines read:  668909
  - length of the longest line:  13179792
  - length of the shortest line: 89
  - 56 lines contain the content of two lines
  - Always just two lines that are merged into one!
  - Always the same lines that are merged when rerunning the test on the same file.

open(file, "rb"):
  - total number of lines read:  668965
  - length of the longest line:  13179793
  - length of the shortest line: 90
  - no lines merged

Regards,
Rune Devik
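
[Editorial sketch, not part of the original report.] The statistics above can be collected with a script along the following lines. The original test script was never attached, so this is only a guess at its shape; the helper name line_stats is hypothetical. The builtin open() is used on purpose, because on Python 2 under Windows mode "r" goes through the C runtime's text mode, which is the code path under suspicion.

```python
def line_stats(path, mode):
    """Return (line_count, longest, shortest) for the file at `path`.

    `mode` is passed straight to the builtin open(), e.g. "r" or "rb".
    Hypothetical helper, written for illustration only.
    """
    count = 0
    longest = 0
    shortest = None
    with open(path, mode) as fp:
        for line in fp:
            n = len(line)
            count += 1
            longest = max(longest, n)
            shortest = n if shortest is None else min(shortest, n)
    return count, longest, shortest
```

Calling line_stats(path, "r") and line_stats(path, "rb") on the same file and comparing the line counts would show whether any lines were merged in text mode.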
msg32417 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-07-04 05:16
Without a reproducible test case, there's really nothing we can do.  You will need to debug this on your own.  Try setting a breakpoint in the debugger in the file object, probably in get_line().  If you can make a self-contained test case, then we can help.
msg85635 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-06 11:36
The only difference between "r" and "rb" (in Python 2.x) is that under
Windows, "r" mode converts "\r\n" line endings into "\n". But it's the
Windows C stdlib which does that, not Python. So maybe a bug in Windows
itself?
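
[Editorial sketch.] The "\r\n" to "\n" conversion would also account for the one-character difference between the "r" and "rb" line lengths in the report (13179792 vs 13179793, 89 vs 90). The example below imitates that conversion with Python's own io layer (newline=None), not with the Windows C runtime, so it only illustrates the effect of the translation; the helper name line_lengths is hypothetical.

```python
import io

def line_lengths(data, translate):
    """Return the length of each line in `data` (a bytes object).

    With translate=True the bytes are wrapped in a text layer that turns
    "\r\n" into "\n" (universal newlines); with translate=False the raw
    bytes are split at "\n" and the "\r" is kept.
    """
    raw = io.BytesIO(data)
    if translate:
        stream = io.TextIOWrapper(raw, encoding="latin-1", newline=None)
    else:
        stream = raw
    return [len(line) for line in stream]

sample = b"first line\r\nsecond line\r\n"
print(line_lengths(sample, translate=False))  # [12, 13]
print(line_lengths(sample, translate=True))   # [11, 12]
```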
msg85782 - Author: Matthew Barnett (mrabarnett) * Date: 2009-04-08 18:37
What do you mean by "towards the end of the file"? What are the offsets of
the two lines? (I'm thinking it might be something to do with the \r\n
lying across a boundary, such as the 4GB boundary.)
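
[Editorial sketch.] One way to answer the offset question is to walk the file in binary mode and record where each line starts and ends. The names line_offsets, lines_past and FOUR_GIB are hypothetical, and the 4GB boundary is only the guess made in this message.

```python
FOUR_GIB = 1 << 32  # the boundary suggested above; an assumption

def line_offsets(path):
    """Yield (line_number, start_offset, end_offset) for each line.

    The file is read in binary mode, so offsets are real byte positions
    and no newline translation is involved.
    """
    offset = 0
    with open(path, "rb") as fp:
        for number, line in enumerate(fp, 1):
            end = offset + len(line)
            yield number, offset, end
            offset = end

def lines_past(path, boundary=FOUR_GIB):
    """Return the numbers of all lines that end beyond `boundary` bytes."""
    return [n for n, start, end in line_offsets(path) if end > boundary]
```

Comparing the result of lines_past() with the numbers of the merged lines would show whether all merges happen past the boundary.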
msg116696 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-17 17:59
@Brian/Tim any thoughts on this?
msg116709 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-09-17 18:50
We need a reproducible test before being able to go forward with this.  At the very least, that would help isolate whether this is a build specific C library issue.
msg116716 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-17 20:11
I think there's actually a bug in the MSVCRT read() function, which was not too hard to spot (see explanation below).  In short, a CRLF file opened in text mode may skip a newline after 4GB.

I'm re-closing the issue as "won't fix". There's really nothing we can do about it.  But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
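
[Editorial sketch.] A minimal example of the io.open() workaround for 2.7, assuming the file can be decoded as latin-1 (the encoding is an assumption; the report does not say what the file contains). The helper name iter_lines is hypothetical. Note that on 2.7 io.open() in text mode yields unicode strings rather than str.

```python
import io

def iter_lines(path, encoding="latin-1"):
    """Yield the lines of `path` with "\r\n" translated to "\n".

    io.open() reads the raw file in binary mode and performs the newline
    translation in Python, so the C runtime's text mode is not involved.
    """
    with io.open(path, "r", encoding=encoding, newline=None) as fp:
        for line in fp:
            yield line
```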

Other issues: issue1142, issue1672853, issue1451466 also report the same end-of-line issue on Windows (I just searched for "windows gb" in the tracker...). I'll close them as well.

Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part.  We found a CR at end of buffer.  We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
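
[Editorial sketch.] The mechanism described above can be imitated with a toy model in Python. This is not the CRT code: the function translate_read and its seek_limit parameter are hypothetical, and seek_limit merely stands in for the 32-bit position limit, so that the effect shows up on a few bytes instead of a file larger than 4GB.

```python
import io

def translate_read(raw, bufsize, seek_limit):
    """Toy CRLF-to-LF translation with a one-byte lookahead.

    `raw` is a seekable binary stream. When a CR is the last byte of a
    buffer, one more byte is read to see whether it is an LF, and the
    position is then stepped back by one. The step back silently fails
    once the position exceeds `seek_limit` (assumed failure model).
    """
    out = bytearray()
    while True:
        chunk = raw.read(bufsize)
        if not chunk:
            break
        i, n = 0, len(chunk)
        while i < n:
            c = chunk[i:i + 1]
            if c != b"\r":
                out += c
                i += 1
            elif i + 1 < n:
                # CR inside the buffer: the next byte is already available.
                if chunk[i + 1:i + 2] == b"\n":
                    out += b"\n"
                    i += 2
                else:
                    out += b"\r"
                    i += 1
            else:
                # CR at end of buffer: peek ahead one byte.
                peek = raw.read(1)
                if not peek:
                    out += b"\r"
                else:
                    if raw.tell() <= seek_limit:
                        raw.seek(-1, io.SEEK_CUR)  # step back succeeds
                    # Otherwise the position stays after the peeked byte.
                    if peek != b"\n":
                        out += b"\r"
                    # If it was an LF, the CR is dropped and the next
                    # read is expected to deliver the LF.
                i += 1
    return bytes(out)

data = b"abc\r\nde\r\nfgh\r\n"
print(translate_read(io.BytesIO(data), 4, seek_limit=100))  # b'abc\nde\nfgh\n'
print(translate_read(io.BytesIO(data), 4, seek_limit=6))    # b'abc\ndefgh'
```

In the second call the step back fails for every lookahead past position 6, so the LF that was peeked is never delivered and "de" and "fgh" end up on one line, which matches the merged lines in the original report.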
History
Date                 User                Action  Args
2010-09-17 20:22:25  amaury.forgeotdarc  link    issue1451466 superseder
2010-09-17 20:21:27  amaury.forgeotdarc  link    issue1672853 superseder
2010-09-17 20:20:20  amaury.forgeotdarc  link    issue1142 superseder
2010-09-17 20:11:22  amaury.forgeotdarc  set     resolution: not a bug -> wont fix
                                                 messages: + msg116716
                                                 nosy: + amaury.forgeotdarc
2010-09-17 18:50:53  rhettinger          set     status: open -> closed
                                                 nosy: + rhettinger
                                                 messages: + msg116709
                                                 resolution: not a bug
2010-09-17 17:59:20  BreamoreBoy         set     nosy: + BreamoreBoy, tim.golden, brian.curtin
                                                 title: Newline skipped in "for line in file" -> Newline skipped in "for line in file" for huge file
                                                 messages: + msg116696
                                                 versions: + Python 3.1, Python 3.2, - Python 2.6
2009-04-08 18:37:22  mrabarnett          set     nosy: + mrabarnett
                                                 messages: + msg85782
2009-04-06 11:36:40  pitrou              set     messages: + msg85635
2009-04-06 11:19:57  pitrou              set     versions: + Python 2.7, - Python 3.1
2009-04-06 10:30:47  ajaksu2             set     versions: + Python 2.6, Python 3.1, - Python 2.5
                                                 nosy: + pitrou, benjamin.peterson
                                                 components: + Windows
                                                 type: behavior
                                                 stage: test needed
2007-06-28 11:23:31  runedevik           create