
Author runedevik
Date 2007-06-28.11:23:31
Content
Creating a new ticket for the bug described here, since the original was closed (and I was not able to reopen it): http://sourceforge.net/tracker/index.php?func=detail&aid=1636950&group_id=5470&atid=105470

The problem is that when you open a huge file on Windows in "r" mode, two lines are sometimes merged into one. As I said in the ticket above (where my comment was probably ignored, since I updated a closed ticket):

Hi

I have the same problem with a huge file (8 GB) containing long lines. Sometimes two lines are merged into one, and when I rerun the test script that reads the file, it is always the same lines that are merged. The merging also seems to happen more frequently towards the end of the file. I tried to reproduce it with a smaller data set (the 10 lines before the two merged lines, the two merged lines themselves, and the 10 lines after them), but I could not reproduce it that way. However, if I open the huge file in "rb" mode instead of "r" mode, everything works as it should and no lines are merged at all! If I copy the file over to Linux and rerun the test script, no lines are merged (regardless of whether the mode is "r" or "rb"), so this is Windows-specific. It might have something to do with the adding of \r\n if only \n is found when you open the file in "r" mode, maybe? I have reproduced it on both Python 2.3.5 and 2.5c1, on both Windows XP and Windows 2003.
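The comparison described above could be automated roughly as follows. This is only a sketch and not the test script used for this report: it is written for Python 3, and `find_merged` and `read_lines` are hypothetical helper names. It assumes the only difference between the two reads is that some pairs of lines are joined in "r" mode; after any other kind of mismatch the two streams would fall out of step.

```python
def find_merged(text_lines, binary_lines):
    """Return 0-based indexes (into text_lines) of lines that equal two
    consecutive lines from binary_lines joined together."""
    merged = []
    binary_iter = iter(binary_lines)
    for index, text_line in enumerate(text_lines):
        expected = next(binary_iter, None)
        if expected is None:
            break  # the binary-mode read ran out of lines first
        if text_line != expected:
            # Pull one more binary-mode line and check for a joined pair.
            following = next(binary_iter, "")
            if text_line == expected + following:
                merged.append(index)
    return merged


def read_lines(path, mode):
    """Yield lines without their terminators so both modes are comparable.

    Assumption: latin-1 is used because it maps every byte to exactly one
    character, so decoding can never fail regardless of the file's content.
    """
    if "b" in mode:
        with open(path, "rb") as fp:
            for line in fp:
                yield line.rstrip(b"\r\n").decode("latin-1")
    else:
        with open(path, "r", encoding="latin-1") as fp:
            for line in fp:
                yield line.rstrip("\r\n")


# Usage (the path is a placeholder):
# merged = find_merged(read_lines("huge.txt", "r"), read_lines("huge.txt", "rb"))
```

Both helpers work on iterators, so a file of this size is never loaded into memory as a whole.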

More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
  - total number of lines read:  668909
  - length of the longest line:  13179792
  - length of the shortest line: 89
  - 56 lines contain the content of two lines
  - It is always exactly two lines that are merged into one!
  - The same lines are merged every time the test is rerun on the same file.

fp = open(file, "rb"):
  - total number of lines read:  668965
  - length of the longest line:  13179793
  - length of the shortest line: 90
  - no lines merged
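
Figures like the ones above could be gathered with a short script along these lines. It is a sketch for Python 3 rather than the script that produced these numbers, and `line_stats` and `stats_for_file` are hypothetical names. Lengths include the line terminator, and the latin-1 encoding is an assumption made only so that text-mode decoding cannot fail.

```python
def line_stats(lines):
    """Count lines and track the longest and shortest line lengths."""
    count = 0
    longest = 0
    shortest = None  # stays None when there are no lines at all
    for line in lines:
        count += 1
        length = len(line)  # includes the trailing "\n" or "\r\n"
        longest = max(longest, length)
        shortest = length if shortest is None else min(shortest, length)
    return {"lines": count, "longest": longest, "shortest": shortest}


def stats_for_file(path, mode):
    # Binary mode does not accept an encoding argument, so pass it only
    # for text mode.
    kwargs = {} if "b" in mode else {"encoding": "latin-1"}
    # Iterating the file object yields one line at a time, so the whole
    # file is never held in memory.
    with open(path, mode, **kwargs) as fp:
        return line_stats(fp)


# Usage (the path is a placeholder):
# print(stats_for_file("huge.txt", "r"))
# print(stats_for_file("huge.txt", "rb"))
```

With lengths counted this way, the one-character difference in the longest and shortest line between the two modes would be consistent with "r" mode on Windows turning each \r\n into a single \n, while the difference of 56 in the line counts matches the 56 merged lines.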

Regards,
Rune Devik
History
Date                 User   Action  Args
2007-08-23 14:58:06  admin  link    issue1744752 messages
2007-08-23 14:58:06  admin  create