This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

Classification
Title: code sample showing errors reading large files with py 2.5/3.0
Type: behavior Stage: test needed
Components: IO, Windows Versions: Python 3.1, Python 2.6
Process
Status: closed Resolution: wont fix
Dependencies: Superseder: Newline skipped in "for line in file" for huge file
View: 1744752
Assigned To: tim.peters Nosy List: Richard.Christen@unice.fr, amaury.forgeotdarc, benjamin.peterson, jafo, pitrou, pythonmeister, tim.golden, tim.peters
Priority: normal Keywords:

Created on 2007-09-10 15:52 by Richard.Christen@unice.fr, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
christen.vcf Richard.Christen@unice.fr, 2007-09-11 06:18
christen.vcf Richard.Christen@unice.fr, 2007-09-12 06:10
Messages (11)
msg55785 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-10 15:52
Error reading >4 GB files under Windows

try this:

import sys
print(sys.version_info)
import time
print (time.strftime('%Y-%m-%d %H:%M:%S'))
liste=[]
start = time.time()
fichout=open('test.txt','w')
for i in xrange(85014961):
    if i%5000000==0 and i>0:
        print (i,time.time()-start)
    fichout.write(str(i)+' '*59+'\n')
fichout.close()
print ('total lines written ',i)
print (i,time.time()-start)
print ('*'*50)
fichin=open('test.txt')
start3 = time.time()
for i,li in enumerate(fichin):
    if i%5000000==0 and i>0:
        print (i,time.time()-start3)
fichin.close()
print ('total lines read ',i)
print(time.time()-start)

It generates a >4 GB file, and not all lines are read!
Example:
('total lines written ', 85014960)
('total lines read ', 85014950)
10 lines are missing

If you replace it with
fichout.write(str(i)+' '*59+'\n')

the file is now under 4 GB and is read properly.
Tested on both 32-bit and 64-bit Windows XP machines.

Seems to work with Linux and BSD (I did not try this exact example, but had no
problem with my home-made big files).
Problem: there are many examples of >4 GB files for the human genome and other
biological applications. I am almost sure people are making mistakes, because it
took me a while before discovering this...
Note: this does not happen with py3k :-)
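
One way to rule out the write side (a minimal sketch added for illustration, not
part of the original report; it assumes the same test.txt as the script above and
Python 2) is to count the newline bytes on disk in binary mode, which bypasses any
text-mode translation:

f = open('test.txt', 'rb')
newlines = 0
while True:
    chunk = f.read(1 << 20)        # 1 MiB at a time, keeps memory flat
    if not chunk:
        break
    newlines += chunk.count('\n')  # raw LF bytes, no text-mode translation
f.close()
print ('newline bytes on disk ', newlines)

If this count matches the number of lines written while the text-mode loop reads
fewer, the data is intact on disk and the loss happens on the reading side.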
msg55786 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-10 15:54
I made an error in the copy-paste:

if you replace by
fichout.write(str(i)+' '*59+'\n')

should have been:
if you replace by
fichout.write(str(i)+'\n')
of course :-(
msg55794 - (view) Author: Stefan Sonnenberg-Carstens (pythonmeister) Date: 2007-09-10 21:03
Error confirmed for this python:
Python 3.0a1 (py3k, Sep 10 2007, 22:45:51)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2


See this:
stefan@nx6310:~$ python2.4 large_io.py
(2, 4, 4, 'final', 0)
2007-09-10 21:41:52
(5000000, 14.321661949157715)
(10000000, 30.311280965805054)
(15000000, 45.24985408782959)
(20000000, 59.537726879119873)
(25000000, 74.075110912322998)
(30000000, 87.76087498664856)
(35000000, 104.54858303070068)
(40000000, 121.84645009040833)
(45000000, 137.88236308097839)
(50000000, 155.42996501922607)
(55000000, 171.81011009216309)
(60000000, 188.44834208488464)
(65000000, 204.46978211402893)
(70000000, 218.81346702575684)
(75000000, 232.86778998374939)
(80000000, 246.6789391040802)
(85000000, 260.89796900749207)
('total lines written ', 85014960)
(85014960, 260.94281101226807)
**************************************************
(5000000, 14.598887920379639)
(10000000, 29.428265810012817)
(15000000, 44.457981824874878)
(20000000, 60.351485967636108)
(25000000, 79.3228759765625)
(30000000, 94.667810916900635)
(35000000, 110.35149884223938)
(40000000, 126.19746398925781)
(45000000, 141.83787989616394)
(50000000, 157.46236801147461)
(55000000, 173.10227298736572)
(60000000, 188.19510197639465)
(65000000, 197.369295835495)
(70000000, 206.41998481750488)
(75000000, 215.53365993499756)
(80000000, 224.55904102325439)
(85000000, 233.75891900062561)
('total lines read ', 85014960)
494.727725029
stefan@nx6310:~$ python3.0 large_io.py
(3, 0, 0, 'alpha', 1)
2007-09-10 21:50:53
5000000 194.725461006



Tasks: 144 total,   3 running, 141 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.2%us,  1.3%sy,  0.0%ni, 48.3%id,  0.0%wa,  0.2%hi,  0.0%si, 
0.0%st
Mem:   1026804k total,   846416k used,   180388k free,     7952k buffers
Swap:  1028152k total,    66576k used,   961576k free,   679032k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28778 stefan    25   0  7800 3552 1596 R  100  0.3   6:01.48 python3.0
msg55801 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 21:55
PythonMeister, what do you mean, "confirmed"? Your read loop ends printing 

('total lines read ', 85014960)

which is the expected output.  (It's one less than the number of lines
written due to a bug in the program -- it prints the 0-based ordinal of
the last line written rather than the total number of lines written,
which is one more. But the bug is the same in the input and output loop.
 Richard's output from the read loop was

('total lines read ', 85014950)

i.e. 10 less than written.

I wonder if the bug is simply a matter of a failure to flush on Windows?
 I can't reproduce it on Linux (Ubuntu dapper).

Richard, can you somehow view the end of the file to see what its last
lines actually are?  It should end like this:

85014951
85014952
85014953
85014954
85014955
85014956
85014957
85014958
85014959
85014960
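
For a file this large, one way to look at the tail without scanning the whole
thing (a sketch added for illustration, not from the thread; it assumes the same
test.txt and Python 2) is to seek near the end in binary mode:

f = open('test.txt', 'rb')
f.seek(-1024, 2)                  # 1024 bytes before the end (whence=2: from EOF)
tail = f.read()
f.close()
for li in tail.split('\n')[1:]:   # drop the first, possibly partial, line
    print (li)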
msg55810 - (view) Author: Stefan Sonnenberg-Carstens (pythonmeister) Date: 2007-09-11 05:29
I can confirm that under Linux (Linux nx6310 2.6.22-1-mepis-smp #1 SMP
PREEMPT Wed Sep 5 22:23:08 EDT 2007 i686 GNU/Linux, SimplyMepis 7.0b3)
1. using Python 3.0a1 is _very_ slow
2. it eats all your CPU (see my post)
I did not take the time to wait for the program to finish with 3.0a1,
as my patience is limited. I don't think it would silently drop lines
the way the Windows version does.

To see if flushing matters, I'll try this later:

import sys
print(sys.version_info)
import time
print (time.strftime('%Y-%m-%d %H:%M:%S'))
liste=[]
start = time.time()
fichout=open('test.txt','w')
for i in xrange(85014961):
    if i%5000000==0 and i>0:
        print (i,time.time()-start)
    fichout.write(str(i)+' '*59+'\n')
    fichout.flush()
fichout.close()
print ('total lines written ',i)
print (i,time.time()-start)
print ('*'*50)
fichin=open('test.txt')
start3 = time.time()
for i,li in enumerate(fichin):
    if i%5000000==0 and i>0:
        print (i,time.time()-start3)
fichin.close()
print ('total lines read ',i)
print(time.time()-start)


I've seen a case lately on Windows XP SP2 with Python 2.3, where a
colleague of mine wrote some files he had read from a zip file to disk.
Before the close() he also had to flush() the written files
explicitly, otherwise he was not able to rename them afterwards.
His first approach was time.sleep(30), which was not an option.
I'll come back once I have run the code under Windows.
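
The pattern described above would look roughly like this (a hypothetical sketch,
not the colleague's actual code; file names and content are made up):

import os

data = 'payload extracted from the zip file\n'   # placeholder content
out = open('extracted.tmp', 'wb')
out.write(data)
out.flush()        # push Python's buffer down to the OS
out.close()        # release the handle so Windows allows the rename
os.rename('extracted.tmp', 'extracted.dat')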
msg55813 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-11 06:18
Hi Guido

It is not the end of the file that goes unread (see also below).

I found out about this about one year ago when I was parsing very large
files resulting from "blast" on the human genome.
My parser choked after 4 GB, well before the end of the file: one line
was missing and my acc=li[x:y] ended up with an error, because acc was
never filled...
This was kind of strange because this had not happened before with my
Linux box.

I opened the file (which I had created myself) with an editor that could
show hex codes: the proper line was there and all right.
If I remember well, I modified my code to see better what was going on:
in fact the missing line had been concatenated to the previous line
despite the end-of-line character being properly there (the hex code was OK). See
also below.

I forgot about that because nobody replied to my mails, and I thought it
was possibly related to 32-bit Windows. I moved to 64-bit Windows recently
(Windows has the best drivers for SQL databases) and forgot about the bug
until I ran into it again. I then decided to try Python 3k; it reads the
>4 GB file with no trouble but is so, so slow, both in reading and
writing files.
The following code produces either a <4 GB or a >4 GB file depending on
which fichout.write is commented out.
They both have the same number of lines, but the >4 GB file does not read
completely under Windows (32- or 64-bit).
I have no such problem on Linux or BSD (Mac).

Python 3k on Windows reads both files OK, but is very, very slow (change
xrange to range; I guess it is preposterous to advise you about that :-).

best
Richard

import sys
print(sys.version_info)
import time
print (time.strftime('%Y-%m-%d %H:%M:%S'))
liste=[]
start = time.time()
fichout=open('test.txt','w')
for i in xrange(85014961):
    if i%5000000==0 and i>0:
        print (i,time.time()-start)
    fichout.write(str(i)+' '*59+'\n')      #big file
    #fichout.write(str(i)+'\n')            #small file, same number of lines

    fichout.flush()
fichout.close()
print ('total lines written ',i)
print (i,time.time()-start)
print ('*'*50)
fichin=open('test.txt')
start3 = time.time()
for i,li in enumerate(fichin):
    if i%5000000==0 and i>0:
        print (i,time.time()-start3)
fichin.close()
print ('total lines read ',i)
print(time.time()-start)

> Richard, can you somehow view the end of the file to see what its last
> lines actually are?  It should end like this:
>
> 85014951
> 85014952
> 85014953
> 85014954
> 85014955
> 85014956
> 85014957
> 85014958
> 85014959
> 85014960
>
>   

Using a text editor, the end of the file reads:
85014944                                                          
85014945                                                          
85014946                                                          
85014947                                                          
85014948                                                          
85014949                                                          
85014950                                                          
85014951                                                          
85014952                                                          
85014953                                                          
85014954                                                          
85014955                                                          
85014956                                                          
85014957                                                          
85014958                                                          
85014959                                                          
85014960                                                          

Windows py 2.5, with
    if i>85014940:
        print i, li.strip()

prints:
(2, 5, 0, 'final', 0)
2007-09-11 07:58:47
(5000000, 2.6720001697540283)
(10000000, 5.375)
(15000000, 8.0320000648498535)
(20000000, 10.703000068664551)
(25000000, 13.375)
(30000000, 16.047000169754028)
(35000000, 18.703000068664551)
(40000000, 21.360000133514404)
(45000000, 24.032000064849854)
(50000000, 26.687999963760376)
(55000000, 29.360000133514404)
(60000000, 32.032000064849854)
(65000000, 34.703000068664551)
(70000000, 37.407000064849854)
(75000000, 40.094000101089478)
(80000000, 42.797000169754028)
(85000000, 45.485000133514404)
85014941 85014951                                                          
85014942 85014952                                                          
85014943 85014953                                                          
85014944 85014954                                                          
85014945 85014955                                                          
85014946 85014956                                                          
85014947 85014957                                                          
85014948 85014958                                                          
85014949 85014959                                                          
85014950 85014960  

==> the missing lines are from within the file

Now introduce in the loop: if len(li)>80: print li.strip()

(2, 5, 0, 'final', 0)
2007-09-11 08:08:16
(5000000, 3.1559998989105225)
(10000000, 6.3280000686645508)
(15000000, 9.4839999675750732)
(20000000, 12.655999898910522)
(25000000, 15.843999862670898)
(30000000, 19.016000032424927)
(35000000, 22.187999963760376)
(40000000, 25.358999967575073)
(45000000, 28.530999898910522)
(50000000, 31.703000068664551)
(55000000, 34.858999967575073)
(60000000, 38.030999898910522)
* 62410138                                                           
62410139 *
* 62414887                                                           
62414888 *
* 62415540                                                           
62415541 *
* 62420289                                                           
62420290 *
* 62420942                                                           
62420943 *
* 62421595                                                           
62421596 *
* 62422248                                                           
62422249 *
* 62422901                                                           
62422902 *
* 62427650                                                           
62427651 *
* 62428303                                                           
62428304 *
(65000000, 41.233999967575073)
(70000000, 44.437999963760376)
(75000000, 47.625)
(80000000, 50.828000068664551)
(85000000, 54.016000032424927)
('total lines read ', 85014950)
54.0309998989

==> the end of line is not read for 10 lines in the middle of the file! NTFS
file system.

best
Richard
msg55828 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-11 17:36
Folks, please focus on one issue at a time, and don't post such long
transcripts.

I know Py3k text I/O is very slow; it's written in Python and uses UTF-8
as the default encoding.  We've got a Summer of Code student working on
accelerating this.  (And if he doesn't finish, we have another year to
work on it before 3.0 final is released.)

So the real problem is that on Windows in 2.x reading files > 4 GB loses
data.  Please try to see if opening the file in binary mode still loses
data.  I suspect a problem in the Windows C stdio library related to
line endings, but who knows.
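
The binary-mode check suggested above could look like this (a sketch, not code
from the thread; it reuses the same test.txt). If the count comes out right here
but short in text mode, the loss is in the text-mode newline translation rather
than in the data actually written:

fichin = open('test.txt', 'rb')   # binary mode bypasses the CRT's newline translation
n = 0
for li in fichin:                 # iteration still splits on raw '\n' bytes
    n += 1
fichin.close()
print ('lines read in binary mode ', n)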
msg55837 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2007-09-12 06:10
The bug is still there, but the problem is solved: simply use open('file', 'U').
See the outputs:

fichin=open('test.txt','U')
===>
(2, 5, 0, 'final', 0)
2007-09-12 08:00:43
(5000000, 9.312000036239624)
(10000000, 22.312000036239624)
(15000000, 35.094000101089478)
(20000000, 47.812000036239624)
(25000000, 60.562000036239624)
(30000000, 73.265000104904175)
(35000000, 85.953000068664551)
(40000000, 98.672000169754028)
(45000000, 111.35900020599365)
(50000000, 123.98400020599365)
(55000000, 136.625)
(60000000, 149.26500010490417)
(65000000, 161.9060001373291)
(70000000, 174.625)
(75000000, 187.29700016975403)
(80000000, 199.89000010490417)
(85000000, 212.5310001373291)
('total lines read ', 85014960)
212.562000036

now with
fichin=open('test.txt')
or
fichin=open('test.txt','r')
===>

(2, 5, 0, 'final', 0)
2007-09-12 08:04:48
(5000000, 3.187999963760376)
(10000000, 6.3440001010894775)
(15000000, 9.4690001010894775)
(20000000, 12.594000101089478)
(25000000, 15.719000101089478)
(30000000, 18.844000101089478)
(35000000, 21.969000101089478)
(40000000, 25.094000101089478)
(45000000, 28.219000101089478)
(50000000, 31.344000101089478)
(55000000, 34.469000101089478)
(60000000, 37.594000101089478)
* 62410138                                                           
62410139 *
* 62414887                                                           
62414888 *
* 62415540                                                           
62415541 *
* 62420289                                                           
62420290 *
* 62420942                                                           
62420943 *
* 62421595                                                           
62421596 *
* 62422248                                                           
62422249 *
* 62422901                                                           
62422902 *
* 62427650                                                           
62427651 *
* 62428303                                                           
62428304 *
(65000000, 40.75)
(70000000, 43.953000068664551)
(75000000, 47.125)
(80000000, 50.328000068664551)
(85000000, 53.516000032424927)
('total lines read ', 85014950)
53.5160000324

best
Richard
msg55841 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-12 14:40
Cool. This helps track down the bug a bit more; it's either in (our
routine) getline_via_fgets or it's in Microsoft's text mode line end
translation (which universal newlines bypasses).

I'm assigning this to Tim Peters, who probably still has a Windows box
and once optimized the snot out of this code.
msg63808 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-17 23:48
I have run this under the current py3k SVN version on a 64-bit Linux
(Fedora 8), and it runs fine, FYI.  ISTR that I had a patch which fixed
something that sounds very much like this, but I can't find that other
issue.
msg116717 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-17 20:20
issue1744752 describes why it's probably a bug in the C library.
Possible workarounds are to open the files in universal-newline mode, to use io.open(), or to switch to Python 3!
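
A minimal illustration of the two Python 2 workarounds mentioned above (a sketch,
not code from the issue; io.open is available from Python 2.6 on):

# Universal-newline mode bypasses the C library's text-mode translation:
fichin = open('test.txt', 'U')
n = 0
for li in fichin:
    n += 1
fichin.close()
print ('lines read with open(..., "U") ', n)

# The io module (Python 2.6+) does its own buffering and newline handling
# instead of relying on the C stdio layer:
import io
fichin = io.open('test.txt', 'r')
n = 0
for li in fichin:
    n += 1
fichin.close()
print ('lines read with io.open ', n)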
History
Date User Action Args
2022-04-11 14:56:26  admin  set  github: 45483
2010-09-17 20:20:20  amaury.forgeotdarc  set  status: open -> closed
  nosy: + amaury.forgeotdarc
  messages: + msg116717
  superseder: Newline skipped in "for line in file" for huge file
  resolution: wont fix
2010-08-06 16:53:02  gvanrossum  set  nosy: - gvanrossum
2010-08-06 15:22:17  tim.golden  set  nosy: + tim.golden
2009-05-12 13:30:39  ajaksu2  set  nosy: + pitrou, benjamin.peterson
  versions: + Python 2.6, Python 3.1, - Python 2.5
  components: + IO
  stage: test needed
2008-03-17 23:48:17  jafo  set  priority: normal
  nosy: + jafo
  messages: + msg63808
2007-09-12 14:40:25  gvanrossum  set  assignee: tim.peters
  messages: + msg55841
  nosy: + tim.peters
2007-09-12 06:10:52  Richard.Christen@unice.fr  set  files: + christen.vcf
  messages: + msg55837
2007-09-11 17:36:47  gvanrossum  set  messages: + msg55828
  components: - Interpreter Core
  versions: - Python 3.0
2007-09-11 06:18:19  Richard.Christen@unice.fr  set  files: + christen.vcf
  messages: + msg55813
2007-09-11 05:29:32  pythonmeister  set  messages: + msg55810
2007-09-10 21:55:29  gvanrossum  set  nosy: + gvanrossum
  messages: + msg55801
2007-09-10 21:04:34  pythonmeister  set  title: code sample showing errors reading large files with py 2.5 -> code sample showing errors reading large files with py 2.5/3.0
  components: + Interpreter Core
  versions: + Python 3.0
2007-09-10 21:03:55  pythonmeister  set  nosy: + pythonmeister
  messages: + msg55794
2007-09-10 15:54:06  Richard.Christen@unice.fr  set  messages: + msg55786
2007-09-10 15:52:42  Richard.Christen@unice.fr  create