Issue 1141: reading large files

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45482

classification

Title:	reading large files
Type:	behavior	Stage:
Components:	Tests	Versions:	Python 3.0

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:		Nosy List:	Richard.Christen@unice.fr, jafo, loewis, music, pythonmeister
Priority:	normal	Keywords:

Created on 2007-09-10 12:45 by Richard.Christen@unice.fr, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
christen.vcf	Richard.Christen@unice.fr, 2007-09-10 14:20
christen.vcf	Richard.Christen@unice.fr, 2007-09-10 15:46

Messages (10)
msg55777 - (view)	Author: christen (Richard.Christen@unice.fr)	Date: 2007-09-10 12:45
September 11, 2007

I downloaded py 3.k

The good news: Under Windows, Python 3k properly reads files larger than 4 GB (in contrast to Python 2.5, which skips some lines; see below).

The bad news: py 3k is very slow compared to py 2.5; see the results below.

The code reads a 4.9 GB file of 81,017,719 lines (a GenBank entry of bacterial sequences):

#######################
    import time
    print (time.localtime())
    fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
    t0= time.localtime()
    print (t0)
    i=0
    for li in fichin:
        i+=1
        if i%1000000==0:
            print (i,time.localtime())
    fichin.close()
    print ()
    print (i)
    print (time.localtime())
#########################

I got the following results (Windows XP 64) on the same machine, using either py 3k or py 2.5. As soon as my BSD and Linux machines are done with calculations, I will try that on them.

Best
Richard Christen

python 3k

    (2007, 9, 10, 13, 53, 36, 0, 253, 1)
    (2007, 9, 10, 13, 53, 36, 0, 253, 1)
    1000000 (2007, 9, 10, 13, 53, 49, 0, 253, 1)
    2000000 (2007, 9, 10, 13, 54, 3, 0, 253, 1)
    3000000 (2007, 9, 10, 13, 54, 18, 0, 253, 1)
    4000000 (2007, 9, 10, 13, 54, 32, 0, 253, 1)
    5000000 (2007, 9, 10, 13, 54, 47, 0, 253, 1)
    ....
    77000000 (2007, 9, 10, 14, 14, 55, 0, 253, 1)
    78000000 (2007, 9, 10, 14, 15, 9, 0, 253, 1)
    79000000 (2007, 9, 10, 14, 15, 22, 0, 253, 1)
    80000000 (2007, 9, 10, 14, 15, 36, 0, 253, 1)
    81000000 (2007, 9, 10, 14, 15, 49, 0, 253, 1)

    81017719 #this is the proper number of lines
    (2007, 9, 10, 14, 15, 50, 0, 253, 1)

Python 2.5

    (2007, 9, 10, 14, 18, 33, 0, 253, 1)
    (2007, 9, 10, 14, 18, 33, 0, 253, 1)
    (1000000, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
    (2000000, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
    (3000000, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
    (4000000, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
    (5000000, (2007, 9, 10, 14, 18, 36, 0, 253, 1))
    ...
    (77000000, (2007, 9, 10, 14, 19, 10, 0, 253, 1))
    (78000000, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
    (79000000, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
    (80000000, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
    (81000000, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
    ()
    81014962 #python 2.5 missed some lines !!!!
    (2007, 9, 10, 14, 19, 12, 0, 253, 1)
msg55778 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-09-10 14:04
If you would like to help resolving the issue with the missing lines, please submit a separate report for that. It is very difficult to track unrelated bugs in a single tracker issue. It would help if you could determine which lines are missing, e.g. by writing out all lines and then comparing the two files.

If you want to compute runtimes, it is better to not convert them to local time. Instead, use the pattern

    start = time.time()
    ...
    print time.time()-start # seconds since the program started
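A minimal sketch of the elapsed-time pattern suggested above, written for current Python 3. The function name `count_lines_timed` and the parameter `report_every` are hypothetical, chosen for illustration. It also substitutes `time.perf_counter()` (available since Python 3.3) for `time.time()`, because it is a monotonic clock intended for measuring intervals; that substitution is an editorial choice, not part of the original advice.

```python
import time


def count_lines_timed(path, report_every=1_000_000):
    """Count lines in a text file, printing elapsed seconds periodically.

    Hypothetical helper for illustration; not taken from the issue.
    """
    start = time.perf_counter()  # reference point for all elapsed times
    count = 0
    with open(path) as handle:
        # enumerate from 1 so that `count` is the number of lines read so far
        for count, _line in enumerate(handle, start=1):
            if count % report_every == 0:
                print(count, time.perf_counter() - start)
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Calling `count_lines_timed(path)` returns the line count together with the total elapsed seconds, so two interpreters can be compared by subtracting plain numbers instead of reading `time.localtime()` tuples.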
msg55779 - (view)	Author: christen (Richard.Christen@unice.fr)	Date: 2007-09-10 14:20
Hi Martin

I could certainly do that, but how you get my huge files ? 5 GB of data is quite big...

> If you want to compute runtimes, it is better to not convert them to
> local time. Instead, use the pattern
>
> start = time.time()
> ...
> print time.time()-start # seconds since the program started
>

OK I'll do that next time

Richard
msg55781 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-09-10 14:28
> I could certainly do that, but how you get my huge files ? 5 GB of data
> is quite big...

[not sure what "that" is]

I did not mean to suggest that you attach such a large file. Instead, just report that as a separate bug report, and be prepared to answer follow-up questions.

Regards,
Martin
msg55782 - (view)	Author: Stefan Sonnenberg-Carstens (pythonmeister)	Date: 2007-09-10 14:29
Perhaps this is an issue of line separation? Could you provide the output of wc -l on a *NIX box?

And, could you try with this code:

    import sys
    print(sys.version_info)
    import time
    print (time.localtime())
    fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
    start = time.time()
    for i,li in enumerate(fichin):
        if i%1000000==0 and i>0:
            print (i,start-time.time())
    fichin.close()
    print(i)
    print(start-time.time())

Thx
msg55783 - (view)	Author: Stefan Sonnenberg-Carstens (pythonmeister)	Date: 2007-09-10 14:32
Sorry, this way:

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
    start = time.time()
    for i,li in enumerate(fichin):
        if i%1000000==0 and i>0:
            print (i,time.time()-start)
    fichin.close()
    print(i)
    print(time.time()-start)
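Where no *NIX box is at hand, a count comparable to `wc -l` (which counts newline characters) can be obtained by reading the file in binary mode, so that no text-mode newline translation or decoding is involved. The following is only a sketch; the function name `count_newlines` and the `chunk_size` default are assumptions made for illustration.

```python
def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size binary chunks.

    Hypothetical helper for illustration; not taken from the issue.
    """
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            total += chunk.count(b"\n")
    return total
```

Comparing this number with the count from a text-mode `for` loop over the same file is one way to check whether a discrepancy comes from how line separators are interpreted. Note that a final line without a trailing newline is not counted here, which matches `wc -l`.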
msg55784 - (view)	Author: christen (Richard.Christen@unice.fr)	Date: 2007-09-10 15:46
Hi Stefan

Calculations are underway. Both read and write do not work well with p3k.

You can try the code below on your own machine:

    fichout.write(str(i)+' '*59+'\n') #generates a big file
    fichout.write(str(i)+'\n') #generate file <4 GB

The big file is not read properly with python 2.5 (the small one is). The big file is long to write and to read with python 3.k

I send you the results as soon it is done under 3k (very very slow indeed)

best
r

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    liste=[]
    start = time.time()
    fichout=open('test.txt','w')
    for i in xrange(85014961):
        if i%5000000==0 and i>0:
            print (i,time.time()-start)
        fichout.write(str(i)+' '*59+'\n')
    fichout.close()
    print ('total lines written ',i)
    print (i,time.time()-start)
    print ('*'*50)
    fichin=open('test.txt')
    start3 = time.time()
    for i,li in enumerate(fichin):
        if i%5000000==0 and i>0:
            print (i,time.time()-start3)
    fichin.close()
    print ('total lines read ',i)
    print(time.time()-start)
msg55831 - (view)	Author: Ben Beasley (music)	Date: 2007-09-11 18:33
I ran Richard Christen's script from msg55784 on Ubuntu Feisty Fawn (64-bit) with both Python 2.5.1 and Python 3.0a1 (for the latter, I had to change xrange to range).

    (2, 5, 1, 'final', 0)
    2007-09-11 11:39:08
    (5000000, 7.3925600051879883)
    (10000000, 15.068881034851074)
    (15000000, 22.870260953903198)
    (20000000, 30.588511943817139)
    (25000000, 37.977153062820435)
    (30000000, 45.393024921417236)
    (35000000, 57.039968013763428)
    (40000000, 71.122976064682007)
    (45000000, 85.065402984619141)
    (50000000, 97.03105092048645)
    (55000000, 108.22125887870789)
    (60000000, 122.95617389678955)
    (65000000, 130.45936799049377)
    (70000000, 141.0406129360199)
    (75000000, 150.52000093460083)
    (80000000, 158.0419979095459)
    (85000000, 168.46517896652222)
    ('total lines written ', 85014960)
    (85014960, 168.48725986480713)
    ************************************************
    (5000000, 11.699964046478271)
    (10000000, 18.510161876678467)
    (15000000, 27.110308885574341)
    (20000000, 35.410284996032715)
    (25000000, 41.88045597076416)
    (30000000, 48.734965085983276)
    (35000000, 56.416620016098022)
    (40000000, 65.14509105682373)
    (45000000, 73.711935043334961)
    (50000000, 82.278150081634521)
    (55000000, 90.984658002853394)
    (60000000, 99.987648963928223)
    (65000000, 104.64127588272095)
    (70000000, 109.73277306556702)
    (75000000, 114.78491401672363)
    (80000000, 120.38562488555908)
    (85000000, 126.08317303657532)
    ('total lines read ', 85014960)
    294.583214998

    (3, 0, 0, 'alpha', 1)
    2007-09-11 12:20:53
    5000000 117.375117064
    10000000 238.183109045
    15000000 357.397506952
    20000000 476.816791058
    25000000 597.198447943
    30000000 717.393661976
    35000000 837.278333902
    40000000 956.919227839
    45000000 1077.25333095
    50000000 1196.60731292
    55000000 1316.08601999
    60000000 1434.81360602
    65000000 1554.1584239
    70000000 1673.04580498
    75000000 1792.35387397
    80000000 1912.65659904
    85000000 2032.99598598
    total lines written 85014960
    85014960 2033.35042787
    ************************************************
    5000000 89.7920100689
    10000000 180.910079002
    15000000 272.628970146
    20000000 364.904497147
    25000000 457.229861021
    30000000 549.14190793
    35000000 641.054435968
    40000000 733.30577898
    45000000 826.058191061
    50000000 917.997677088
    55000000 1010.20616603
    60000000 1102.142905
    65000000 1194.16728902
    70000000 1286.54789495
    75000000 1378.50006604
    80000000 1470.37746692
    85000000 1562.25738001
    total lines read 85014960
    3595.88338494
msg55832 - (view)	Author: Ben Beasley (music)	Date: 2007-09-11 18:39
See the BDFL's comment in msg55828. "I know Py3k text I/O is very slow; it's written in Python and uses UTF-8 as the default encoding. We've got a summer of code student working on an accelerating this. (And if he doesn't finish we have another year to work on it before 3.0final is released.)"
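The quoted comment attributes the slowdown to the text I/O layer and its decoding step. One rough way to see how much of a given run is spent in that layer is to iterate over the same file once in text mode and once in binary mode, where lines come back as undecoded `bytes`. This is a sketch under stated assumptions: the function name `time_line_iteration` is hypothetical, and no particular ratio between the two timings is claimed here.

```python
import time


def time_line_iteration(path, mode, **open_kwargs):
    """Iterate over every line of `path` and return (line_count, seconds).

    `mode` and `open_kwargs` are passed straight to the built-in open().
    Hypothetical helper for illustration; not taken from the issue.
    """
    start = time.perf_counter()
    count = 0
    with open(path, mode, **open_kwargs) as handle:
        for _line in handle:
            count += 1
    return count, time.perf_counter() - start
```

For example, `time_line_iteration(path, "r", encoding="utf-8")` measures the text path, while `time_line_iteration(path, "rb")` measures the same loop without decoding. The line counts from both calls should agree for a file that uses `\n` or `\r\n` line endings throughout.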
msg55988 - (view)	Author: Sean Reifschneider (jafo) *	Date: 2007-09-18 04:36
I'm closing this because the slow I/O issue is known and expected to be resolved as part of the Python 3.0 development. The Windows problems with missing lines should be opened as a separate issue.
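For the separate report on missing lines, the earlier suggestion in this thread was to write out all lines from each interpreter and compare the two files. A minimal sketch of such a comparison follows; the function name `first_difference` is hypothetical, and the two paths stand for whatever dump files the reporter produces. It reads both files in binary mode so that the comparison itself is not affected by text-mode handling.

```python
from itertools import zip_longest


def first_difference(path_a, path_b):
    """Return (line_number, line_a, line_b) for the first differing line.

    Returns None when both files are identical. A side that has run out of
    lines is reported as None. Hypothetical helper for illustration.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        pairs = zip_longest(file_a, file_b)  # pads the shorter file with None
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                return number, line_a, line_b
    return None
```

The returned line number points at the first place where the two dumps diverge, which is usually enough to inspect the surrounding bytes of the original file by hand.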

History
Date	User	Action	Args
2022-04-11 14:56:26	admin	set	github: 45482
2007-09-18 04:36:00	jafo	set	status: open -> closed nosy: + jafo resolution: duplicate messages: + msg55988
2007-09-11 18:39:49	music	set	messages: + msg55832
2007-09-11 18:33:46	music	set	nosy: + music messages: + msg55831
2007-09-10 15:46:27	Richard.Christen@unice.fr	set	files: + christen.vcf messages: + msg55784
2007-09-10 14:32:24	pythonmeister	set	messages: + msg55783
2007-09-10 14:29:24	pythonmeister	set	nosy: + pythonmeister messages: + msg55782
2007-09-10 14:28:44	loewis	set	messages: + msg55781
2007-09-10 14:20:50	Richard.Christen@unice.fr	set	files: + christen.vcf messages: + msg55779
2007-09-10 14:04:50	loewis	set	nosy: + loewis messages: + msg55778
2007-09-10 12:45:03	Richard.Christen@unice.fr	create