Author richardchristen
Date 2007-07-02.07:11:00
Content
In 2006, I reported a bug on 32-bit Windows when reading very large files: python-Bugs-1451466

I have now tried on a 64-bit Windows machine with Python 2.5, and I see the same bug.

For very large files (the two I tried were around 7-8 GB), an end-of-line is sometimes not recognized, so two lines are returned as one.

The file itself is fine: viewed in a hex editor, the end-of-line characters are perfectly OK at the place where the parser goes wrong.
The same script works correctly on the same file on my Mac OS X machine.

Example:
Original file reads:
###########################
.........
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-
1|ENSE00001440927|105829519|105829650|-1|1
         (132 letters)

Database: Homo_sapiens.NCBI36.45.dna.chromosome17 
           1 sequences; 78,774,742 total letters
...............
###########################

In hex:
###########################
...
c5bd3500h: 32 2E 0D 0A 0D 0A 51 75 65 72 79 3D 20 31 30 7C ; 2.....Query= 10|
c5bd3510h: 45 4E 53 47 30 30 30 30 30 32 30 33 32 38 38 7C ; ENSG00000203288|
c5bd3520h: 70 73 65 75 64 6F 67 65 6E 65 7C 31 30 35 38 32 ; pseudogene|10582
c5bd3530h: 39 34 31 36 7C 31 30 35 38 32 39 36 35 30 7C 2D ; 9416|105829650|-
c5bd3540h: 0D 0A 31 7C 45 4E 53 45 30 30 30 30 31 34 34 30 ; ..1|ENSE00001440
c5bd3550h: 39 32 37 7C 31 30 35 38 32 39 35 31 39 7C 31 30 ; 927|105829519|10
c5bd3560h: 35 38 32 39 36 35 30 7C 2D 31 7C 31 0D 0A 20 20 ; 5829650|-1|1..  
c5bd3570h: 20 20 20 20 20 20 20 28 31 33 32 20 6C 65 74 74 ;        (132 lett
c5bd3580h: 65 72 73 29 0D 0A 0D 0A 44 61 74 61 62 61 73 65 ; ers)....Database
c5bd3590h: 3A 20 48 6F 6D 6F 5F 73 61 70 69 65 6E 73 2E 4E ; : Homo_sapiens.N
c5bd35a0h: 43 42 49 33 36 2E 34 35 2E 64 6E 61 2E 63 68 72 ; CBI36.45.dna.chr
c5bd35b0h: 6F 6D 6F 73 6F 6D 65 31 37 20 0D 0A 20 20 20 20 ; omosome17 ..    
c5bd35c0h: 20 20 20 20 20 20 20 31 20 73 65 71 75 65 6E 63 ;        1 sequenc
c5bd35d0h: 65 73 3B 20 37 38 2C 37 37 34 2C 37 34 32 20 74 ; es; 78,774,742 t
c5bd35e0h: 6F 74 61 6C 20 6C 65 74 74 65 72 73 0D 0A 0D 0A ; otal letters....
...
#######################################
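
A dump in the same layout as the one above can be produced from Python itself by reading the region in binary mode, which bypasses any newline handling. This is only a minimal sketch: it assumes a current Python 3 (not the 2.5 used in this report), and `hex_dump` is a hypothetical helper name, not part of any library.

```python
def hex_dump(path, offset, length, width=16):
    """Return hex-dump lines for `length` bytes of `path` starting at `offset`."""
    with open(path, "rb") as f:  # binary mode: bytes are returned untranslated
        f.seek(offset)
        data = f.read(length)
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join("%02X" % b for b in chunk)
        # Show printable ASCII as-is and everything else (CR, LF, ...) as "."
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08xh: %-*s ; %s"
                     % (offset + start, width * 3 - 1, hex_part, text))
    return lines
```

Calling `hex_dump(fichier, 0xc5bd3500, 240)` and printing the returned lines would show whether the `0D 0A` pairs are present at the offsets listed above, assuming those offsets are true file positions.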


Demo Python script:
#############################
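import os.path

# Python 2.5 script: iterate over the file line by line in text mode.
initial_dir = r'D:\human_exons\chr17'
fichier = os.path.join(initial_dir, '10_17.out')
fichin = open(fichier)
ok = 0  # set to 1 once the target line has been seen
i = 0   # 1-based line counter
for li in fichin:
    i += 1
    if li.startswith('Query= '):
        query = li  # remember the most recent Query line
    elif li.startswith('1|ENSE00001440927|105829519|105829650|-1|1'):
        ok = 1
    if ok == 1:
        print i
        print query
        print li

fichin.close()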
################################

Output:
160968087
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-

1|ENSE00001440927|105829519|105829650|-1|1         (132 letters)

160968088
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-

In fact, the line reported as number 160968087 should be line 160981763.
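
One way to cross-check the line numbers independently of line iteration is to count the newline bytes directly in binary mode and compare that total with the number of lines the iterator yields. The sketch below makes several assumptions: a current Python 3 (not the 2.5 used in this report), a file whose lines all end in LF or CRLF, and content that can be decoded as Latin-1. Both function names are hypothetical.

```python
def count_newlines_binary(path, chunk_size=1 << 20):
    """Count LF bytes by reading fixed-size chunks in binary mode."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # A single byte cannot straddle a chunk boundary,
            # so the per-chunk counts add up exactly.
            total += chunk.count(b"\n")
    return total


def count_lines_iterated(path):
    """Count lines the way the demo script does: by iterating the file."""
    n = 0
    # newline="" keeps the line endings untranslated while still splitting on them.
    with open(path, "r", encoding="latin-1", newline="") as f:
        for _ in f:
            n += 1
    return n
```

If the file ends with a newline, the two counts should be equal. A smaller iterated count would mean that some line ends were missed during iteration, which is the symptom described in this report.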



####################################
Computer 
Dell Precision PWS690 2 CPU dual core
Intel Xeon
5160 @ 3.00GHz
2.99 GHz, 16.0 GB of RAM

Microsoft Windows XP
Professional x64 Edition
Version 2003
Windows [Version 5.2.3790]

#####################################

Richard Christen
History
Date User Action Args
2007-08-23 14:38:33 admin link issue1451466 messages
2007-08-23 14:38:33 admin create