Message27796
In 2006, I reported a bug in reading very large files on 32-bit Windows: python-Bugs-1451466
I have now tried a 64-bit Windows machine with Python 2.5
and I see the same bug.
For very large files (the two I tried were around 7-8 GB), the end of line is sometimes not taken into account.
The file itself is fine: viewed in a hex editor, the end-of-line characters (0D 0A) are intact at the place where the parser goes wrong.
The same script seems to work fine on my Mac OS X machine.
Example:
Original file reads:
###########################
.........
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-
1|ENSE00001440927|105829519|105829650|-1|1
(132 letters)
Database: Homo_sapiens.NCBI36.45.dna.chromosome17
1 sequences; 78,774,742 total letters
...............
###########################
In hex:
###########################
...
c5bd3500h: 32 2E 0D 0A 0D 0A 51 75 65 72 79 3D 20 31 30 7C ; 2.....Query= 10|
c5bd3510h: 45 4E 53 47 30 30 30 30 30 32 30 33 32 38 38 7C ; ENSG00000203288|
c5bd3520h: 70 73 65 75 64 6F 67 65 6E 65 7C 31 30 35 38 32 ; pseudogene|10582
c5bd3530h: 39 34 31 36 7C 31 30 35 38 32 39 36 35 30 7C 2D ; 9416|105829650|-
c5bd3540h: 0D 0A 31 7C 45 4E 53 45 30 30 30 30 31 34 34 30 ; ..1|ENSE00001440
c5bd3550h: 39 32 37 7C 31 30 35 38 32 39 35 31 39 7C 31 30 ; 927|105829519|10
c5bd3560h: 35 38 32 39 36 35 30 7C 2D 31 7C 31 0D 0A 20 20 ; 5829650|-1|1..
c5bd3570h: 20 20 20 20 20 20 20 28 31 33 32 20 6C 65 74 74 ;        (132 lett
c5bd3580h: 65 72 73 29 0D 0A 0D 0A 44 61 74 61 62 61 73 65 ; ers)....Database
c5bd3590h: 3A 20 48 6F 6D 6F 5F 73 61 70 69 65 6E 73 2E 4E ; : Homo_sapiens.N
c5bd35a0h: 43 42 49 33 36 2E 34 35 2E 64 6E 61 2E 63 68 72 ; CBI36.45.dna.chr
c5bd35b0h: 6F 6D 6F 73 6F 6D 65 31 37 20 0D 0A 20 20 20 20 ; omosome17 ..
c5bd35c0h: 20 20 20 20 20 20 20 31 20 73 65 71 75 65 6E 63 ;        1 sequenc
c5bd35d0h: 65 73 3B 20 37 38 2C 37 37 34 2C 37 34 32 20 74 ; es; 78,774,742 t
c5bd35e0h: 6F 74 61 6C 20 6C 65 74 74 65 72 73 0D 0A 0D 0A ; otal letters....
...
#######################################
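The hex view above can be cross-checked without a hex editor. The following is only a minimal sketch, written for a current Python 3 rather than the Python 2.5 used in this report; the helper name `hex_dump` and its parameters are hypothetical. It opens the file in binary mode, so no newline translation is applied, and prints the bytes in the same layout as the dump above.

```python
def hex_dump(path, offset, length=64, width=16):
    """Return hex-dump lines for `length` bytes of `path` starting at `offset`."""
    # Binary mode: the bytes are read exactly as stored, CR and LF included.
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(length)
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = ' '.join('%02X' % b for b in chunk)
        # Non-printable bytes (such as 0D and 0A) are shown as '.'
        text_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append('%08xh: %s ; %s' % (offset + start, hex_part, text_part))
    return lines
```

For the file in this report, a call such as `hex_dump(r'D:\human_exons\chr17\10_17.out', 0xc5bd3500, 240)` would be expected to show the same `0D 0A` pairs as the dump above, assuming the offset shown by the hex viewer is the true file offset.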
Demo Python script:
#############################
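import os.path

initial_dir = r'D:\human_exons\chr17'
fichier = os.path.join(initial_dir, '10_17.out')
fichin = open(fichier)  # default text mode
ok = 0  # becomes 1 once the target line has been seen
i = 0   # current line number
for li in fichin:
    i += 1
    if li.startswith('Query= '):
        query = li  # remember the most recent query header
    elif li.startswith('1|ENSE00001440927|105829519|105829650|-1|1'):
        ok = 1
    if ok == 1:
        # from the target line onward, print line number, query and line
        print i
        print query
        print li
fichin.close()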
################################
Output:
160968087
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-
1|ENSE00001440927|105829519|105829650|-1|1 (132 letters)
160968088
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-
In fact, the line reported as number 160968087 should be number 160981763.
####################################
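One way to check the line numbers independently of text-mode iteration is to count newline bytes in binary mode and compare. This is only a sketch under assumptions: it targets a current Python 3 rather than Python 2.5, and the function names `count_lines_binary` and `count_lines_text` are hypothetical.

```python
def count_lines_binary(path, chunk_size=1 << 20):
    """Count LF (0x0A) bytes by reading fixed-size chunks in binary mode."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # LF is a single byte, so chunk boundaries cannot split it.
            count += chunk.count(b'\n')
    return count


def count_lines_text(path):
    """Count lines the way the demo script does: iterate in text mode."""
    count = 0
    with open(path) as f:
        for _ in f:
            count += 1
    return count
```

If the file ends with a line terminator, the two counts should agree; if the last line has no terminator, the text-mode count is one higher. Any larger gap would point to line endings being missed during text-mode reading.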
Computer:
Dell Precision PWS690, 2 dual-core CPUs
Intel Xeon 5160 @ 3.00GHz, 2.99 GHz, 16.0 GB of RAM
Microsoft Windows XP Professional x64 Edition, Version 2003
Windows [Version 5.2.3790]
#####################################
Richard Christen

Date | User | Action | Args
2007-08-23 14:38:33 | admin | link | issue1451466 messages
2007-08-23 14:38:33 | admin | create |