Title: .readline() returned garbled text
Type: behavior Stage: resolved
Components: IDLE Versions: Python 3.3
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: m123orning, r.david.murray
Priority: normal Keywords:

Created on 2014-01-27 17:11 by m123orning, last changed 2014-01-27 17:27 by r.david.murray. This issue is now closed.

Files
weird1.txt, uploaded by m123orning on 2014-01-27 17:11
Messages (2)
msg209452 - Author: Xiaoqing Rong (m123orning) Date: 2014-01-27 17:11
I'm using Windows 8. I created file 'weird1.txt' (attached) from an Excel worksheet using "save as Unicode Text (*.txt)". And this happened when I used Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:19:30) [MSC v.1600 64 bit (AMD64)] on win32:

>>> handle = open('weird1.txt'); handle.readline()
'ÿþ>\x00P\x006\x004\x00;\x00Y\x00A\x00L\x000\x000\x001\x00C\x00;\x00T\x00F\x00C\x003\x00;\x00 \x00S\x00G\x00D\x00I\x00D\x00:\x00S\x000\x000\x000\x000\x000\x000\x000\x000\x001\x00,\x00 \x00C\x00h\x00r\x00 \x00I\x00 \x00f\x00r\x00o\x00m\x00 \x001\x005\x001\x000\x000\x006\x00-\x001\x004\x007\x005\x009\x004\x00,\x001\x005\x001\x001\x006\x006\x00-\x001\x005\x001\x000\x009\x007\x00,\x00 \x00r\x00e\x00v\x00e\x00r\x00s\x00e\x00 \x00c\x00o\x00m\x00p\x00l\x00e\x00m\x00e\x00n\x00t\x00,\x00 \x00V\x00e\x00r\x00i\x00f\x00i\x00e\x00d\x00 \x00O\x00R\x00F\x00,\x00 \x00"\x00L\x00a\x00r\x00g\x00e\x00s\x00t\x00 \x00o\x00f\x00 \x00s\x00i\x00x\x00 \x00s\x00u\x00b\x00u\x00n\x00i\x00t\x00s\x00 \x00o\x00f\x00 \x00t\x00h\x00e\x00 \x00R\x00N\x00A\x00 \x00p\x00o\x00l\x00y\x00m\x00e\x00r\x00a\x00s\x00e\x00 \x00I\x00I\x00I\x00 \x00t\x00r\x00a\x00n\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00 \x00i\x00n\x00i\x00t\x00i\x00a\x00t\x00i\x00o\x00n\x00 \x00f\x00a\x00c\x00t\x00o\x00r\x00 \x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00 \x00(\x00T\x00F\x00I\x00I\x00I\x00C\x00)\x00;\x00 \x00p\x00a\x00r\x00t\x00 \x00o\x00f\x00 \x00t\x00h\x00e\x00 \x00T\x00a\x00u\x00B\x00 \x00d\x00o\x00m\x00a\x00i\x00n\x00 \x00o\x00f\x00 \x00T\x00F\x00I\x00I\x00I\x00C\x00 \x00t\x00h\x00a\x00t\x00 \x00b\x00i\x00n\x00d\x00s\x00 \x00D\x00N\x00A\x00 \x00a\x00t\x00 \x00t\x00h\x00e\x00 \x00B\x00o\x00x\x00B\x00 \x00p\x00r\x00o\x00m\x00o\x00t\x00e\x00r\x00 \x00s\x00i\x00t\x00e\x00s\x00 \x00o\x00f\x00 \x00t\x00R\x00N\x00A\x00 \x00a\x00n\x00d\x00 \x00s\x00i\x00m\x00i\x00l\x00a\x00r\x00 \x00g\x00e\x00n\x00e\x00s\x00;\x00 \x00c\x00o\x00o\x00p\x00e\x00\n'

Then I opened 'weird1.txt' in Notepad++ 6.5.2, created file 'weird2.txt' by copying the whole content of 'weird1.txt' into a new file and saved it in Notepad++ 6.5.2 (I wanted to attach 'weird2.txt' but only one attachment is allowed), and this happened:

>>> handle = open('weird2.txt'); handle.readline()
'>P64;YAL001C;TFC3; SGDID:S000000001, Chr I from 151006-147594,151166-151097, reverse complement, Verified ORF, "Largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC); part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; coope\n'

I can't see any difference between the contents of 'weird1.txt' and 'weird2.txt' using Notepad++ or the Windows Notepad. Maybe some experts could tell me what's going on here?
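One way to see a difference that text editors hide is to read the first few bytes of each file in binary mode. The following is only a minimal sketch: the sample files are small stand-ins written by the script itself (not the attached 'weird1.txt'), and `sniff_bom` and `make_samples` are hypothetical helper names introduced for illustration.

```python
import codecs
import os
import tempfile


def sniff_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # UTF-32 is checked first because its little-endian BOM begins with
    # the same two bytes as the UTF-16 little-endian BOM.
    candidates = [
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for bom, name in candidates:
        if head.startswith(bom):
            return name
    return None


def make_samples(directory):
    """Write two stand-in files: UTF-16 with a BOM, and plain ASCII."""
    text = '>P64;YAL001C;TFC3\n'
    utf16_path = os.path.join(directory, 'sample_utf16.txt')
    plain_path = os.path.join(directory, 'sample_plain.txt')
    with open(utf16_path, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE + text.encode('utf-16-le'))
    with open(plain_path, 'wb') as f:
        f.write(text.encode('ascii'))
    return utf16_path, plain_path


with tempfile.TemporaryDirectory() as d:
    for p in make_samples(d):
        with open(p, 'rb') as f:
            first_bytes = f.read(8)
        # Show the raw leading bytes next to the detected codec (if any).
        print(os.path.basename(p), first_bytes, sniff_bom(p))
```

A file with no BOM yields `None` here, which says nothing about its actual encoding; the check only recognizes files that announce themselves with a BOM.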
msg209453 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2014-01-27 17:27
The files use different encodings.  In the first case, I believe the first two bytes (which don't appear in the second example) are the BOM.  I'm not an expert, but I believe it is a utf-16 file (thus all the \x00 bytes).  The second file is presumably utf-8, with no BOM.  Notepad++ handles both automatically.  For Python, you have to tell it to look for the BOM by specifying the appropriate codec in the open call, for example encoding='utf-16'.  This is because Python's philosophy is to not guess at the encoding of files.  It does have a default encoding, but that default is taken from the locale; on Windows it is typically a code page such as cp1252 rather than utf-8, which is why the BOM bytes show up as 'ÿþ'.

Questions like this are better directed to the python-list mailing list, by the way.
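To illustrate the explanation above, here is a minimal sketch that reads the same bytes twice: once with a single-byte codec (cp1252 is assumed here as a stand-in for a typical Windows locale default) and once with the utf-16 codec, which consumes the BOM. The sample file is written by the script itself, and `read_first_line` is a hypothetical helper name, not part of any library.

```python
import codecs
import os
import tempfile


def read_first_line(path, encoding):
    """Open the file in text mode with an explicit codec and read one line."""
    with open(path, encoding=encoding) as handle:
        return handle.readline()


text = '>P64;YAL001C;TFC3\n'

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'sample_utf16.txt')
    # Write a little-endian UTF-16 file with a BOM; this is assumed to be
    # similar to what "save as Unicode Text" produced in the report.
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF16_LE + text.encode('utf-16-le'))

    # Wrong codec: each byte becomes its own character, including the BOM
    # and the \x00 high bytes.
    garbled = read_first_line(path, 'cp1252')
    # Right codec: the BOM selects the byte order and is then dropped.
    decoded = read_first_line(path, 'utf-16')

print(repr(garbled))
print(repr(decoded))
```

Passing `encoding=` explicitly makes the result independent of the machine's locale settings, which is the point of the advice in the message above.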
Date User Action Args
2014-01-27 17:27:02 r.david.murray set status: open -> closed; nosy: + r.david.murray; messages: + msg209453; resolution: not a bug; stage: resolved
2014-01-27 17:11:38 m123orning create