Issue 28246: Unable to read simple text file

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72433

classification

Title:	Unable to read simple text file
Type:	behavior	Stage:	resolved
Components:	IO, Unicode, Windows	Versions:	Python 3.6, Python 3.5

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	AndreyTomsk, SilentGhost, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority:	normal	Keywords:

Created on 2016-09-22 08:15 by AndreyTomsk, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
ResourceStrings.rc	AndreyTomsk, 2016-09-22 08:15	problematic text file

Messages (5)
msg277206	Author: (AndreyTomsk)	Date: 2016-09-22 08:15
The file read operation fails when it encounters a specific Cyrillic symbol. Tested with this script:

    testFile = open('ResourceStrings.rc', 'r')
    for line in testFile:
        print(line)

Exception message:

    Traceback (most recent call last):
      File "min_test.py", line 6, in <module>
        for line in testFile:
      File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
msg277207	Author: Eryk Sun (eryksun) *	Date: 2016-09-22 08:29
The default encoding on your system is Windows codepage 1251. However, your file is encoded using UTF-8:

    >>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
    >>> print(*lines, sep='\n')
    b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
    b'\xd0\x98 (cyrillic I) <<< line read fails'
    b'\xd0\x91 (cyrillic B)'

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding to built-in open():

    >>> print(open('ResourceStrings.rc', encoding='utf-8').read())
    А (cyrillic A)
    И (cyrillic I) <<< line read fails
    Б (cyrillic B)
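A minimal sketch of the fix described above, written as a standalone script rather than a REPL session. It recreates a small UTF-8 file with a BOM in a temporary directory (the file contents here are an approximation of the attachment, not a copy of it) and reads it three ways. The use of 'utf-8-sig' is an addition not mentioned in the thread: it is a standard Python codec that decodes UTF-8 and drops a leading BOM, whereas plain 'utf-8' leaves the BOM in the text as U+FEFF.

```python
import os
import tempfile

# Approximate the attached file: UTF-8 text preceded by a BOM.
# (Contents are illustrative, not a byte-for-byte copy of the attachment.)
text = "А (cyrillic A)\nИ (cyrillic I)\nБ (cyrillic B)\n"
path = os.path.join(tempfile.mkdtemp(), "ResourceStrings.rc")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + text.encode("utf-8"))

# Plain UTF-8 decodes the letters correctly but keeps the BOM as U+FEFF.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' decodes UTF-8 and strips a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

# Forcing cp1251 reproduces the reported failure on any platform.
try:
    with open(path, encoding="cp1251") as f:
        f.read()
    cp1251_error = None
except UnicodeDecodeError as exc:
    cp1251_error = exc

print(repr(with_bom[:1]))   # the BOM character survives with 'utf-8'
print(without_bom, end="")
print(cp1251_error)
```

Passing the encoding explicitly also makes the script behave the same way on systems whose default encoding is not codepage 1251.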
msg277210	Author: SilentGhost (SilentGhost) *	Date: 2016-09-22 08:50
It would be good to add a FAQ / HowTo entry for this question.
msg277214	Author: (AndreyTomsk)	Date: 2016-09-22 10:18
Thanks for the quick reply. I'm new to Python; I just used the tutorial docs and didn't read carefully enough to notice the encoding info. Still, IMHO the behaviour is not consistent. Of three sequential letters in the Russian alphabet (З, И, К), it crashes on И and displays the other two in two-byte form.
msg277215	Author: Eryk Sun (eryksun) *	Date: 2016-09-22 10:33
Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 produces nonsense, otherwise known as mojibake. It happens that codepage 1251 maps every one of the 256 possible byte values, except for 0x98 (152). The exception can't be made any clearer.
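The apparent inconsistency between З, И and К can be checked directly. The sketch below is an illustration, not part of the original reply: it encodes each letter as UTF-8, tries to decode the bytes as cp1251, and then scans all 256 byte values for ones the codec cannot decode. Each letter becomes two bytes beginning with 0xD0; only И has 0x98 as its second byte, which is why it alone raises, while the other two turn into pairs of unrelated characters.

```python
# Decode the UTF-8 bytes of each letter as cp1251 and record what happens.
results = {}
for ch in "ЗИК":
    raw = ch.encode("utf-8")  # two bytes per letter, the first is 0xD0
    try:
        results[ch] = raw.decode("cp1251")  # mojibake: two wrong characters
    except UnicodeDecodeError as exc:
        bad = exc.object[exc.start]
        results[ch] = f"fails on byte {bad:#x}"
    print(ch, raw, ascii(results[ch]))

# Find every single byte value that Python's cp1251 codec cannot decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1251")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

The scan uses Python's own cp1251 decoding table, so it reflects what open() does on the reporter's system rather than any other vendor's variant of the code page.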

History
Date	User	Action	Args
2022-04-11 14:58:37	admin	set	github: 72433
2016-09-22 10:33:53	eryksun	set	messages: + msg277215
2016-09-22 10:18:17	AndreyTomsk	set	messages: + msg277214
2016-09-22 08:50:38	SilentGhost	set	nosy: + SilentGhost messages: + msg277210
2016-09-22 08:29:10	eryksun	set	status: open -> closed nosy: + eryksun messages: + msg277207 resolution: not a bug stage: resolved
2016-09-22 08:15:11	AndreyTomsk	create