classification
Title: Unable to read simple text file
Type: behavior Stage: resolved
Components: IO, Unicode, Windows Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: AndreyTomsk, SilentGhost, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2016-09-22 08:15 by AndreyTomsk, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
ResourceStrings.rc AndreyTomsk, 2016-09-22 08:15 problematic text file
Messages (5)
msg277206 - (view) Author: (AndreyTomsk) Date: 2016-09-22 08:15
A file read operation fails when it encounters a specific Cyrillic character. Tested with this script:

testFile = open('ResourceStrings.rc', 'r')
for line in testFile:
    print(line)


Exception message:
Traceback (most recent call last):
  File "min_test.py", line 6, in <module>
    for line in testFile:
  File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
msg277207 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-22 08:29
The default encoding on your system is Windows codepage 1251. However, your file is encoded using UTF-8:

    >>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
    >>> print(*lines, sep='\n')
    b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
    b'\xd0\x98 (cyrillic I) <<< line read fails'
    b'\xd0\x91 (cyrillic B)'

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding to built-in open():

    >>> print(open('ResourceStrings.rc', encoding='utf-8').read())
    А (cyrillic A)
    И (cyrillic I) <<< line read fails
    Б (cyrillic B)
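
One detail worth noting about the BOM: with encoding='utf-8' it survives decoding as a leading U+FEFF character, whereas the standard 'utf-8-sig' codec strips it. The following is a minimal sketch, assuming a temporary file built from the first two lines of the byte dump above (the temporary directory and the variable names are illustrative, not part of the original report):

```python
import os
import tempfile

# First two lines of the byte dump above, including the UTF-8 BOM.
sample = b'\xef\xbb\xbf\xd0\x90 (cyrillic A)\n\xd0\x98 (cyrillic I)\n'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the attached file.
    path = os.path.join(tmp, 'ResourceStrings.rc')
    with open(path, 'wb') as f:
        f.write(sample)

    # Plain UTF-8 keeps the BOM as a leading U+FEFF character.
    with open(path, encoding='utf-8') as f:
        with_bom = f.read()

    # utf-8-sig drops a leading BOM if one is present.
    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.read()

print(ascii(with_bom[:1]))
print(ascii(without_bom[:1]))
```

If the file may or may not start with a BOM, 'utf-8-sig' is usually the safer choice for reading, since it also accepts input without one.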
msg277210 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-09-22 08:50
It would be good to add a FAQ / HowTo entry for this question.
msg277214 - (view) Author: (AndreyTomsk) Date: 2016-09-22 10:18
Thanks for the quick reply. I'm new to Python; I just used the tutorial docs and didn't read carefully enough to notice the encoding info.

Still, IMHO the behaviour is not consistent. Of three consecutive letters in the Russian alphabet (З, И, К), it crashes on И and displays the others in two-byte form.
msg277215 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-22 10:33
Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 produces nonsense, otherwise known as mojibake. It happens that codepage 1251 maps every one of the 256 possible byte values, except for 0x98 (152). The exception can't be made any clearer.
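
To make the З, И, К case concrete, here is a small sketch (the helper name decode_as_cp1251 is made up for illustration). It encodes each letter as UTF-8, decodes the bytes as codepage 1251, and then scans all 256 byte values for ones that codepage 1251 cannot decode:

```python
def decode_as_cp1251(text):
    """Encode text as UTF-8, then (wrongly) decode the bytes as cp1251."""
    raw = text.encode('utf-8')
    try:
        return raw.decode('cp1251')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte.
        return 'error at byte 0x%02x' % raw[exc.start]

# UTF-8 encodes these letters as d0 97, d0 98 and d0 9a respectively.
results = {ch: decode_as_cp1251(ch) for ch in 'ЗИК'}

# Collect every single byte value that cp1251 leaves undefined.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode('cp1251')
    except UnicodeDecodeError:
        undefined.append(value)

for ch, outcome in results.items():
    print(ch, ch.encode('utf-8'), ascii(outcome))
print([hex(v) for v in undefined])
```

З and К each turn into two unrelated but valid characters, which is the "two-byte form" mentioned above. Only И fails, because the second byte of its UTF-8 encoding is 0x98.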
History
Date                 User         Action  Args
2022-04-11 14:58:37  admin        set     github: 72433
2016-09-22 10:33:53  eryksun      set     messages: + msg277215
2016-09-22 10:18:17  AndreyTomsk  set     messages: + msg277214
2016-09-22 08:50:38  SilentGhost  set     nosy: + SilentGhost; messages: + msg277210
2016-09-22 08:29:10  eryksun      set     status: open -> closed; nosy: + eryksun; messages: + msg277207; resolution: not a bug; stage: resolved
2016-09-22 08:15:11  AndreyTomsk  create