
classification
Title: file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows
Type: behavior Stage: resolved
Components: IO, Unicode, Windows Versions: Python 3.9, Python 3.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: RohanA, eryksun, ezio.melotti, jayman, methane, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2021-06-25 23:06 by RohanA, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name: Bug Reproduction Code.zip
Uploaded: RohanA, 2021-06-25 23:06
Description: Program that causes bug
Messages (3)
msg396532 - (view) Author: Rohan Amin (RohanA) Date: 2021-06-25 23:06
When using file.read() with a large text file, there is a UnicodeDecodeError. I expected file.read(1) to read one character from the file. It works with a smaller text file. I experienced this bug on Windows 10 version 20H2. My teacher couldn't reproduce this bug on Linux.
msg396534 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-06-26 00:25
The file that fails contains a UTF-8 BOM at the start, which is a multibyte character indicating that the file is definitely UTF-8.

Unfortunately, none of Python's default settings will handle this, because it's a convention that only really exists on Windows.

On Windows we currently still default to your console encoding, since that is what we have always done and changing it by default is very complex. Apparently your console encoding does not include the character represented by the first byte of the BOM - in any case, it's not a character you'd ever want to see, so if it _had_ worked, you'd just have garbage in your read data.

The immediate fix for your scenario is to use "open(filename, 'r', encoding='utf-8-sig')" which will handle the BOM correctly.
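
A minimal sketch of the suggested fix, assuming a throwaway file written with a UTF-8 BOM. The helper name read_first_char and the file name bom.txt are hypothetical and used only for illustration. It contrasts 'utf-8-sig', which drops the BOM, with plain 'utf-8', which returns the BOM as a U+FEFF character.

```python
import codecs
import os
import tempfile


def read_first_char(path, encoding="utf-8-sig"):
    """Return the first decoded character of a text file (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read(1)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom.txt")
    # Write a UTF-8 BOM followed by UTF-8 encoded text.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

    # 'utf-8-sig' consumes the BOM, so the first character is the real text.
    first = read_first_char(path)
    # Plain 'utf-8' decodes the BOM as U+FEFF and returns it as data.
    with_bom = read_first_char(path, encoding="utf-8")

print(repr(first), repr(with_bom))
```

Passing the encoding explicitly also makes the result independent of the platform default, which is presumably why the problem did not reproduce on Linux.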

For the core team, I still think it's worth having the default encoding be able to read and drop the UTF-8 BOM from the start of a file. Since we shouldn't do it for any arbitrary operation (which may not be at the start of a file), it'd have to be a special default object for the TextIOWrapper case, but it would have solved this issue. If the BOM is there, it can switch to UTF-8 (or UTF-16, if that BOM exists); if not, it can use whatever the default would have been (based on all the other available settings).
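
The idea in the previous paragraph can be approximated in user code. The following is only a sketch of BOM sniffing under stated assumptions, not how TextIOWrapper behaves or a proposed implementation: sniff_encoding and open_text are hypothetical names, and the fallback is whatever default the caller supplies (or the locale's preferred encoding).

```python
import codecs
import locale
import os
import tempfile


def sniff_encoding(path, default=None):
    """Pick an encoding from a leading BOM, else fall back to `default`."""
    if default is None:
        default = locale.getpreferredencoding(False)
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    # Caveat: a UTF-32-LE BOM also starts with FF FE and is not handled here.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # the codec uses the BOM to pick the byte order
    return default


def open_text(path, default=None):
    """Open a file for reading text with the sniffed encoding."""
    return open(path, "r", encoding=sniff_encoding(path, default))


with tempfile.TemporaryDirectory() as tmp:
    bom_path = os.path.join(tmp, "bom.txt")
    plain_path = os.path.join(tmp, "plain.txt")
    with open(bom_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"abc")
    with open(plain_path, "wb") as f:
        f.write(b"abc")

    bom_choice = sniff_encoding(bom_path, default="cp1252")
    plain_choice = sniff_encoding(plain_path, default="cp1252")
    with open_text(bom_path, default="cp1252") as f:
        bom_first = f.read(1)
```

Sniffing only happens once, at the start of the file, which matches the point above that BOM handling should not apply to arbitrary reads in the middle of a stream.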
msg396535 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-06-26 02:08
> On Windows we currently still default to your console encoding

In Windows, the default encoding for open() is the ANSI code page of the current process [1], from GetACP(), which is based on the system locale, unless it's overridden to UTF-8 in the application manifest. The console encoding is unrelated and not something we use much anymore since io._WindowsConsoleIO was introduced in Python 3.6.
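
To see which encoding open() would pick when none is given, something like the following can be run. This is a sketch: locale.getpreferredencoding(False) reflects the default for the Python versions listed on this issue, and the ctypes call to GetACP() is guarded so it only runs on Windows.

```python
import codecs
import locale
import sys

# The encoding open() falls back to when no encoding argument is passed.
default_encoding = locale.getpreferredencoding(False)
# Normalize the name through the codec registry (e.g. 'cp1252', 'utf-8').
normalized = codecs.lookup(default_encoding).name
print("default text encoding:", default_encoding, "->", normalized)

if sys.platform == "win32":
    import ctypes

    # ANSI code page of the current process, as described above.
    acp = ctypes.windll.kernel32.GetACP()
    print("GetACP():", f"cp{acp}")
```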
History
Date | User | Action | Args
2022-04-11 14:59:47 | admin | set | github: 88676
2021-06-26 02:08:36 | eryksun | set | status: open -> closed; versions: + Python 3.6, Python 3.9, - Python 3.11; nosy: + eryksun; messages: + msg396535; resolution: not a bug; stage: resolved
2021-06-26 00:25:16 | steve.dower | set | nosy: + methane; title: file.read() UnicodeDecodeError with large files on Windows -> file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows; messages: + msg396534; versions: + Python 3.11, - Python 3.6, Python 3.9
2021-06-25 23:06:54 | jayman | set | nosy: + jayman
2021-06-25 23:06:12 | RohanA | create |