Issue 44510: file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/88676

classification

Title:	file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows
Type:	behavior	Stage:	resolved
Components:	IO, Unicode, Windows	Versions:	Python 3.9, Python 3.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	RohanA, eryksun, ezio.melotti, jayman, methane, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority:	normal	Keywords:

Created on 2021-06-25 23:06 by RohanA, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description
Bug Reproduction Code.zip	RohanA, 2021-06-25 23:06	Program that causes bug

Messages (3)
msg396532	Author: Rohan Amin (RohanA)	Date: 2021-06-25 23:06
When using file.read() with a large text file, there is a UnicodeDecodeError. I expected file.read(1) to read one character from the file. It works with a smaller text file. I experienced this bug on Windows 10 version 20H2. My teacher couldn't reproduce this bug on Linux.
msg396534	Author: Steve Dower (steve.dower) *	Date: 2021-06-26 00:25
The file that fails contains a UTF-8 BOM at the start, which is a multibyte character indicating that the file is definitely UTF-8. Unfortunately, none of Python's default settings will handle this, because it's a convention that only really exists on Windows.

On Windows we currently still default to your console encoding, since that is what we have always done and changing it by default is very complex. Apparently your console encoding does not include the character represented by the first byte of the BOM - in any case, it's not a character you'd ever want to see, so if it _had_ worked, you'd just have garbage in your read data.

The immediate fix for your scenario is to use "open(filename, 'r', encoding='utf-8-sig')" which will handle the BOM correctly.

For the core team, I still think it's worth having the default encoding be able to read and drop the UTF-8 BOM from the start of a file. Since we shouldn't do it for any arbitrary operation (which may not be at the start of a file), it'd have to be a special default object for the TextIOWrapper case, but it would have solved this issue. If the BOM is there, it can switch to UTF-8 (or UTF-16, if that BOM exists); if not, it can use whatever the default would have been (based on all the other available settings).
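
The `utf-8-sig` suggestion above can be checked with a small self-contained script. This is only a sketch: the temporary file and its contents are made up for illustration and are not the file attached to this issue. It writes a UTF-8 file that begins with a BOM, then reads it back with `utf-8` and with `utf-8-sig`.

```python
import codecs
import os
import tempfile

# Create a throwaway UTF-8 file that starts with a BOM (bytes EF BB BF).
# The content is illustrative only, not the reporter's attachment.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
    path = tmp.name

try:
    # Plain UTF-8 decodes the BOM as the character U+FEFF and keeps it.
    with open(path, "r", encoding="utf-8") as f:
        plain = f.read()

    # utf-8-sig drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        first_char = f.read(1)   # first real character of the text
        rest = f.read()
finally:
    os.remove(path)

print(repr(plain))
print(repr(first_char), repr(rest))
```

With `utf-8-sig`, `read(1)` returns the first character of the text rather than the BOM, which is the behaviour the original report expected. The codec also accepts files that have no BOM, so it is safe to use when the presence of a BOM is unknown.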
msg396535	Author: Eryk Sun (eryksun) *	Date: 2021-06-26 02:08
> On Windows we currently still default to your console encoding

In Windows, the default encoding for open() is the ANSI code page of the current process [1], from GetACP(), which is based on the system locale, unless it's overridden to UTF-8 in the application manifest. The console encoding is unrelated and not something we use much anymore since io._WindowsConsoleIO was introduced in Python 3.6.
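
To see which encoding open() picks on a given machine when no `encoding` argument is passed, a quick check along these lines may help. It is a sketch, not part of the original thread: the temporary file exists only so that there is something to open, and the printed names will differ between systems (an ANSI code page name such as `cp1252` is typical on Windows, assuming UTF-8 mode is not enabled).

```python
import codecs
import locale
import os
import tempfile

# Encoding name reported by the locale module. open() is documented to
# fall back to the locale encoding when no encoding argument is given.
locale_default = locale.getpreferredencoding(False)

# Open an empty throwaway file without an encoding argument and ask the
# resulting TextIOWrapper which encoding it actually selected.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "r") as f:
        actual_default = f.encoding
finally:
    os.remove(path)

# Normalize the spelling (for example "UTF-8" vs "utf-8") via the codec registry.
normalized = codecs.lookup(actual_default).name

print("locale default:", locale_default)
print("open() default:", actual_default, "->", normalized)
```

If the reported default is not UTF-8, passing `encoding="utf-8-sig"` (or `encoding="utf-8"` for files without a BOM) explicitly avoids depending on the system locale.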

History
Date	User	Action	Args
2022-04-11 14:59:47	admin	set	github: 88676
2021-06-26 02:08:36	eryksun	set	status: open -> closed; versions: + Python 3.6, Python 3.9, - Python 3.11; nosy: + eryksun; messages: + msg396535; resolution: not a bug; stage: resolved
2021-06-26 00:25:16	steve.dower	set	nosy: + methane; title: file.read() UnicodeDecodeError with large files on Windows -> file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows; messages: + msg396534; versions: + Python 3.11, - Python 3.6, Python 3.9
2021-06-25 23:06:54	jayman	set	nosy: + jayman
2021-06-25 23:06:12	RohanA	create