Title: UTF-8 encoding not enforced
Components: Unicode
Versions: Python 3.4
Nosy List: benjamin.peterson, ezio.melotti, jwilk, loewis, serhiy.storchaka, vstinner
Created on 2013-12-10 11:48 by jwilk, last changed 2022-04-11 14:57 by admin.

Files: two uploads by jwilk (2013-12-10 11:49 and 2013-12-11 10:42)
Messages (3)
msg205795 - Author: Jakub Wilk (jwilk) Date: 2013-12-10 11:48
I created a Python file that contains a non-UTF-8 string literal (but no Unicode literals) and added a "UTF-8" encoding declaration to it. I expected Python to raise SyntaxError when importing such a module, but it doesn't:

$ python --version
Python 2.7.6

$ python -c 'import test1' && echo ok

Curiously enough, if I change the declaration to "UTF8", then the exception is raised as expected:

$ sed -e 's/UTF-8/UTF8/' < >
$ python -c 'import test2'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "", line 2
SyntaxError: 'utf8' codec can't decode byte 0xa1 in position 5: invalid start byte
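For reference, the check that was expected to happen can be sketched outside the interpreter. This is only an illustration under stated assumptions: the sample bytes and the helper name find_invalid_utf8 are hypothetical and are not the attached test file, and the sketch does not reproduce what the tokenizer does internally. It strictly decodes each source line as UTF-8 and reports the first failure.

```python
def find_invalid_utf8(source: bytes):
    """Return (line_number, byte_offset_in_line) of the first byte
    sequence that is not valid UTF-8, or None if the source is clean.

    Hypothetical helper for illustration; not part of CPython.
    """
    for lineno, line in enumerate(source.splitlines(keepends=True), start=1):
        try:
            # Strict decoding raises UnicodeDecodeError on malformed input.
            line.decode("utf-8", errors="strict")
        except UnicodeDecodeError as exc:
            return lineno, exc.start
    return None


# Assumed sample: a UTF-8 declaration followed by a lone 0xA1 byte
# inside a string literal (0xA1 cannot start a UTF-8 sequence).
SAMPLE = b"# encoding: UTF-8\ns = '\xa1'\n"

if __name__ == "__main__":
    print(find_invalid_utf8(SAMPLE))  # prints (2, 5)
```

With this assumed sample, the offending byte is found on line 2 at offset 5, which matches the location reported in the traceback for the "UTF8" spelling.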
msg205827 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-12-10 15:24
Yes, this is a silly bug where we "shortcut" decoding of UTF-8 files by not checking whether they are valid UTF-8. However, this behavior has been around for a long time, so I'm not going to change it in 2.7.x.
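As a small aside, the two spellings from the report name the same codec, which is why a difference in behavior between them is surprising. The snippet below is a minimal sketch using the standard codecs registry; the helper name same_codec is hypothetical, and the snippet says nothing about how the tokenizer itself compares encoding names.

```python
import codecs


def same_codec(name_a: str, name_b: str) -> bool:
    """Return True if both names resolve to the same registered codec.

    Hypothetical helper for illustration only.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


if __name__ == "__main__":
    # Both spellings used in the report are aliases of one codec.
    print(same_codec("UTF-8", "UTF8"))
```

So from the codec registry's point of view the declarations are equivalent; only the source-decoding path treats them differently.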
msg205901 - Author: Jakub Wilk (jwilk) Date: 2013-12-11 10:42
With a slightly adapted test case, I see the same behavior in Python 3.3.3. Perhaps it would be worth fixing the bug in Python 3.4?
