Use support.TESTFN_UNDECODABLE on UNIX #60648

vstinner · 2012-11-08T22:52:14Z

BPO	16444
Nosy	@pitrou, @vstinner, @ezio-melotti, @asvetlov, @serhiy-storchaka
Files	support_undecodable.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2012-12-04.20:53:30.192>
created_at = <Date 2012-11-08.22:52:14.333>
labels = ['type-feature', 'tests', 'expert-unicode']
title = 'Use support.TESTFN_UNDECODABLE on UNIX'
updated_at = <Date 2013-01-03.01:07:37.018>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2013-01-03.01:07:37.018>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2012-12-04.20:53:30.192>
closer = 'vstinner'
components = ['Tests', 'Unicode']
creation = <Date 2012-11-08.22:52:14.333>
creator = 'vstinner'
dependencies = []
files = ['27928']
hgrepos = []
issue_num = 16444
keywords = ['patch']
message_count = 22.0
messages = ['175200', '175201', '175202', '175209', '175221', '175222', '175223', '175271', '175272', '175275', '175291', '175296', '175396', '175399', '175402', '175406', '175413', '175423', '176893', '176955', '176958', '178868']
nosy_count = 6.0
nosy_names = ['pitrou', 'vstinner', 'ezio.melotti', 'asvetlov', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16444'
versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

vstinner · 2012-11-08T22:52:12Z

Attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: use the filesystem encoding in *strict* mode, not using the surrogateescape error handler.

So we can use support.TESTFN_UNDECODABLE to check whether a function correctly uses the surrogateescape error handler and/or whether it behaves correctly with non-ASCII characters.
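As a rough illustration of the strict-mode check described above (not the actual patch), the sketch below picks the first byte string that the filesystem encoding cannot decode without an error handler. The helper name `find_undecodable` and the candidate list are assumptions made for this example; the real patch may probe different bytes.

```python
import sys

# Hypothetical candidate list, for illustration only.
CANDIDATES = (b"\xff", b"\xe7w\xf0", b"\xae\xd5", b"\xed\xb2\x80")


def find_undecodable(encoding, candidates=CANDIDATES):
    """Return the first candidate that fails *strict* decoding, or None."""
    for raw in candidates:
        try:
            # No error handler is passed, so "strict" is used:
            # surrogateescape would accept any byte and hide the failure.
            raw.decode(encoding)
        except UnicodeDecodeError:
            return raw
    return None


if __name__ == "__main__":
    # The result depends on the platform and locale.
    print(find_undecodable(sys.getfilesystemencoding()))
```

With an encoding such as ISO-8859-1, which maps every byte to a character, the helper returns None, which is the case where no undecodable filename can be built.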

The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to check that the fix for bpo-16218 works with the UTF-8 locale encoding.

Please test the patch on UNIX, Windows and Mac OS X.

We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check.

Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph.

vstinner · 2012-11-08T22:53:12Z

The patch contains two print statements to help debug the patch itself; they must be removed later.

+print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
+print("TESTFN_NONASCII = %a" % TESTFN_NONASCII)

vstinner · 2012-11-08T23:04:17Z

We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows

Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0).
http://bugs.python.org/issue4036#msg100376

So it's better to use TESTFN_NONASCII instead for this test ;-) It confirms that we need two constants depending on the context. The right one depends on the platform and on how the data is read or written: sometimes undecodable characters are supported on any platform (ex: base64 encoder), sometimes they are not supported at all (ex: distutils expects valid metadata), and sometimes it depends on the platform (ex: this test).

vstinner · 2012-11-08T23:50:56Z

Please test the patch on UNIX, Windows and Mac OS X.

The full test suite passes on:

Linux with UTF-8 locale encoding
Linux with ASCII locale encoding
Windows with cp932 ANSI code page
Mac OS X 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding) ($LANG, $LC_ALL, $LC_CTYPE are not set)

serhiy-storchaka · 2012-11-09T11:00:23Z

Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings.

b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
b'\xd5' : iso8859-8 cp856 cp857
b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255
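A table like the one above can be spot-checked with a few lines of Python. This is only a sketch: the helper name `undecodable_in` is made up for the example, and the results reflect the codecs shipped with the running Python version, which may differ from the 2012 codecs for some encodings.

```python
def undecodable_in(raw, encodings):
    """Return the encodings (from `encodings`) that cannot strictly decode `raw`."""
    failing = []
    for name in encodings:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            failing.append(name)
    return failing


if __name__ == "__main__":
    # A small, arbitrary subset of the encodings listed above.
    sample = ("cp1251", "cp1252", "shift_jis", "latin-1")
    for raw in (b"\x81", b"\x98", b"\xae", b"\xd5", b"\xff"):
        print(raw, ":", " ".join(undecodable_in(raw, sample)))
```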

serhiy-storchaka · 2012-11-09T11:09:49Z

Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X).

b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass').
b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape.
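The two byte strings can be reproduced with the standard codecs. The snippet below is a small check of the statements above, using only built-in `bytes.decode` and `str.encode`; the variable and function names are chosen for this example.

```python
# b'\x80' is not valid UTF-8; surrogateescape maps it to the lone surrogate U+DC80.
escaped = b"\x80".decode("utf-8", "surrogateescape")

# surrogatepass writes that surrogate out as a 3-byte sequence.
first = escaped.encode("utf-8", "surrogatepass")

# U+DD00 is outside U+DC80..U+DCFF, the only range surrogateescape can encode.
second = "\udd00".encode("utf-8", "surrogatepass")


def is_strictly_decodable(raw, encoding="utf-8"):
    """Return True if `raw` decodes without any error handler."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def surrogateescape_can_encode(text, encoding="utf-8"):
    """Return True if `text` can be encoded with the surrogateescape handler."""
    try:
        text.encode(encoding, "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(first, second)
    print(is_strictly_decodable(first), is_strictly_decodable(second))
    print(surrogateescape_can_encode("\udd00"))
```

Both sequences are rejected by the strict UTF-8 decoder, which is what makes them usable as undecodable filename components under a UTF-8 locale.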

serhiy-storchaka · 2012-11-09T11:14:21Z

The full test suite passes on:

The matter is not only in the fact that tests passed. They should fail if the original bug occurs again. Have you tried to restore the bugs?

vstinner · 2012-11-10T10:50:07Z

The matter is not only in the fact that tests passed.

Right, but I don't want to introduce a regression :-)

They should fail if the original bug occurs again. Have you tried to restore the bugs?

test_cmd_line_script.test_non_ascii() comes from the issue bpo-16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX.

test_genericpath.test_non_ascii() comes from the issue bpo-3426, whose fix comes from the issue bpo-3187, changeset 8a7c930abab6. I don't want to spend time trying the new test against this issue: changeset 8a7c930abab6 is a major change, and I don't see how to revert it just to test the issue. I consider the issue as fixed, and the new test should not reduce the test coverage, but just increase it ;-)

python-dev · 2012-11-10T11:07:35Z

New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
Issue bpo-16444, bpo-16218: Use TESTFN_UNDECODABLE on UNIX
http://hg.python.org/cpython/rev/6b8a8bc6ba9c

serhiy-storchaka · 2012-11-10T12:21:28Z

TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding which has undecodable bytes.

pitrou · 2012-11-10T18:24:50Z

I suppose you noticed you broke a bunch of buildbots :)

python-dev · 2012-11-10T21:31:50Z

New changeset 398f8770bf0d by Victor Stinner in branch 'default':
Issue bpo-16444: disable undecodable characters in test_non_ascii() test until
http://hg.python.org/cpython/rev/398f8770bf0d

vstinner · 2012-11-11T21:51:53Z

TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.

The Python encoding and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes to not rely too much on the Python codec.

serhiy-storchaka · 2012-11-11T22:08:50Z

These encodings are used not only on Windows.

vstinner · 2012-11-11T22:15:48Z

I suppose you noticed you broke a bunch of buildbots :)

Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Extract of a comment:

# _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is
# C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris
# and Mac OS X.

Mac OS X is now using UTF-8 to decode the command line arguments.

I just created the issue bpo-16455 to fix FreeBSD and OpenIndiana.

I propose to close this issue because I consider it as fixed (bpo-16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script).
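To see which encoding the C library announces for the current locale, as discussed above, one can query `nl_langinfo(CODESET)` from Python. This is a minimal sketch: the helper name `announced_codeset` is invented for the example, `locale.nl_langinfo` is only available on some UNIX platforms, and the value returned depends on the environment and on the Python version.

```python
import locale


def announced_codeset():
    """Return the encoding name announced by nl_langinfo(CODESET), or None.

    None is returned on platforms where locale.nl_langinfo is unavailable
    (for example Windows).
    """
    nl_langinfo = getattr(locale, "nl_langinfo", None)
    if nl_langinfo is None:
        return None
    return nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    # Running with LC_ALL=C shows what the "C" locale announces on this system.
    print(announced_codeset())
```

The announced name is only what the C library reports; as the message above notes, the encoding actually used to decode bytes may differ on some systems.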

vstinner · 2012-11-11T23:12:17Z

These encodings are used not only on Windows.

You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding).

python-dev · 2012-11-12T00:24:15Z

New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
Issue bpo-16218, bpo-16444: Backport improvment on tests for non-ASCII characters
http://hg.python.org/cpython/rev/6017f09ead53

serhiy-storchaka · 2012-11-12T08:05:43Z

You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding).

At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Even now, CP1251 is the default encoding for the Byelorussian and Bulgarian locales:

$ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255

serhiy-storchaka · 2012-12-04T10:40:08Z

Ping.

python-dev · 2012-12-04T20:42:01Z

New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default':
Issue bpo-16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages
http://hg.python.org/cpython/rev/ed0ff4b3d1c4

vstinner · 2012-12-04T20:53:30Z

Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-))

python-dev · 2013-01-03T00:59:41Z

New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc

New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566

vstinner added the tests and topic-unicode labels Nov 8, 2012

serhiy-storchaka added the type-feature label Dec 4, 2012

vstinner closed this as completed Dec 4, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
