This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Use support.TESTFN_UNDECODABLE on UNIX
Type: enhancement Stage: patch review
Components: Tests, Unicode Versions: Python 3.2, Python 3.3, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: asvetlov, ezio.melotti, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-11-08 22:52 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
support_undecodable.patch vstinner, 2012-11-08 22:52 review
Messages (22)
msg175200 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-08 22:52
The attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: it uses the filesystem encoding in *strict* mode, rather than with the surrogateescape error handler.

So we can use support.TESTFN_UNDECODABLE to check whether a function correctly uses the surrogateescape error handler and/or whether it behaves correctly with non-ASCII characters.

The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii(), to check that the fix for #16218 also works with the UTF-8 locale encoding.

Please test the patch on UNIX, Windows and Mac OS X.

We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check.

Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph.
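
For readers following along, the strict-mode check described at the top of this message could be sketched roughly as below. This is only an illustration, not the code from the attached patch: the helper name `first_undecodable` and the candidate byte strings are assumptions chosen for the example.

```python
import sys


def first_undecodable(candidates, encoding=None):
    """Return the first byte string that fails to decode in strict mode.

    Hypothetical helper: returns None when every candidate decodes.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    for name in candidates:
        try:
            # errors="strict" is the default, so surrogateescape is NOT used.
            name.decode(encoding)
        except UnicodeDecodeError:
            return name
    return None


# Candidate list is an assumption made for this sketch.
CANDIDATES = (b'\xff', b'\xae\xd5', b'\xed\xb2\x80')
print(first_undecodable(CANDIDATES))
```

With an encoding such as ISO-8859-1, where every byte decodes, the helper returns None, which matches the idea that no undecodable name is available for such locales.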
msg175201 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-08 22:53
The patch contains two print statements to help debug the patch itself; they must be removed later.
 
+print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
+print("TESTFN_NONASCII = %a" % TESTFN_NONASCII)
msg175202 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-08 23:04
> We may also use support.TESTFN_UNDECODABLE
> in test_cmd_line_script.test_non_ascii() on Windows

Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0).
http://bugs.python.org/issue4036#msg100376

So it's better to use TESTFN_NONASCII for this test instead ;-) It confirms that we need two constants depending on the context. It depends on the platform and on how the data is read or written: sometimes undecodable characters are supported on any platform (e.g. a base64 encoder), sometimes they are not supported at all (e.g. distutils expects valid metadata), and sometimes it depends on the platform (e.g. this test).
msg175209 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-08 23:50
> Please test the patch on UNIX, Windows and Mac OS X.

The full test suite passes on:

 * Linux with UTF-8 locale encoding
 * Linux with ASCII locale encoding
 * Windows with cp932 ANSI code page
 * Mac OS 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding) ($LANG, $LC_ALL, $LC_CTYPE are not set)
msg175221 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-09 11:00
Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings.

b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
b'\xd5' : iso8859-8 cp856 cp857
b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255
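
A table like the one above can be reproduced with a short script. The sketch below is an assumption about how such a check might be written (the helper name `undecodable_in` is hypothetical), and the results depend on the codec tables shipped with the Python version in use.

```python
def undecodable_in(data, encodings):
    """Return the encodings from `encodings` that fail to decode `data` in strict mode."""
    failed = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            failed.append(encoding)
    return failed


# A small subset of encodings, chosen for illustration only.
ENCODINGS = ['latin-1', 'cp1251', 'cp1252', 'iso8859-3', 'iso8859-7', 'iso8859-8']
for byte in (b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'):
    print(repr(byte), ':', ' '.join(undecodable_in(byte, ENCODINGS)))
```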
msg175222 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-09 11:09
Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X).

b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass').
b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape.
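
The two byte sequences above can be checked interactively. The following is a small sketch using only the standard error handlers; the helper names `is_strictly_decodable` and `can_surrogateescape_encode` are made up for this example.

```python
def is_strictly_decodable(data, encoding='utf-8'):
    """True if `data` decodes with the default strict error handler."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def can_surrogateescape_encode(text, encoding='utf-8'):
    """True if `text` can be encoded with the surrogateescape error handler."""
    try:
        text.encode(encoding, 'surrogateescape')
    except UnicodeEncodeError:
        return False
    return True


# b'\x80' is escaped to the lone surrogate U+DC80, whose surrogatepass
# encoding is the first byte sequence suggested above.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')
assert escaped.encode('utf-8', 'surrogatepass') == b'\xed\xb2\x80'

# U+DD00 is outside the U+DC80..U+DCFF range handled by surrogateescape.
assert '\udd00'.encode('utf-8', 'surrogatepass') == b'\xed\xb4\x80'
assert not can_surrogateescape_encode('\udd00')

# Neither sequence is valid UTF-8 in strict mode.
assert not is_strictly_decodable(b'\xed\xb2\x80')
assert not is_strictly_decodable(b'\xed\xb4\x80')
```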
msg175223 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-09 11:14
> The full test suite passes on:

It is not only a matter of the tests passing.  They should fail if the original bug occurs again.  Have you tried to reintroduce the bugs?
msg175271 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-10 10:50
> The matter is not only in the fact that tests passed.

Right, but I don't want to introduce a regression :-)

> They should fail if the original bug occurs again.  Have you tried to restore the bugs?

test_cmd_line_script.test_non_ascii() comes from the issue #16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX.

test_genericpath.test_non_ascii() comes from the issue #3426, whose fix comes from the issue #3187, changeset 8a7c930abab6. I don't want to spend time trying the new test against this issue, because changeset 8a7c930abab6 is a major change and I don't see how to revert it just to test the issue. I consider the issue fixed, and the new test should not reduce the test coverage, but just increase it ;-)
msg175272 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-10 11:07
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
Issue #16444, #16218: Use TESTFN_UNDECODABLE on UNIX
http://hg.python.org/cpython/rev/6b8a8bc6ba9c
msg175275 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-10 12:21
TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.  Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding which has undecodable bytes.
msg175291 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-11-10 18:24
I suppose you noticed you broke a bunch of buildbots :)
msg175296 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-10 21:31
New changeset 398f8770bf0d by Victor Stinner in branch 'default':
Issue #16444: disable undecodable characters in test_non_ascii() test until
http://hg.python.org/cpython/rev/398f8770bf0d
msg175396 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 21:51
> TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.

The Python encoding and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes to not rely too much on the Python codec.
msg175399 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-11 22:08
These encodings are used not only on Windows.
msg175402 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 22:15
> I suppose you noticed you broke a bunch of buildbots :)

Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Extract of a comment:

# _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is  
# C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris 
# and Mac OS X.                                                     

Mac OS X is now using UTF-8 to decode the command line arguments.

I just created the issue #16455 to fix FreeBSD and OpenIndiana.

I propose to close this issue because I consider it as fixed (#16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script).
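
To see the mismatch described above on a given machine, one can compare what the locale announces with what Python uses for file names. This is a rough diagnostic sketch, not part of any changeset; the function name `describe_encodings` is an assumption, and `locale.nl_langinfo` is only available on some UNIX platforms, hence the guard.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current environment."""
    info = {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
    }
    # nl_langinfo(CODESET) is the value that announces ASCII in the "C"
    # locale on the affected buildbots; it does not exist on Windows.
    if hasattr(locale, 'nl_langinfo'):
        info['codeset'] = locale.nl_langinfo(locale.CODESET)
    return info


for key, value in sorted(describe_encodings().items()):
    print('%s = %s' % (key, value))
```

Running it with the locale variables unset (for example with LC_ALL=C) shows what the "C" locale announces on that platform.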
msg175406 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-11 23:12
> These encodings are used not only on Windows.

You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding).
msg175413 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-12 00:24
New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
Issue #16218, #16444: Backport improvment on tests for non-ASCII characters
http://hg.python.org/cpython/rev/6017f09ead53
msg175423 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-12 08:05
> You can use cpXXX encodings explicitly to read or write a file, but these
> encodings are not used for sys.getfilesystemencoding() (or
> sys.stdout.encoding).

At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Even now, CP1251 is the default encoding for the Byelorussian and Bulgarian locales:

$ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255
msg176893 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-04 10:40
Ping.
msg176955 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-12-04 20:42
New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default':
Issue #16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages
http://hg.python.org/cpython/rev/ed0ff4b3d1c4
msg176958 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-12-04 20:53
Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-))
msg178868 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-03 00:59
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue #16218, #16414, #16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc

New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue #16218, #16414, #16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566
History
Date User Action Args
2022-04-11 14:57:38 admin set github: 60648
2013-01-03 01:07:37 vstinner set versions: + Python 3.2, Python 3.3
2013-01-03 00:59:43 python-dev set messages: + msg178868
2012-12-04 20:53:30 vstinner set status: open -> closed; resolution: fixed; messages: + msg176958
2012-12-04 20:42:00 python-dev set messages: + msg176955
2012-12-04 10:40:07 serhiy.storchaka set type: enhancement; messages: + msg176893; stage: patch review
2012-11-15 15:53:18 asvetlov set nosy: + asvetlov
2012-11-12 08:05:43 serhiy.storchaka set messages: + msg175423
2012-11-12 00:24:14 python-dev set messages: + msg175413
2012-11-11 23:12:16 vstinner set messages: + msg175406
2012-11-11 22:15:48 vstinner set messages: + msg175402
2012-11-11 22:08:49 serhiy.storchaka set messages: + msg175399
2012-11-11 21:51:52 vstinner set messages: + msg175396
2012-11-10 21:31:49 python-dev set messages: + msg175296
2012-11-10 18:24:50 pitrou set nosy: + pitrou; messages: + msg175291
2012-11-10 12:21:27 serhiy.storchaka set messages: + msg175275
2012-11-10 11:07:35 python-dev set nosy: + python-dev; messages: + msg175272
2012-11-10 10:50:07 vstinner set messages: + msg175271
2012-11-09 11:14:20 serhiy.storchaka set messages: + msg175223
2012-11-09 11:09:49 serhiy.storchaka set messages: + msg175222
2012-11-09 11:00:23 serhiy.storchaka set messages: + msg175221
2012-11-08 23:50:56 vstinner set messages: + msg175209
2012-11-08 23:04:16 vstinner set messages: + msg175202
2012-11-08 22:53:12 vstinner set messages: + msg175201
2012-11-08 22:52:14 vstinner create