msg175200 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-08 22:52 |
Attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: it uses the filesystem encoding in *strict* mode, without the surrogateescape error handler.
So we can use support.TESTFN_UNDECODABLE to check whether a function correctly uses the surrogateescape error handler and/or behaves correctly with non-ASCII characters.
The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to check that the fix for #16218 also works with the UTF-8 locale encoding.
Please test the patch on UNIX, Windows and Mac OS X.
We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check.
Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph.
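A minimal sketch of the idea, not the code from the patch: try each candidate byte string against the filesystem encoding with the strict error handler, and keep the first one that fails to decode. The candidate list and the helper name `find_undecodable` are assumptions made for illustration only.

```python
import sys

# Hypothetical candidate list for illustration; the patch may test other bytes.
CANDIDATES = (b'\xff', b'\xe7w\xf0', b'\x81\x98', b'\xae\xd5')


def find_undecodable(encoding, candidates=CANDIDATES):
    """Return the first candidate that `encoding` rejects in strict mode.

    Returns None when every candidate decodes, for example with latin-1,
    which maps all 256 byte values to characters.
    """
    for name in candidates:
        try:
            name.decode(encoding, 'strict')
        except UnicodeDecodeError:
            return name
    return None


# The result depends on the platform and on the locale settings.
print(find_undecodable(sys.getfilesystemencoding()))
```

A name found this way cannot be decoded without an error handler such as surrogateescape, which is what makes it useful for checking that the handler is really applied.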
|
msg175201 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-08 22:53 |
The patch contains two print statements to help debug the patch itself; they must be removed later.
+print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
+print("TESTFN_NONASCII = %a" % TESTFN_NONASCII)
|
msg175202 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-08 23:04 |
> We may also use support.TESTFN_UNDECODABLE
> in test_cmd_line_script.test_non_ascii() on Windows
Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0).
http://bugs.python.org/issue4036#msg100376
So it's better to use TESTFN_NONASCII instead for this test ;-) It confirms that we need two constants depending on the context. It depends on the platform and on how the data is read/written: sometimes undecodable characters are supported on any platform (e.g. base64 encoder), sometimes they are not supported (e.g. distutils expects valid metadata), and sometimes it depends on the platform (e.g. this test).
|
msg175209 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-08 23:50 |
> Please test the patch on UNIX, Windows and Mac OS X.
The full test suite passes on:
* Linux with UTF-8 locale encoding
* Linux with ASCII locale encoding
* Windows with cp932 ANSI code page
* Mac OS 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding) ($LANG, $LC_ALL, $LC_CTYPE are not set)
|
msg175221 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-09 11:00 |
Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings.
b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
b'\xd5' : iso8859-8 cp856 cp857
b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255
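A list like the one above can be rechecked with a short script. This is only a sketch: the helper name `undecodable_in` and the selection of encodings are assumptions, and the results reflect Python's own codec tables, which may differ from the codecs used by the operating system.

```python
def undecodable_in(data, encodings):
    """Return the encodings from `encodings` that reject `data` in strict mode."""
    rejected = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            rejected.append(encoding)
    return rejected


# Small, arbitrary selection of codecs shipped with Python.
ENCODINGS = ('latin-1', 'cp1251', 'cp1252', 'shift_jis', 'iso8859-7')

for byte in (b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'):
    print(byte, undecodable_in(byte, ENCODINGS))
```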
|
msg175222 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-09 11:09 |
Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X).
b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass').
b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape.
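The two byte sequences can be checked in an interpreter. The snippet below is a small sketch that only uses the standard UTF-8 codec with the surrogateescape and surrogatepass error handlers:

```python
# Byte 0x80 is not valid UTF-8 on its own; surrogateescape maps it to U+DC80.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')
assert escaped == '\udc80'

# surrogatepass writes a lone surrogate out as a 3-byte sequence.
assert escaped.encode('utf-8', 'surrogatepass') == b'\xed\xb2\x80'
assert '\udd00'.encode('utf-8', 'surrogatepass') == b'\xed\xb4\x80'

# Strict UTF-8 decoding rejects both sequences.
for raw in (b'\xed\xb2\x80', b'\xed\xb4\x80'):
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        print(raw, 'is undecodable in strict mode')

# U+DD00 is outside U+DC80..U+DCFF, so surrogateescape cannot encode it.
try:
    '\udd00'.encode('utf-8', 'surrogateescape')
except UnicodeEncodeError:
    print("'\\udd00' cannot be encoded with surrogateescape")
```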
|
msg175223 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-09 11:14 |
> The full test suite passes on:
The matter is not only in the fact that tests passed. They should fail if the original bug occurs again. Have you tried to restore the bugs?
|
msg175271 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-10 10:50 |
> The matter is not only in the fact that tests passed.
Right, but I don't want to introduce a regression :-)
> They should fail if the original bug occurs again. Have you tried to restore the bugs?
test_cmd_line_script.test_non_ascii() comes from the issue #16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX.
test_genericpath.test_non_ascii() comes from the issue #3426, this fix comes from the issue #3187, changeset 8a7c930abab6. I don't want to spend time on trying the new test on this issue because changeset 8a7c930abab6 is a major change, and I don't see how to revert it just to test the issue. I consider the issue as fixed, and the new test should not reduce the test coverage, but just increase it ;-)
|
msg175272 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-11-10 11:07 |
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
Issue #16444, #16218: Use TESTFN_UNDECODABLE on UNIX
http://hg.python.org/cpython/rev/6b8a8bc6ba9c
|
msg175275 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-10 12:21 |
TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding which has undecodable bytes.
|
msg175291 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-11-10 18:24 |
I suppose you noticed you broke a bunch of buildbots :)
|
msg175296 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-11-10 21:31 |
New changeset 398f8770bf0d by Victor Stinner in branch 'default':
Issue #16444: disable undecodable characters in test_non_ascii() test until
http://hg.python.org/cpython/rev/398f8770bf0d
|
msg175396 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-11 21:51 |
> TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.
The Python encoding and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes to not rely too much on the Python codec.
|
msg175399 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-11 22:08 |
These encodings are used not only on Windows.
|
msg175402 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-11 22:15 |
> I suppose you noticed you broke a bunch of buildbots :)
Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Extract of a comment:
# _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is
# C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris
# and Mac OS X.
Mac OS X is now using UTF-8 to decode the command line arguments.
I just created the issue #16455 to fix FreeBSD and OpenIndiana.
I propose to close this issue because I consider it as fixed (#16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script).
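To see which encodings are involved on a given machine, the values can be printed from Python. This is only a diagnostic sketch; it does not reproduce what main() and _Py_char2wchar() do in C, and the values reported depend on the platform, the Python version and the environment variables.

```python
import locale
import sys

# Encoding Python uses for file names and command line arguments.
print('filesystem encoding:', sys.getfilesystemencoding())

# Encoding suggested by the current locale settings.
print('preferred encoding: ', locale.getpreferredencoding(False))

# nl_langinfo() and CODESET are only available on some Unix systems.
if hasattr(locale, 'nl_langinfo') and hasattr(locale, 'CODESET'):
    print('nl_langinfo(CODESET):', locale.nl_langinfo(locale.CODESET))
```

Running it with `LC_ALL=C` set in the environment shows what the "C" locale announces on that system.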
|
msg175406 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-11-11 23:12 |
> These encodings are used not only on Windows.
You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding).
|
msg175413 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-11-12 00:24 |
New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
Issue #16218, #16444: Backport improvment on tests for non-ASCII characters
http://hg.python.org/cpython/rev/6017f09ead53
|
msg175423 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-11-12 08:05 |
> You can use cpXXX encodings explicitly to read or write a file, but these
> encodings are not used for sys.getfilesystemencoding() (or
> sys.stdout.encoding).
At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Even now CP1251 is the default encoding for the Byelorussian and Bulgarian locales:
$ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255
|
msg176893 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-04 10:40 |
Ping.
|
msg176955 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-12-04 20:42 |
New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default':
Issue #16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages
http://hg.python.org/cpython/rev/ed0ff4b3d1c4
|
msg176958 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-12-04 20:53 |
Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-))
|
msg178868 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-01-03 00:59 |
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue #16218, #16414, #16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc
New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue #16218, #16414, #16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566
|
|
Date | User | Action | Args |
2022-04-11 14:57:38 | admin | set | github: 60648 |
2013-01-03 01:07:37 | vstinner | set | versions: + Python 3.2, Python 3.3 |
2013-01-03 00:59:43 | python-dev | set | messages: + msg178868 |
2012-12-04 20:53:30 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg176958 |
2012-12-04 20:42:00 | python-dev | set | messages: + msg176955 |
2012-12-04 10:40:07 | serhiy.storchaka | set | type: enhancement; messages: + msg176893; stage: patch review |
2012-11-15 15:53:18 | asvetlov | set | nosy: + asvetlov |
2012-11-12 08:05:43 | serhiy.storchaka | set | messages: + msg175423 |
2012-11-12 00:24:14 | python-dev | set | messages: + msg175413 |
2012-11-11 23:12:16 | vstinner | set | messages: + msg175406 |
2012-11-11 22:15:48 | vstinner | set | messages: + msg175402 |
2012-11-11 22:08:49 | serhiy.storchaka | set | messages: + msg175399 |
2012-11-11 21:51:52 | vstinner | set | messages: + msg175396 |
2012-11-10 21:31:49 | python-dev | set | messages: + msg175296 |
2012-11-10 18:24:50 | pitrou | set | nosy: + pitrou; messages: + msg175291 |
2012-11-10 12:21:27 | serhiy.storchaka | set | messages: + msg175275 |
2012-11-10 11:07:35 | python-dev | set | nosy: + python-dev; messages: + msg175272 |
2012-11-10 10:50:07 | vstinner | set | messages: + msg175271 |
2012-11-09 11:14:20 | serhiy.storchaka | set | messages: + msg175223 |
2012-11-09 11:09:49 | serhiy.storchaka | set | messages: + msg175222 |
2012-11-09 11:00:23 | serhiy.storchaka | set | messages: + msg175221 |
2012-11-08 23:50:56 | vstinner | set | messages: + msg175209 |
2012-11-08 23:04:16 | vstinner | set | messages: + msg175202 |
2012-11-08 22:53:12 | vstinner | set | messages: + msg175201 |
2012-11-08 22:52:14 | vstinner | create | |