Use support.TESTFN_UNDECODABLE on UNIX #60648
Comments
The attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: it decodes with the filesystem encoding in *strict* mode instead of using the surrogateescape error handler. This lets us use support.TESTFN_UNDECODABLE to check that a function uses the surrogateescape error handler correctly and/or behaves correctly with non-ASCII characters. The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to check that the fix for bpo-16218 works with the UTF-8 locale encoding. Please test the patch on UNIX, Windows and Mac OS X. We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows; I will check. Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph. |
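As a rough illustration of the strict-mode detection described above (not the patch itself), the sketch below tries a few candidate byte strings against an encoding and keeps the first one that fails to decode. The helper name `first_undecodable` and the candidate list are assumptions made for this example.

```python
def first_undecodable(candidates, encoding):
    """Return the first candidate that `encoding` rejects in strict mode, else None."""
    for name in candidates:
        try:
            # 'strict' is the default error handler; surrogateescape is
            # deliberately not used, otherwise nothing would ever fail.
            name.decode(encoding)
        except UnicodeDecodeError:
            return name
    return None


# Hypothetical candidate list, chosen only for illustration.
CANDIDATES = (b'\xff', b'\xae\xd5', b'\xed\xb2\x80')

print(first_undecodable(CANDIDATES, 'utf-8'))       # b'\xff'
print(first_undecodable(CANDIDATES, 'iso-8859-1'))  # None: every byte decodes
```

With an encoding such as ISO-8859-1, which maps all 256 byte values, no undecodable name exists, so a test relying on one would have to be skipped.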
The patch contains two print statements to help debug the patch itself; they must be removed later. +print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE) |
Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0). So it's better to use TESTFN_NONASCII instead for this test ;-) It confirms that we need two constants depending on the context. It depends on the platform and on how the data is read/written: sometimes undecodable characters are supported on any platform (ex: base64 encoder), sometimes undecodable characters are not supported (ex: distutils expects valid metadata), sometimes it depends on the platform (ex: this test). |
The full test suite passes on:
|
Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. They are undecodable in all 1-byte encodings. b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258 |
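One way to check such a claim is to try a byte against several codecs and collect those that reject it. This is a minimal sketch; the helper `rejecting_encodings` and the short encoding list are assumptions for the example, not part of the patch.

```python
def rejecting_encodings(data, encodings):
    """Return the encodings that fail to decode `data` in strict mode."""
    rejected = []
    for enc in encodings:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            rejected.append(enc)
    return rejected


# A small, hand-picked sample of codecs (assumption for illustration).
ENCODINGS = ['shift_jis', 'cp1251', 'cp1252', 'iso-8859-1']

for byte in (b'\x81', b'\x98'):
    print(byte, rejecting_encodings(byte, ENCODINGS))
```

In this sample, b'\x81' decodes in cp1251 but b'\x98' does not, which is why several candidate bytes are needed rather than just one.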
Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X). b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass'). |
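The relationship stated above can be verified directly. The sketch below only uses the standard `surrogateescape` and `surrogatepass` error handlers; the variable names are chosen for this example.

```python
# b'\x80' is not valid UTF-8; surrogateescape maps it to the lone surrogate U+DC80.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')

# surrogatepass lets the lone surrogate through the UTF-8 encoder.
encoded = escaped.encode('utf-8', 'surrogatepass')
print(ascii(escaped), encoded)

# Strict UTF-8 refuses to decode encoded surrogates, so both byte strings
# are usable as "undecodable" file name parts for a UTF-8 locale.
strict_failures = []
for raw in (b'\xed\xb2\x80', b'\xed\xb4\x80'):
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        strict_failures.append(raw)
print(strict_failures)
```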
It is not enough that the tests pass. They should also fail if the original bug occurs again. Have you tried restoring the bugs? |
Right, but I don't want to introduce a regression :-)
test_cmd_line_script.test_non_ascii() comes from the issue bpo-16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with the UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX. test_genericpath.test_non_ascii() comes from the issue bpo-3426; this fix comes from the issue bpo-3187, changeset 8a7c930abab6. I don't want to spend time trying the new test on this issue because 8a7c930abab6 is a major change, and I don't see how to revert it just to test the issue. I consider the issue as fixed, and the new test should not reduce the test coverage, but just increase it ;-) |
New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default': |
TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff': at least one of these bytes is undecodable in any encoding that has undecodable bytes at all. |
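The suggestion can be checked against the six code pages named above with Python's own codecs. This is a sketch only: the names `PROBE`, `ENCODINGS` and `is_undecodable` are assumptions for the example, and the result says nothing about how the operating system's native codec behaves.

```python
PROBE = b'\x81\x98\xae\xd5\xff'
ENCODINGS = ('cp1250', 'cp1251', 'cp1252', 'cp1254', 'cp1257', 'cp1258')


def is_undecodable(data, encoding):
    """True if `encoding` rejects `data` in strict mode."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


results = {enc: is_undecodable(PROBE, enc) for enc in ENCODINGS}
for enc, rejected in results.items():
    print(enc, 'rejects the probe' if rejected else 'decodes the probe')
```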
I suppose you noticed you broke a bunch of buildbots :) |
New changeset 398f8770bf0d by Victor Stinner in branch 'default': |
The Python encoding and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes to not rely too much on the Python codec. |
These encodings are used not only on Windows. |
Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Extract of a comment: # _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is [...] Mac OS X is now using UTF-8 to decode the command line arguments. I just created the issue bpo-16455 to fix FreeBSD and OpenIndiana. I propose to close this issue because I consider it fixed (bpo-16455 will reenable TESTFN_UNDECODABLE in test_cmd_line_script). |
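To see which encodings a given machine announces, something like the following can be run on a buildbot. It is only a sketch: the helper name `describe_encodings` is an assumption, `locale.nl_langinfo()` is not available on every platform (hence the guard), and the output does not reveal what the C library really does when decoding bytes in the "C" locale.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current process."""
    info = {
        'filesystem': sys.getfilesystemencoding(),
        'preferred': locale.getpreferredencoding(False),
    }
    # nl_langinfo() is missing on some platforms (for example Windows).
    if hasattr(locale, 'nl_langinfo'):
        info['nl_langinfo'] = locale.nl_langinfo(locale.CODESET)
    return info


for key, value in describe_encodings().items():
    print(key, '=', value)
```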
You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding). |
New changeset 6017f09ead53 by Victor Stinner in branch '3.3': |
At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). CP1251 is currently the default encoding for Byelorussian and Bulgarian: $ grep CP /usr/share/i18n/SUPPORTED
be_BY CP1251
bg_BG CP1251
ru_RU.CP1251 CP1251
yi_US CP1255 |
Ping. |
New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default': |
Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-)) |
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2': New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.