classification
Title: TESTFN_UNICODE and TESTFN_UNDECODABLE
Type:
Stage:
Components: Tests, Unicode
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: vstinner
Nosy List: ocean-city, vstinner
Priority: normal
Keywords: patch

Created on 2010-09-10 09:39 by ocean-city, last changed 2010-09-13 20:39 by vstinner. This issue is now closed.

Files
File name                    Uploaded                     Description
find_unencode_filename-2.py  vstinner, 2010-09-10 11:25
unicode_file.patch           vstinner, 2010-09-11 05:56
Messages (13)
msg115989 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-10 09:39
Hello. I noticed that the test suite reports a WARNING every time.

///////////////////////////////////////////////////

E:\python-dev>py3k -m test.regrtest test_os
WARNING: The filename '@test_464_tmp-共有される' CAN be encoded by the filesystem encoding (mbcs). Unicode filename tests may not be effective
(snip)

///////////////////////////////////////////////////

This happens because TESTFN_UNICODE_UNDECODABLE in Lib/test/support.py
*is* encodable in a Japanese environment (cp932).

It is easy to make it really unencodable in Japanese by adding
characters like "\u2661" or "\u2668" (the former is a heart mark, the
latter is the "Onsen" hot spring mark). With this change the warning went away:
    TESTFN_UNENCODABLE = TESTFN + "-\u5171\u6709\u3055\u308c\u308b\u2661\u2668"
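
The claim above can be checked on any platform by encoding the candidate suffixes with the cp932 codec directly instead of going through mbcs. This is only a minimal sketch: `can_encode` is a hypothetical helper, not something from Lib/test/support.py, and it assumes Python's own cp932 codec behaves like the Windows code page for these characters.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Suffix shown in the warning, and the same suffix with the two extra
# characters proposed in this message.
original_suffix = "-\u5171\u6709\u3055\u308c\u308b"
proposed_suffix = original_suffix + "\u2661\u2668"

for candidate in (original_suffix, proposed_suffix):
    print(ascii(candidate), "encodable in cp932:", can_encode(candidate, "cp932"))
```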

///////////////////////////////////////////////////

There is another issue, which happens only in test_unicode_file:

///////////////////////////////////////////////////

E:\python-dev>py3k -m test.test_unicode_file
Traceback (most recent call last):
  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 12, in <module>
    TESTFN_UNICODE.encode(TESTFN_ENCODING)
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\python-dev\py3k\lib\runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "e:\python-dev\py3k\lib\runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 16, in <module>
    raise unittest.SkipTest("No Unicode filesystem semantics on this platform.")

unittest.case.SkipTest: No Unicode filesystem semantics on this platform.

///////////////////////////////////////////////////

This happens because TESTFN_UNICODE cannot be encoded in Japanese.

E:\python-dev>py3k
Python 3.2a2+ (py3k:84663M, Sep 10 2010, 13:24:41) [MSC v.1400 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("-\xe0\xf2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xe0' in position 1: illegal multibyte sequence

Interestingly, though, the byte sequence "\xe0\xf2" can be read as a
cp932 multibyte character.

E:\python-dev>python
Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print "\xe0\xf2"
瑣
>>> "\xe0\xf2".decode("cp932")
u'\u7463'

E:\python-dev>py3k
Python 3.2a2+ (py3k:84663M, Sep 10 2010, 13:24:41) [MSC v.1400 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u7463')
瑣

I believe the value "\xe0\xf2" came from Python 2.x; maybe "\u7463"
should be used here instead? I'm not sure it can be encoded everywhere
with other encodings, though.
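
The difference between the two interpreters comes down to what the literal means: in Python 2 "\xe0\xf2" is a two-byte string, while in Python 3 it is two code points (U+00E0 and U+00F2). The following Python 3 sketch reproduces both readings using only the values shown in the sessions above; the variable names are illustrative.

```python
# Python 2 reading: two raw bytes, which form one cp932 double-byte character.
raw = b"\xe0\xf2"
as_cp932 = raw.decode("cp932")
print(ascii(as_cp932))

# Python 3 reading: two Latin-1 range code points, which cp932 cannot represent.
text = "\xe0\xf2"
try:
    text.encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("encodable in cp932:", encodable)

# The same two code points map back to the original bytes under Latin-1.
print(text.encode("latin-1") == raw)
```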
msg115991 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-10 09:42
One more small thing: I noticed that the variable name differs between
Python 2.x and Python 3.x.
TESTFN_UNICODE_UNDECODEABLE (2.x)
TESTFN_UNICODE_UNDECODABLE  (3.x)

I think the 2.x name should be changed to match the 3.x name. Thanks.
msg115994 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 10:27
> WARNING: The filename '@test_464_tmp-共有される' CAN be encoded 
> by (...) cp932

We should find a character that is not encodable in any Windows code page but is accepted in filenames.

> characters like "\u2661" or "\u2668" (...)

mbcs uses "ANSI" code pages: cp1250..cp1258 and cp874 (and maybe others because you wrote that your setup uses cp932):
http://en.wikipedia.org/wiki/Code_page#Windows_.28ANSI.29_code_pages

I wrote a short script to find an unencodable filename (attached to this issue). Output:

u'\u0301' is encodable to cp1258
u'\u0363' is not encodable to any code page
u'\u2661' is encodable to cp949
u'\u5171' is encodable to cp932, cp936, cp949, cp950

(CODE_PAGES constant of the script might be incomplete)

u'\u2661' is not a good candidate. u'\u0363' looks better. But we can mix different characters to reduce the probability that the whole string is encodable. Example:

u'\u2661\u5171' is encodable to cp949
u'\u0301\u0363\u2661\u5171' is not encodable to any code page
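
The attached script is not reproduced in this thread. As a rough Python 3 sketch of the same idea, one can try each candidate string against a list of code page codecs and report which ones accept it. The `CODE_PAGES` tuple and the `encodable_code_pages` helper below are assumptions made for illustration, and Python's codecs may not match the Windows code pages exactly.

```python
# Assumed list of "ANSI" code pages; like the original, it may be incomplete.
CODE_PAGES = ("cp874", "cp932", "cp936", "cp949", "cp950") + tuple(
    "cp%d" % number for number in range(1250, 1259)
)


def encodable_code_pages(text, code_pages=CODE_PAGES):
    """Return the code pages that can encode every character of text."""
    found = []
    for name in code_pages:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        found.append(name)
    return found


for sample in ("\u0301", "\u0363", "\u2661", "\u5171", "\u0301\u0363\u2661\u5171"):
    pages = encodable_code_pages(sample)
    if pages:
        print("%a is encodable to %s" % (sample, ", ".join(pages)))
    else:
        print("%a is not encodable to any code page" % sample)
```

A string is accepted by a code page only if every one of its characters is, so adding characters drawn from unrelated scripts can only shrink the set of code pages that accept the whole string.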

> TESTFN_UNICODE_UNDECODEABLE (2.x)

This is a typo fixed by r83987 in py3k.
msg115997 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 10:45
See also #9820.
msg115999 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-10 10:57
Thank you for the reply.

> u'\u2661' is encodable to cp949
Doh!

I can imagine it's difficult to find such a character. ;-)
msg116000 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-10 11:01
I also confirmed that '\u0363' can be used as a filename.
The "dir" command cannot print the filename correctly, though.

 Directory of E:\python-dev\foo

2010/09/10  19:44       <DIR>          .
2010/09/10  19:44       <DIR>          ..
2010/09/10  19:44                    3 ͣ
               1 File(s)              3 bytes
               2 Dir(s)   2,788,741,120 bytes free
msg116007 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 11:25
> "dir" command cannot print filename correctly, though.

Who cares? We just have to be able to create a file with a name containing non encodable characters, list the directory, and then remove this evil file.
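
That requirement (create, list, remove) can be sketched in a few lines. `roundtrip_filename` is a hypothetical helper written for this note, not code from the test suite; it uses a temporary directory and reports failure instead of raising when the platform refuses the name.

```python
import os
import tempfile


def roundtrip_filename(name):
    """Create a file called name, check that it is listed, then remove it.

    Returns True if all three steps succeed and False if the platform
    cannot create the file or does not list it under the same name.
    """
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, name)
        try:
            with open(path, "wb") as handle:
                handle.write(b"foo")
        except (UnicodeEncodeError, OSError):
            return False
        listed = name in os.listdir(directory)
        os.remove(path)
        return listed


print(roundtrip_filename("\u0363"))
```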

--

With r84666, Python uses "-\u5171\u6709\u3055\u308c\u308b" suffix for TESTFN_UNENCODABLE. Does it fix the issue on your host?

I attached an improved version of find_unencode_filename.py (with more code pages).

--

> > TESTFN_UNICODE_UNDECODEABLE (2.x)
> This is a typo fixed by r83987 in py3k.

I backported the fix to 2.7 (r84667).
msg116010 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-10 11:43
Thank you for the fix.

> Who cares? We just have to be able to create a file with a name
> containing non encodable characters, list the directory, and then
> remove this evil file.

I won't. ;-)
Sorry, that was not a complaint. I just thought it was
interesting to see such a white box in the console, and to see that the
file can be opened and deleted correctly.
msg116012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 12:21
> With r84666, Python uses "-\u5171\u6709\u3055\u308c\u308b" 
> suffix for TESTFN_UNENCODABLE.

Backported to 3.1 as r84668. I don't want to patch Python 2.x (its Unicode support is weaker and the code is too different from Python 3), so I am closing the issue.
msg116074 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-11 03:55
How about TESTFN_UNICODE (test_unicode_file) issue?
Should I reopen this entry, or is that report invalid?
msg116081 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-11 05:56
> How about TESTFN_UNICODE (test_unicode_file) issue?

  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 12, in <module>
    TESTFN_UNICODE.encode(TESTFN_ENCODING)
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

That's because mbcs encoding is more strict on Windows. But this test is not needed on Windows. Please try attached patch (unicode_file.patch).
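
The contents of unicode_file.patch are not shown in this thread. Purely as an illustration of the idea that the encodability check can be skipped on Windows, a guard might look like the sketch below; `check_unicode_filename` and its `platform` parameter are hypothetical names, and the real patch may be structured differently.

```python
import sys
import unittest


def check_unicode_filename(name, encoding, platform=sys.platform):
    """Raise SkipTest if name cannot be encoded, except on Windows."""
    if platform == "win32":
        # Assumption: on Windows, Python 3 passes str filenames to the
        # wide-character API, so the mbcs round trip is not required.
        return
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        raise unittest.SkipTest(
            "No Unicode filesystem semantics on this platform."
        )


# On Windows the check is bypassed even for a name cp932 cannot encode.
check_unicode_filename("\xe0\xf2", "cp932", platform="win32")
```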
msg116082 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-09-11 07:23
Thank you, your patch works.

E:\python-dev\py3k>py3k -m test.test_unicode_file
test_directories (__main__.TestUnicodeFiles) ... ok
test_single_files (__main__.TestUnicodeFiles) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.611s

OK
[69875 refs]
msg116094 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-11 12:53
> Thank you, your patch works.

Ok, patch committed to 3.2 as r84710. Thanks for your feedback.
History
Date                 User                Action  Args
2010-09-13 20:39:35  vstinner            set     status: open -> closed; resolution: fixed
2010-09-11 12:53:20  vstinner            set     messages: + msg116094
2010-09-11 07:23:40  ocean-city          set     messages: + msg116082
2010-09-11 05:56:51  vstinner            set     files: + unicode_file.patch
2010-09-11 05:56:39  vstinner            set     status: closed -> open; keywords: + patch; resolution: fixed -> (no value); messages: + msg116081
2010-09-11 03:55:04  ocean-city          set     messages: + msg116074
2010-09-10 12:21:55  vstinner            set     status: open -> closed; resolution: fixed; messages: + msg116012
2010-09-10 11:43:55  ocean-city          set     messages: + msg116010
2010-09-10 11:25:17  vstinner            set     files: - find_unencode_filename.py
2010-09-10 11:25:02  vstinner            set     files: + find_unencode_filename-2.py; messages: + msg116007
2010-09-10 11:01:49  ocean-city          set     messages: + msg116000
2010-09-10 10:57:38  ocean-city          set     messages: + msg115999
2010-09-10 10:45:34  vstinner            set     messages: + msg115997
2010-09-10 10:27:52  vstinner            set     files: + find_unencode_filename.py; messages: + msg115994
2010-09-10 09:48:09  amaury.forgeotdarc  set     assignee: vstinner; nosy: + vstinner
2010-09-10 09:42:11  ocean-city          set     messages: + msg115991
2010-09-10 09:39:51  ocean-city          create