classification
Title: incorrect os.path.supports_unicode_filenames
Type: behavior
Stage: patch review
Components: Library (Lib), Tests
Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, ezio.melotti, flox, haypo, joe.amenta, jvr, loewis, ned.deily, r.david.murray, ronaldoussoren
Priority: low
Keywords: patch

Created on 2003-07-08 09:42 by jvr, last changed 2010-09-17 23:37 by haypo. This issue is now closed.

Files
File name Uploaded Description
test_supports_unicode_filenames.patch joe.amenta, 2010-01-12 21:35 Tests supports_unicode_filenames against its documented value - fails on Linux
posixpath_darwin.patch haypo, 2010-09-14 11:47
Messages (30)
msg16955 - (view) Author: Just van Rossum (jvr) * Date: 2003-07-08 09:42
At least on OSX, unicode file names are pretty much fully 
supported, yet os.path.supports_unicode_filenames is False 
(it comes from posixpath.py, which hard codes it). What 
would be a proper way to detect unicode filename support 
for posix platforms?
msg16956 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-07-09 18:07
What happens if you try to create a file using Unicode names?  
Could a test get the temp directory for the platform, write a file 
with Unicode in it, and then check for an error?  Or if it always 
succeeds, write it, and then see if the results match?

In other words, does writing Unicode to an ASCII file system ever 
lead to a mangling of the name?
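
A minimal sketch of the round-trip check proposed here, assuming Python 3 and a writable temporary directory. The helper name `roundtrips_unicode_name` and its default argument are hypothetical and are not part of any patch attached to this issue:

```python
import os
import tempfile
import unicodedata


def roundtrips_unicode_name(name="\xe9t\xe9.txt"):
    """Return True if `name` can be created and is listed back unchanged.

    Hypothetical helper: returns False if creation fails or if the listed
    name differs from the requested one even after NFC normalization.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        try:
            with open(os.path.join(tmpdir, name), "w"):
                pass
        except (UnicodeEncodeError, OSError):
            return False
        wanted = unicodedata.normalize("NFC", name)
        # Compare in NFC so a filesystem that decomposes names still matches.
        return any(
            unicodedata.normalize("NFC", listed) == wanted
            for listed in os.listdir(tmpdir)
        )
```

A result of True only says that this one name survived on this one directory; it does not prove anything about other names or other mounted filesystems.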
msg16957 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-07-10 21:01
On POSIX platforms in general, detecting Unicode file name
support is not possible. Posix uses open(2), and only
open(2) (along with creat(2), stat(2), etc.) to access files.
There is no open_w, or open_utf8, or the like. So file names
are byte strings on Posix, and it will stay that way forever.
(There is actually also fopen, but that doesn't change the
situation at all).

On OSX, the situation is somewhat different from POSIX, as
you have additional functions to open files (which Python
apparently does not use, though), and because OSX specifies
that the byte strings have to be NFD UTF-8 (which Python
violates AFAICT).

The documentation for supports_unicode_filenames says

True if arbitrary Unicode strings can be used as file names
(within limitations imposed by the file system), and if
\function{os.listdir()} returns Unicode strings for a Unicode
argument.

While the first part is true for OSX, I don't think the
second part is. If that ever gets corrected (or verified),
no further detection is necessary - just set
macpath.supports_unicode_filenames for darwin (assuming you
use macpath.py on that system).
msg16958 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-07-10 21:05
Brett: As for "writing Unicode to an ASCII file system":
there is no such thing. POSIX file systems accept arbitrary
bytes, and don't interpret them except by looking at the
path separator (in ASCII).

So you can put Latin-1, KOI8-r, EUC-JP, UTF-8, gb2312, etc
all on a single file system, and people actually do that.
The convention is that bytes in file names are interpreted
according to the locale's encoding. This is just a
convention, and it has some significant flaws. Python
follows that convention, meaning that you can use arbitrary
Unicode strings in open(), as long as they are supported in
the locale's encoding.
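
To illustrate the convention described above, here is a small sketch (Python 3) of how one and the same Unicode name turns into different byte strings depending on the encoding in use. The helper `posix_name_bytes` is hypothetical, and strict encoding is an assumption made for simplicity:

```python
def posix_name_bytes(name, encoding):
    """Bytes that would be handed to open(2) for `name` under `encoding`."""
    return name.encode(encoding)


name = "f\xf2\xf2b\xe0r"
latin1_bytes = posix_name_bytes(name, "latin-1")
utf8_bytes = posix_name_bytes(name, "utf-8")

# The file system stores whichever bytes it was given; reading a Latin-1
# name back under a UTF-8 locale fails to decode.
try:
    latin1_bytes.decode("utf-8")
    misread = False
except UnicodeDecodeError:
    misread = True
```

Both byte strings are valid file names on a POSIX file system, which is why the file system itself cannot be asked whether it "supports Unicode".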
msg16959 - (view) Author: Just van Rossum (jvr) * Date: 2003-07-10 21:13
> On OSX, the situation is somewhat different from POSIX, as
> you have additional functions to open files (which Python
> apparently does not use, though), and because OSX specifies
> that the byte strings have to be NFD UTF-8 (which Python
> violates AFAICT).

(I'm not 100% sure, but I think the OS corrects that)

> True if arbitrary Unicode strings can be used as file names
> (within limitations imposed by the file system), and if
> \function{os.listdir()} returns Unicode strings for a Unicode
> argument.
> 
> While the first part is true for OSX, I don't think the
> second part is.

It is, we had a long discussion about that back when I 
implemented that ;-)

> If that ever gets corrected (or verified),
> no further detection is necessary - just set
> macpath.supports_unicode_filenames for darwin (assuming you
> use macpath.py on that system). 

Darwin is a posix platform, so I'll have to add a switch to 
posixpath.py. Unless you object to that, I will do that.
msg16960 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-07-10 22:34
>I'm not 100% sure, but I think the OS corrects that

I'm relatively sure that the OS doesn't. The OS won't
complain if you pass a file name that isn't UTF-8 at all -
Finder will then fail to display the file correctly. There
are CoreFoundationsBasicServicesSomething functions that you
are supposed to call to correct that; Python does not use them.

If you think setting the flag for darwin is fine in
posixpath, just go ahead.
msg16961 - (view) Author: Just van Rossum (jvr) * Date: 2003-07-11 07:48
Done in rev. 1.61 of posixpath.py.

(Actually, OSX does complain when you feed open() a non-valid 
utf-8 string (albeit with a misleading error message). The OS also 
makes sure the name is converted to its preferred form, eg. if I 
create a file named u'\xc7', I can also open it as u'C\u0327', and 
os.listdir() will always show the latter, no matter how you created 
the file.)
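
The two spellings mentioned above are the composed and decomposed forms of the same character. A short sketch using the standard `unicodedata` module shows the relationship; `same_filename` is a hypothetical helper, and plain NFD is used here as an approximation of whatever normalization the OS applies:

```python
import unicodedata

composed = "\xc7"        # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # "C" followed by COMBINING CEDILLA


def same_filename(a, b):
    """Compare two names after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


nfd_of_composed = unicodedata.normalize("NFD", composed)
```

Code that compares a name it created against the output of os.listdir() on such a system would need a normalization step like this one before testing for equality.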
msg16962 - (view) Author: Just van Rossum (jvr) * Date: 2003-07-17 16:20
Reopening, as the fix I checked in caused problems in 
test_pep277.py. Postponing work on this until after 2.3 is released.
msg16963 - (view) Author: Just van Rossum (jvr) * Date: 2003-07-17 16:21
(forgot to mention: my checkin was backed out)
msg16964 - (view) Author: Just van Rossum (jvr) * Date: 2005-06-28 09:46
Hmm, two years later and this still hasn't been resolved. Is anyone 
interested in taking a stab at it? It would be nice if it could be fixed for 2.5.

(Btw. the only code using os.path.supports_unicode_filenames that I'm 
aware of is Jason Orendorff's path module.)
msg16965 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-06-28 21:04
I don't care about this issue, as I think
supports_unicode_filenames is a pretty useless property
these days. If somebody changes the current value from False
to True, just make sure that the testsuite still passes.
msg97652 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-12 19:44
Maybe os.path.supports_unicode_filenames should be deprecated.
The doc currently says:
"True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

On Linux both things work, even if the value of os.path.supports_unicode_filenames is still False:
>>> os.path.supports_unicode_filenames
False
>>> open(u'fòòbàr', 'w')
<open file u'f\xf2\xf2b\xe0r', mode 'w' at 0x9470778>
>>> os.listdir(u'.')
[u'f\xf2\xf2b\xe0r', ...]
>>> open(u'fòòbàr')
<open file u'f\xf2\xf2b\xe0r', mode 'r' at 0x9470778>
msg97655 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-12 20:16
In addition, whether or not true unicode filenames are supported really depends, at least on Linux, on the *filesystem*, not on the OS (for some definition of support).  In other words, I think os.path.supports_unicode_filenames is an API design that is broken and should probably be dropped.
msg97658 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-12 21:14
Additionally it filters out test_pep277 on some platforms.

But seemingly, it is not needed anymore with this patch.
msg97660 - (view) Author: Joe Amenta (joe.amenta) Date: 2010-01-12 21:35
If it is decided to keep supports_unicode_filenames, here is a patch for test_os.py that verifies the value of supports_unicode_filenames against the following line from the documentation:
"True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."
msg101132 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-15 18:35
With r78594, test_pep277 is active on all platforms having Unicode-friendly filesystem encoding.
msg114252 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-18 17:22
There are at least three messages stating that os.path.supports_unicode_filenames should go, so can someone please provide a definitive statement regarding its future?
msg116064 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-11 00:10
test_pep277.patch removes the usage of os.path.supports_unicode_filenames from test_pep277: the test still passes on Debian Sid (Linux). Can someone test the patch on Mac OS X, FreeBSD and Solaris (and maybe other POSIX/UNIX OSes)?

About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.
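
The Windows rule described above can be sketched as follows. This is a simplified illustration, not the actual ntpath source; `windows_supports_unicode` is a hypothetical helper that takes the platform id as an argument so that it can be evaluated on any OS:

```python
import sys


def windows_supports_unicode(platform_id):
    """Platform ids: 0 = win32s, 1 = Windows 9x/ME, 2 = NT-based Windows."""
    return platform_id >= 2


# sys.getwindowsversion() only exists on Windows, so guard the call.
if hasattr(sys, "getwindowsversion"):
    flag = windows_supports_unicode(sys.getwindowsversion().platform)
else:
    flag = None  # not applicable on this platform
```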
msg116065 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-11 00:15
Oops, forget test_pep277.patch: I misunderstood r81149 (new way to detect if the filesystem supports unicode or not). test_pep277 fails with my patch on Linux with LC_CTYPE=C.
msg116068 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-11 00:24
r84701 fixes supports_unicode_filenames's definition in Python 3.2 (and r84702 in Python 3.1): os.listdir(str) now always returns unicode filenames (including non-ascii characters).
msg116069 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-11 00:37
> Maybe os.path.supports_unicode_filenames should be deprecated.
> The doc currently says:
> "True if arbitrary Unicode strings can be used as file names 
> (within limitations imposed by the file system), and if os.listdir()
> returns Unicode strings for a Unicode argument."
>
> On Linux both things work, even if the value of 
> os.path.supports_unicode_filenames is still False:
> (...)

It depends on the locale encoding:

$ LC_CTYPE=C ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43) 
>>> import sys; sys.getfilesystemencoding()
'ascii'
>>> open('\xe9', 'w').close()
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

With utf-8, surrogates are forbidden. Eg.

$ ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43) 
>>> import sys; sys.getfilesystemencoding()
'utf-8'
>>> open('\uDC00', 'w').close()
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

On Windows, Python uses the unicode API and so the unicode support doesn't depend on the locale encoding (on the ansi code page). Surrogates are accepted on Windows: '\uDC00' is a valid filename.

I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).
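
The two failures shown in the sessions above can be reproduced without touching the file system. The sketch below assumes strict encoding, which is a simplification of how Python 3 actually encodes file names, and `encodable` is a hypothetical helper:

```python
def encodable(name, encoding):
    """Return True if `name` can be strictly encoded with `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


ascii_ok = encodable("\xe9", "ascii")         # non-ASCII under an ASCII locale
utf8_ok = encodable("\xe9", "utf-8")          # same character under UTF-8
surrogate_ok = encodable("\udc00", "utf-8")   # lone surrogate under UTF-8
```

Whether a given name is usable on a bytes-based platform therefore depends on both the name and the encoding in effect, not on a single per-platform flag.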
msg116214 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 16:31
> About Windows: supports_unicode_filenames is False if
> sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME
> (1). I don't know win32s, but I know that Windows 9x/ME is no longer
> supported.

Win32s is long gone. It was an emulation layer to support Win32 on
Windows 3.1.
msg116215 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 16:36
> I think that supports_unicode_filenames is still useful to check if
> the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or
> characters (Mac OS X, Windows). Mac OS X is a special case because
> the C API uses char* (byte string), but the filesystem encoding is
> fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I
> would like to say that supports_unicode_filenames should be True on
> Mac OS X (which was the initial request).

Sounds reasonable.
msg116347 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-13 20:26
r84784 sets os.path.supports_unicode_filenames to True on Mac OS X (macpath module).

About test_supports_unicode_filenames.patch: test_unicode_listdir() is wrong, because os.listdir(str) always returns str (see r84701). The "verify that the new file's name is equal to the name we tried" check in test_unicode_filename() is also wrong: newfile.name is always equal to fname, regardless of supports_unicode_filenames. Since the test is wrong, I don't want to commit it. test_pep277 is enough to test the creation of files with unicode names.

I don't see anything else to do now, so I close this issue. Reopen it if I forgot something, or open a new issue.
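
The point about newfile.name can be checked with a short sketch (Python 3, writable temporary directory assumed). The helper `name_attribute_echoes_argument` is hypothetical and is not taken from the rejected patch:

```python
import os
import tempfile


def name_attribute_echoes_argument(basename="plain.txt"):
    """Return True if the file object's .name equals the path given to open()."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fname = os.path.join(tmpdir, basename)
        with open(fname, "w") as newfile:
            # .name is the argument that was passed in; it is not read back
            # from the file system, so it says nothing about what was stored.
            return newfile.name == fname
```

A meaningful test has to read the name back through os.listdir() or a similar call instead.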
msg116348 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-13 20:32
I backported r84701 and r84784 to Python 2.7 (r84787).
msg116354 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-09-13 22:07
There seems to be some confusion about the macpath.py module.  I'm not sure why it even exists in Python 3.  Note it has to do with obsolete Classic MacOS-style paths (colon-separated paths) which are available on Mac OS X only through deprecated Carbon interfaces.  I'm not even sure that those style paths do support unicode.  More importantly, the underlying Carbon interfaces that macpath.py uses were removed for Python 3.  AFAIK, virtually nothing on OS X uses these style paths anymore and, with the removal of all the old Mac Carbon support in Python 3, I don't think there is any Python module that can use these paths other than macpath.  I think this module should be marked for deprecation and removed.  There is no reason to modify it nor add a NEWS note, even for 2.7.
msg116366 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-09-14 05:18
(I've opened Issue9850 to document the brokenness of macpath and suggest its deprecation and removal.)
msg116386 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-14 11:47
> There seems to be some confusion about the macpath.py module. (...)

Oops. I thought that Mac OS X uses macpath, but in fact it is posixpath. Can you try my new patch posixpath_darwin.patch? I reopen the issue because I patched the wrong module. I suppose that Python 2.7 has the same issue: posixpath should be patched, not macpath.

My patch leaves macpath with supports_unicode_filenames=True. If I understood correctly: macpath should be removed (#9850).
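
The kind of platform switch under discussion could be sketched like this. It is a simplified stand-in and not the contents of posixpath_darwin.patch; `flag_for_platform` is a hypothetical helper introduced only so the rule can be exercised on any machine:

```python
import sys


def flag_for_platform(platform):
    """Rule from the discussion: among POSIX platforms, only darwin gets True."""
    return platform == "darwin"


# Module-level flag as a posixpath-like module might define it.
supports_unicode_filenames = flag_for_platform(sys.platform)
```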
msg116429 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-09-15 00:54
No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good.  It will get regression tested on more configurations sometime later.
msg116740 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-17 23:37
> No problems noted with a quick test of posixpath_darwin.patch 
> on 10.6 so looks good.

Ok thanks. Fix committed to 3.2 (r84866) and 2.7 (r84868). I kept my patch on macpath (supports_unicode_filenames=True) because it is still valid (even if it is not used). Or is it wrong that Mac OS 9 speaks unicode?
History
Date User Action Args
2010-09-17 23:37:39 haypo set status: open -> closed; resolution: fixed; messages: + msg116740
2010-09-15 00:54:49 ned.deily set messages: + msg116429
2010-09-14 11:47:37 haypo set status: closed -> open; files: + posixpath_darwin.patch; resolution: fixed -> (no value); messages: + msg116386
2010-09-14 05:18:20 ned.deily set messages: + msg116366
2010-09-13 22:07:22 ned.deily set nosy: + ronaldoussoren, ned.deily; messages: + msg116354
2010-09-13 20:32:37 haypo set messages: + msg116348
2010-09-13 20:26:59 haypo set status: open -> closed; resolution: fixed; messages: + msg116347
2010-09-13 19:42:39 haypo set files: - test_pep277.patch
2010-09-12 16:36:11 loewis set messages: + msg116215
2010-09-12 16:31:34 loewis set messages: + msg116214
2010-09-11 00:37:12 haypo set messages: + msg116069
2010-09-11 00:24:47 haypo set messages: + msg116068
2010-09-11 00:15:46 haypo set messages: + msg116065
2010-09-11 00:10:36 haypo set files: + test_pep277.patch; messages: + msg116064
2010-08-18 17:22:59 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg114252
2010-07-31 23:21:27 eric.araujo set nosy: + haypo
2010-03-15 18:35:38 flox set type: behavior; messages: + msg101132
2010-03-15 18:34:53 flox set files: - issue767645_test_pep277.py
2010-01-28 17:58:27 flox set nosy: loewis, jvr, ezio.melotti, r.david.murray, joe.amenta, flox; versions: + Python 3.1, Python 2.7, Python 3.2; components: + Tests; stage: patch review
2010-01-12 21:35:24 joe.amenta set files: + test_supports_unicode_filenames.patch; nosy: + joe.amenta; messages: + msg97660; keywords: + patch
2010-01-12 21:14:55 flox set files: + issue767645_test_pep277.py; nosy: + flox; messages: + msg97658; resolution: later -> (no value)
2010-01-12 20:16:44 r.david.murray set nosy: + r.david.murray; messages: + msg97655
2010-01-12 19:44:01 ezio.melotti set messages: + msg97652
2010-01-12 19:03:13 brett.cannon set nosy: - brett.cannon
2010-01-12 18:49:31 ezio.melotti set nosy: + ezio.melotti
2008-01-20 19:24:27 christian.heimes set priority: normal -> low; versions: + Python 2.6, - Python 2.3
2003-07-08 09:42:15 jvr create