Issue 683592: unicode support for os.listdir()


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37949

classification

Title:	unicode support for os.listdir()
Type:		Stage:
Components:	Library (Lib)	Versions:

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	loewis	Nosy List:	gvanrossum, jackjansen, jvr, lemburg, loewis, nnorwitz
Priority:	normal	Keywords:	patch

Created on 2003-02-09 21:43 by jvr, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Files
File name	Uploaded	Description
listdir_unicode.patch	jvr, 2003-02-10 10:49	unicode support for os.listdir, take 3
listdir_unicode_arg.patch	jvr, 2003-03-03 14:32	only return unicode if the argument was unicode, + doc

Messages (43)
msg42747 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-09 21:43
The attached patch makes os.listdir() return unicode strings, on platforms that have Py_FileSystemDefaultEncoding defined as non-NULL. I'm by no means sure this is the right thing to do; it does seem right on OSX where Py_FileSystemDefaultEncoding is (or rather: will be real soon, I'm waiting for Jack's approval) utf-8. I'd be happy to add the code in an OSX-specific switch. A more subtle variant could perhaps only return unicode strings if the file name is not ASCII.
msg42748 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-10 01:16
Logged In: YES user_id=6380 At the very least, I'd like it to return Unicode only when the original string isn't just ASCII.
msg42749 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2003-02-10 03:07
Logged In: YES user_id=33168 The code which uses unicode APIs should probably be wrapped with: #ifdef Py_USING_UNICODE /* code */ #endif
msg42750 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 09:12
Logged In: YES user_id=92689 Applied both suggestions. However, I'm not sure if my ASCII test does the right thing, or at least I don't think it does if Py_FileSystemDefaultEncoding is not a superset of ASCII.
msg42751 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 09:24
Logged In: YES user_id=38388 Your test will probably catch most cases, but it could fail for e.g. UTF-16. The only true test would be to first convert to Unicode and then try to convert back to ASCII. If you get an error you can be sure that the text is not ASCII compatible. Given that .listdir() involves lots of IO I think the added performance hit wouldn't be noticeable.
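For illustration, a minimal Python 3 sketch of the round-trip test described in the message above. It is not the C code from the patch; the function name `name_for_listing` and the sample file names are hypothetical.

```python
def name_for_listing(raw: bytes, fs_encoding: str):
    """Hypothetical helper: keep ASCII-only names as bytes, return others as str."""
    text = raw.decode(fs_encoding)   # step 1: convert the raw name to Unicode
    try:
        text.encode("ascii")         # step 2: try to convert back to ASCII
    except UnicodeEncodeError:
        return text                  # not ASCII-compatible: hand back Unicode
    return raw                       # plain ASCII: keep the byte string


ascii_result = name_for_listing(b"readme.txt", "utf-8")
unicode_result = name_for_listing("caf\u00e9.txt".encode("utf-8"), "utf-8")
```

The round trip costs one decode and one encode per name, which is small next to the directory read itself.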
msg42752 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 09:51
Logged In: YES user_id=92689 I don't see how UTF-16 could be a valid value for Py_FileSystemDefaultEncoding, as for most platforms the file name can't contain null bytes. By looking at the NAMELEN() spaghetti, it seems platforms without HAVE_DIRENT_H might still support embedded null bytes. Any wisdom on this?
msg42753 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 10:17
Logged In: YES user_id=38388 The file system does not need to support embedded \0 chars even if it supports UTF-16. It only happens that your test assumes one-byte-per-character encodings, which may not always be true. With UTF-16 your test will see lots of \0 bytes but not necessarily ones which are ord(x)>=128. I'm not sure whether other variable length encodings can result in \0 bytes, e.g. the Asian ones. There's also the possibility of the encoding mapping the ASCII range to other non-ASCII characters, e.g. ShiftJIS does this for the Yen sign. If you absolutely want to use the simple test, I'd at least restrict the test to an ASCII isalnum(x) test and then try the encode/decode method I described if this test fails. Note that isalnum() can be locale dependent on some platforms, so you have to hard-code it.
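The UTF-16 objection can be made concrete with a short Python 3 sketch. The helper `naive_looks_ascii` is a hypothetical stand-in for the simple "every byte is below 128" test, not the code from the patch.

```python
def naive_looks_ascii(raw: bytes) -> bool:
    """Hypothetical stand-in for the simple test: every byte is below 128."""
    return all(byte < 128 for byte in raw)


# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) is not an ASCII character,
# yet its UTF-16 (little-endian) encoding consists of the bytes 0x00 0x01.
non_ascii_char = "\u0100"
utf16_bytes = non_ascii_char.encode("utf-16-le")

passes_naive_test = naive_looks_ascii(utf16_bytes)   # the byte test is fooled
is_really_ascii = ord(non_ascii_char) < 128          # the character is not ASCII

# An ASCII-only name encoded as UTF-16 is full of NUL bytes.
contains_nul = 0 in "notes".encode("utf-16-le")
```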
msg42754 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 10:49
Logged In: YES user_id=92689 Ok, I went for your original suggestion: always convert to unicode and then try to convert to ascii. See new patch. Or should this use the default encoding? Hm.
msg42755 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 10:55
Logged In: YES user_id=38388 Good question. The default encoding would better fit into the concept, I guess. Instead of PyUnicode_AsASCIIString(v) you'd have to use PyUnicode_AsEncodedString(v, NULL, "strict").
msg42756 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 11:08
Logged In: YES user_id=92689 On the other hand, if it's not ASCII, wouldn't a unicode string be more appropriate to begin with? If it's encodable with the default encoding, this will happen as soon as the string is used in a piece of unicode-unaware code, right?
msg42757 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 11:24
Logged In: YES user_id=38388 Right, except that injecting Unicode into Unicode-unaware code can be dangerous (e.g. some code might require a string object to work on). E.g. if someone sets the default encoding to Latin-1 he wouldn't expect os.listdir() to suddenly return Unicode for him. This may be a problem in general for the change to os.listdir(). We'll just have to see what happens during the alpha and beta phases.
msg42758 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 16:24
Logged In: YES user_id=92689 Here's an argument for ASCII and against the default encoding: if the default encoding is different from Py_FileSystemDefaultEncoding, things go wrong: an 8-bit string passed to file() will be interpreted as Py_FileSystemDefaultEncoding (more precisely: will not be interpreted at all), not the default encoding...
msg42759 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 17:46
Logged In: YES user_id=38388 Ok, let's look at it from a different angle: things that you get from os.listdir() should be compatible with (at least) all the os.path tools and os itself. Converting to Unicode has the advantage that slicing and indexing into the path names will not break the paths (unlike UTF-8 encoded 8-bit strings which tend to break when you slice them). That said, I think you're right about the ASCII approach provided that the os, os.path tools can actually properly cope with Unicode. What I worry about is that if os.listdir() gives back Unicode for e.g. Latin-1 filenames and the application then passes the Unicode names to a C API using "s", perfectly working code will break... then again the C code should really use "es" for decoding to the Py_FileSystemDefaultEncoding as is done in e.g. fileobject.c. I really don't know what to do here...
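A small Python 3 sketch of the slicing hazard mentioned above; the file name is an invented example.

```python
name = "caf\u00e9.txt"            # invented example name containing U+00E9
encoded = name.encode("utf-8")    # U+00E9 becomes the two bytes 0xC3 0xA9

text_slice = name[:4]             # slicing text keeps whole characters
byte_slice = encoded[:4]          # slicing bytes cuts U+00E9 in half

try:
    byte_slice.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False        # the truncated sequence is not valid UTF-8
```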
msg42760 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-10 18:17
Logged In: YES user_id=92689 I'm pretty sure os.path deals just fine with unicode strings (it's all pure string manipulations, isn't it?) Worries: well, apparently on Windows os.listdir() has been returning unicode for some time, so it's not like we're breaking completely new ground here. If anything breaks it's probably good this happens, as it gives an opportunity to fix things... I just found several examples of potential breakage: _bsddb.c parses a filename arg with the "z" format specifier. gdbmmodule.c uses "s". bsddbmodule.c and dbmmodule.c as well. I'm not sure the above modules work on Windows with non-ascii filenames at all, but it doesn't look like it. Besides Windows (for which my patch is not relevant), only OSX sets Py_FileSystemDefaultEncoding, so any new breakage won't reach a mass market right away <wink>.
msg42761 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-25 15:55
Logged In: YES user_id=92689 Having missed 2.3a2, I'd like to get this in way ahead of 2.3b1. Any objections?
msg42762 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-25 17:22
Logged In: YES user_id=6380 OK, check it in, just be prepared for contingencies. I really cannot judge whether this is right on all platforms.
msg42763 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-02-25 21:52
Logged In: YES user_id=92689 Checked in as rev. 2.287 of Modules/posixmodule.c. Leaving this item open for now, in case MvL has comments when he gets back.
msg42764 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2003-03-03 11:23
Logged In: YES user_id=45365 I think this patch does more bad than good. A practical problem is that os.path.walk doesn't work anymore if there are non-ascii directories in the directory tree (os.listdir will return these as unicode names, but doesn't accept unicode on input). See bug #696261. An additional problem is that various other methods in posix don't do the unicode conversion, so for instance os.getcwd() will return 8-bit strings in Py_FileSystemDefaultEncoding which are incompatible with the unicode returned by listdir. My preferred solution would be to do the unicode trick everywhere. Second best would be to retract the whole thing and think about it a bit more for Python 2.4.
msg42765 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-03 11:36
Logged In: YES user_id=21627 I dislike this change, as it introduces inconsistency across platforms. On Win32, as a result of PEP 277, Unicode file names are only returned for Unicode directory names. There was an explicit discussion about this aspect of PEP 277, and this interface was accepted as The Right Thing. So I think Unix should follow here: return byte string file names for byte string directory names, and Unicode file names for Unicode directory names. Support for Unicode directory names should also invoke the file system encoding for the directory name. I'm also unsure about the exception handling. If there is a file name that doesn't decode according to the file system encoding, it raises the Unicode error. This means that all other file names are lost. This might be acceptable if the Unicode-in-Unicode-out strategy is used; in its current form, the change can and will break existing applications (which find all kinds of funny byte sequences on disk that don't work with the user's file system encoding).
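The "same type out as in" interface argued for above is the one os.listdir() still follows in Python 3, where the two types are `str` and `bytes`. A minimal sketch, assuming Python 3; the helper `listing_types` and the file name `example.txt` are hypothetical:

```python
import os
import tempfile


def listing_types(directory: str):
    """Hypothetical helper: list a directory via a str path and via a bytes path."""
    from_text = os.listdir(directory)                # text argument
    from_bytes = os.listdir(os.fsencode(directory))  # byte-string argument
    return {type(n) for n in from_text}, {type(n) for n in from_bytes}


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that both listings are non-empty.
    open(os.path.join(tmp, "example.txt"), "w").close()
    str_types, bytes_types = listing_types(tmp)
```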
msg42766 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 12:22
Logged In: YES user_id=92689 Jack, as noted on bug #696261, the bug is that os.listdir() doesn't do the right thing with a Unicode string argument (it should use Py_FileSystemDefaultEncoding but it doesn't; I'm working on it). Martin: I now see that PEP 277 says "Under this proposal, [os.listdir] will return a list of Unicode strings when its path argument is Unicode". I don't like this much (I really think we should push Unicode a little harder onto the users), but I'll look into changing the unix end of os.listdir() to do the same. I'll also review your exception comment.
msg42767 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 13:02
Logged In: YES user_id=92689 I've attached a patch that fixes the bug as well as addresses the unicode arg vs. return value inconsistency that Martin noted. The exception behavior has not yet been changed.
msg42768 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-03 13:11
Logged In: YES user_id=21627 Looks good, but incomplete: If the argument is Unicode, all results should be Unicode. There should also be documentation changes.
msg42769 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 14:32
Logged In: YES user_id=92689 Ok, done, including a minor patch to Doc/lib/libos.tex. I also adapted the Misc/NEWS items. I'm not sure how to change the os.listdir() doco to better reflect the actual situation without mentioning Py_FileSystemDefaultEncoding...
msg42770 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-03 15:48
Logged In: YES user_id=21627 I see. The right thing, IMO, is to always return Unicode objects for Unicode arguments, just the same way the "et" parser works: if the file system encoding is NULL, fall back to the system default encoding. Then, you can generalize the docs to [NT and Unix] (with OS X being a flavour of Unix), or drop the OS reference completely (in which case the other os modules are effectively buggy). There might be a function already to fall back to the system default encoding; perhaps just passing NULL works. There should be a documentation section on Unicode file names; I volunteer to write it (Summary: NT+ uses Unicode natively, W9x uses "mbcs", OS X uses UTF-8, which equates to "Unicode natively", Unices with nl_langinfo(CODESET) use that, all others use the system default encoding).
msg42771 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 16:08
Logged In: YES user_id=92689 I think this could be achieved by removing the "Py_FileSystemDefaultEncoding != NULL" part of the condition on line 1805, as indeed passing NULL as the encoding to PyUnicode_FromEncodedObject causes the default encoding to be used. Shall I check it in like that? I'm not quite happy with the fact that exceptions are silently dropped: should a warning be issued instead? Especially when using the default encoding, exceptions are not unlikely I suppose.
msg42772 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-03 16:39
Logged In: YES user_id=21627 Clearing the error is bad, I agree. I see two options: reraise the exception, deleting the result obtained so far (i.e. as the code did that the latest patch removes), OR add a byte string instead of the Unicode string into the result. Even though I have proposed the latter in the past, I could also accept the former; applications that anticipate that exception then just need to re-invoke listdir with a byte string, and deal with the result themselves. With these changes, the patch is fine with me.
msg42773 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 17:45
Logged In: YES user_id=92689 Applied to CVS as: Modules/posixmodule.c: 2.288 Doc/lib/libos.tex: 1.115 Misc/NEWS: 1.687 Unicode errors are propagated as in the original version of the patch, libos.tex mentions Win NT/2k/XP and Unix.
msg42774 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-03 17:56
Logged In: YES user_id=92689 Martin, assigning this item to you. Please close it if you deem the changes in CVS correct.
msg42775 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-04 06:49
Logged In: YES user_id=21627 The current code looks fine to me. Closing this patch.
msg42776 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-03-04 14:01
Logged In: YES user_id=6380 I haven't seen the code, but I have a complaint. On Linux, when I have a file named '\xff' (i.e. its name is the single byte with value 255), os.listdir(u'.') gives me a UnicodeDecodeError. Is that really progress?
msg42777 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-04 14:31
Logged In: YES user_id=92689 Would you prefer the error be silenced and a byte string be used instead? If so, should there be a warning?
msg42778 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-04 14:40
Logged In: YES user_id=21627 Guido's scenario was precisely the reason why Unix was left out from consideration for PEP 277. However, it is better than it sounds: There is a good chance that invoking locale.setlocale(locale.LC_CTYPE, "") prior to invoking listdir will overcome the problem, as the setlocale call will set the file system encoding to the user's preference. If \xff is a valid file name in the user's preferred encoding, then listdir will succeed in converting this file name to a Unicode string. It might be useful to set the file system encoding on Unix to the user's preferred encoding unconditionally (i.e. not as a side effect of invoking setlocale). It might also be useful to expose the file system encoding read-only for inspection.
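Both ideas in the message above can be tried from Python. The sketch below assumes Python 3, where the file system encoding is readable through `sys.getfilesystemencoding()`; whether a setlocale call changes that value depends on the Python version and platform, so the sketch only inspects the values and makes no claim about what they will be.

```python
import locale
import sys

try:
    # Adopt the user's preferred locale for character handling.
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    # The environment names a locale this system does not support;
    # keep whatever locale is already in effect.
    pass

fs_encoding = sys.getfilesystemencoding()               # read-only inspection
preferred_encoding = locale.getpreferredencoding(False)
```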
msg42779 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-04 14:51
Logged In: YES user_id=92689 It would seem that even with a user's locale there's a chance os.listdir() fails when passed a unicode argument. I'm not sure it's reasonable for os.listdir() to fail at all (if the directory to be listed exists and we have the right permissions). If it's all too difficult to get right, I'm happy to put the listdir unicode support in a MacOSX switch. I know nothing about locales so I'm really not in a position to straighten this out. All I know is that if Py_FileSystemDefaultEncoding is known to be utf-8, it's just dumb _not_ to return unicode. You guys figure out the rest.
msg42780 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-03-04 14:54
Logged In: YES user_id=6380 The setlocale call indeed works. I think I'd be happier if this was set by default, but I don't know what other consequences there would be.
msg42781 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-03-04 15:03
Logged In: YES user_id=6380 Maybe the filesystem default encoding should be set to Latin-1 by default (when nothing better is known about it)? Then it's hard to imagine how the conversion could fail, since every Latin-1 byte maps 1-1 to the corresponding Unicode code point.
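The Latin-1 property described above can be checked directly, and contrasted with the '\xff' file name from the earlier complaint. A Python 3 sketch; the helper `decodes` is hypothetical.

```python
def decodes(raw: bytes, encoding: str) -> bool:
    """Hypothetical helper: report whether raw can be decoded with encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")

# Every Latin-1 byte maps to the code point with the same number.
one_to_one = [ord(ch) for ch in latin1_text] == list(range(256))

latin1_ok = decodes(b"\xff", "latin-1")   # always succeeds
utf8_ok = decodes(b"\xff", "utf-8")       # 0xFF never occurs in valid UTF-8
ascii_ok = decodes(b"\xff", "ascii")      # outside the 7-bit range
```

Decoding that cannot fail is not the same as decoding that is right: a name stored in another encoding still decodes under Latin-1, only to the wrong characters.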
msg42782 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-04 15:07
Logged In: YES user_id=92689 I think it would be better to simply return byte strings if the file system encoding isn't known. (This btw. was what my original patch did.)
msg42783 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-04 15:11
Logged In: YES user_id=21627 I disagree with the last assertion: In particular if the file system encoding is UTF-8, there is a good chance that decoding will fail (unlike if it is latin-1; decoding will then never fail - it may just produce mojibake). OS X seems to make a guarantee to always return UTF-8 from its low-level API, but I distrust this guarantee until I see it with my own eyes :-) E.g. what happens if you mount an NFS tree, and the NFS server gives file names in some other encoding? I see the following options:
- only enable the code for OS X. I dislike this option, as it essentially freezes the Unix status to non-Unicode (we won't get further insights, the de jure status won't change, de facto, all files will be encoded in the locale's encoding).
- leave the code as-is, documenting the possibility of exceptions.
- add byte strings instead of Unicode strings into the result for non-decodable strings. This gives a mixed-type result, which is fine if you only pass the resulting file names to stat() or open(), and will likely break the application if it tries to display the file names somehow.
msg42784 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-04 15:15
Logged In: YES user_id=21627 Setting the file system encoding on startup should be fine, except that we need another setlocale/query/restore locale sequence. This is, in principle, bad, as there is no guarantee that the restore locale operation really produces the original state, and may cause problems if other threads are already running. In practice, it appears to work out just fine, as we use such sequences already (e.g. to undo the readline initialization).
msg42785 - (view)	Author: Jack Jansen (jackjansen) *	Date: 2003-03-04 15:44
Logged In: YES user_id=45365 I just did a test (created 254 files with all bytes except / and null in their names on a linux server, mounted the partition over NFS on MacOSX) and indeed MacOSX tries to interpret the bytes as UTF-8 and fails. I know that conversion works for HFS and HFS+ volumes (which carry a filename encoding with them, or you have to specify it when mounting). I assume it works for AFP and SMB (which also carries encoding info, IIRC) but I can't test this. I haven't a clue about webdav and such. Something to keep in mind is that we are really trying to solve someone else's problem: the inability of NFS and most unixen to handle file system encodings. If I'm on a latin-1 machine and I nfs-mount your latin-2 partition I will see garbage filenames.
msg42786 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-04 15:50
Logged In: YES user_id=92689 Here's a note about file system encodings on OSX, including a few words about NFS: http://developer.apple.com/qa/qa2001/qa1173.html. I propose to fall back to a byte string if conversion to unicode fails.
msg42787 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-04 16:00
Logged In: YES user_id=21627 I only partially agree that this is somebody else's problem: On Unix, it is always considered application responsibility to interpret file names as characters if they need to - hence the lack of a system-provided encoding strategy. So it is the problem of Python or the Python application, and I think we should try to shield the application from these issues as well as we can. Therefore, I'm in favour of jvr's latest proposal (use byte strings as the last resort), hoping that the error case will be infrequent.
msg42788 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-03-04 16:26
Logged In: YES user_id=6380 On the one hand a user who isn't interested in encodings shouldn't be passing a Unicode argument. On the other hand, Unicode strings have a way of sneaking into your application when you least suspect them. E.g. Tkinter returns them, so does IDLE, and I see them used more and more in Zope 3. FWIW, I like Just's "fall back to bytestrings" approach.
msg42789 - (view)	Author: Just van Rossum (jvr) *	Date: 2003-03-04 19:43
Logged In: YES user_id=92689 I've committed the "fallback-to-byte-strings" behavior. It's in posixmodule.c rev. 2.290.
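A rough Python 3 model of the "fallback-to-byte-strings" behavior described above. It is only a sketch of the idea, not the C code committed to posixmodule.c; the function name `decode_names` and the sample names are hypothetical.

```python
def decode_names(raw_names, fs_encoding):
    """Hypothetical model: decode each name, keeping undecodable ones as bytes."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(fs_encoding))
        except UnicodeDecodeError:
            result.append(raw)   # last resort: pass the byte string through
    return result


mixed = decode_names([b"ok.txt", "caf\u00e9".encode("utf-8"), b"\xff"], "utf-8")
```

The result is a mixed-type list, which is the trade-off accepted in the discussion: no name is lost and the listing never raises a decoding error, but callers that display names have to handle both types.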

History
Date	User	Action	Args
2022-04-10 16:06:41	admin	set	github: 37949
2003-02-09 21:43:09	jvr	create