classification
Title: os.listdir can return byte strings
Type: behavior Stage:
Components: Documentation Versions: Python 3.0
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: georg.brandl Nosy List: HWJ, amaury.forgeotdarc, bboissin, benjamin.peterson, djc, dlitz, draghuram, georg.brandl, gvanrossum, haypo, loewis, pitrou, zegreek
Priority: critical Keywords: patch

Created on 2008-06-24 10:28 by HWJ, last changed 2008-10-07 07:12 by loewis. This issue is now closed.

Files
File name Uploaded Description Edit
posix_path_bytes.patch haypo, 2008-08-22 15:19 Patch posixpath.join() to support bytes
io_byte_filename.patch haypo, 2008-08-22 15:33 open() allows bytes filename
fnmatch_bytes.patch haypo, 2008-08-22 16:53 Patch fnmatch.filter() to accept bytes filenames
glob1_bytes.patch haypo, 2008-08-22 16:54 Fix glob.glob() to accept invalid directory name
listdir_encoding_warning.patch benjamin.peterson, 2008-09-21 23:48
warn_at_the_end.patch benjamin.peterson, 2008-09-21 23:48
raise_decoding_errors.patch benjamin.peterson, 2008-09-23 23:05
force_unicode.patch benjamin.peterson, 2008-09-27 00:48
getcwd_bytes.patch haypo, 2008-09-27 14:50 getcwd() returns bytes if unicode conversion fails
merge_os_getcwd_getcwdu.patch haypo, 2008-09-29 12:51 Remove os.getcwdu(); os.getcwd(bytes=True) returns bytes
os_getcwdb.patch haypo, 2008-09-29 15:58 Fix getcwd() (use PyUnicode_Decode) and create getcwdb()->bytes
python3_bytes_filename.patch haypo, 2008-09-30 01:08 Patch for an initial support of bytes filename in Python3
setfsenc.diff loewis, 2008-09-30 15:21
python3_bytes_filename-3.patch haypo, 2008-10-02 17:10 Patch for an initial support of bytes filename in Python3 (version 3)
win32-bytes-filenames.patch amaury.forgeotdarc, 2008-10-03 01:45
macpath.patch amaury.forgeotdarc, 2008-10-03 21:21
library_os_doc.patch haypo, 2008-10-06 23:34
Messages (79)
msg68674 - (view) Author: Helmut Jarausch (HWJ) Date: 2008-06-24 10:28
The script below produces 1664 lines of output before it bails out with
Traceback (most recent call last):
  File "WalkBug.py", line 5, in <module>
for Dir, SubDirs, Files in os.walk('/home/jarausch') :
  File "/usr/local/lib/python3.0/os.py", line 278, in walk
for x in walk(path, topdown, onerror, followlinks):
  File "/usr/local/lib/python3.0/os.py", line 268, in walk
if isdir(join(top, name)):
  File "/usr/local/lib/python3.0/posixpath.py", line 64, in join
if b.startswith('/'):
TypeError: expected an object with the buffer interface

=========================
file  WalkBug.py:

#!/usr/local/bin/python3.0

import os

for Dir, SubDirs, Files in os.walk('/home/jarausch') :
  print("processing {0:d} files in {1}".format(len(Files),Dir))
msg68679 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-24 11:45
Could you tell us what this 1665th line should be?
Maybe the 1665th directory has something special (a filename with spaces
or non-ascii chars...)

Can you try with an older version of python?
msg68684 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-24 13:02
It's failing because he's giving a string to bytes.startswith when it
requires a byte string or such.
msg68685 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-24 13:12
"he's giving a string"... the user simply called os.walk, which accepts
strings AFAIK.

We should discover what produced this bytestring. Does listdir() returns
a mixed list of strings and bytes?
msg68686 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-24 13:21
It seems the conversion to unicode strings (PyUnicode vs PyBytes) was
not complete in os.listdir. See the attached patch.
msg68688 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-24 13:37
The original problem seems to come from some Unix platform, but this
patch only handles two cases:
- on win32, when the argument is a bytestring.
- on OS/2.
And in both cases, the default (utf-8) conversion seems wrong. Something
like cp1252 (the ANSI code page for Western Windows) would be more sensible.

In the posix part of the function, there is the comment (2003-03-04):
    /* fall back to the original byte string, as
       discussed in patch #683592 */
btw, I find the penultimate message of this other thread very pleasant,
in the py3k context... I suppose the conclusions would not be the same
today.
msg68689 - (view) Author: Helmut Jarausch (HWJ) Date: 2008-06-24 14:09
>> Could you tell us what this 1665th line should be?
>> Maybe the 1665th directory has something special (a filename with >> 
>> spaces or non-ascii chars...)

Yes, the next directory contains a filename with an iso-latin1 but non-
ascii character

>> Can you try with an older version of python?
No problems - runs every night here

The patch (applied to SVN GMT 13:30) does NOT help.
msg70943 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 17:35
Hmm, I suppose that while the filename is latin1-encoded,
Py_FileSystemDefaultEncoding is "utf-8" and therefore os.listdir fails
decoding the filename and falls back on returning a byte string.
It was acceptable in Python 2.x but is a very annoying problem in py3k
now that unicode and bytes objects can't be mixed together anymore. I'm
bumping this to critical, although there is probably no clean solution.
msg70953 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-08-10 00:17
Let's make this a release blocker for RCs.
msg71525 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-20 09:37
See #3616 for a consequence of this.
msg71612 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-21 08:20
If the filename can not be encoded correctly in the system charset, 
it's not really a problem. The goal is to be able to use open(), 
shutil.copyfile(), os.unlink(), etc. with the given filename.

orig = filename from the kernel (bytes)
filename = filename from listdir() (str)
dest = filename to the kernel (bytes)

The goal is to get orig == dest. In my program Hachoir, to workaround 
this problem I store the original filename (bytes) and convert it to 
unicode with characters replacements (eg. replace invalid byte 
sequence by "?"). So the bytes string is used for open(), 
unlink(), ... and the unicode string is displayed to stdout for the 
user.

IMHO, the best solution is to create such class:

class Filename:
    def __init__(self, orig):
        self.as_bytes = orig
        self.as_str = myformat(orig)
    def __str__(self):
        return self.as_str
    def __bytes__(self):
        return self.as_bytes

New problems: I guess that functions operating on filenames 
(os.path.*) will have to support this new type (Filename class).
msg71615 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-21 08:51
Selon STINNER Victor <report@bugs.python.org>:
> IMHO, the best solution is to create such class:
>
> class Filename:
>     def __init__(self, orig):
>         self.as_bytes = orig
>         self.as_str = myformat(orig)
>     def __str__(self):
>         return self.as_str
>     def __bytes__(self):
>         return self.as_bytes

I agree that logically it's the right solution. It's also the most invasive. If
that class is made a subclass of str, however, existing code shouldn't break
more than it currently does.
msg71624 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-21 10:59
I wrote a Filename class. I tries different methods:
 * no parent class "class Filename: ..." -> I don't know how to make 
bytes(filename) works!? But it's the best option to avoid strange bugs 
(mix bytes/str, remember Python 2.x...)
 * str parent class "class Filename(str): ..." -> doesn't work because 
os functions uses the fake unicode filename before testing the bytes 
(real) filename
 * bytes parent class "class Filename(bytes): ..." -> that's the 
current implementation

The idea is to encode str -> bytes (and not bytes -> str because we 
want to avoid problems with such conversions). So I reimplemented most 
bytes methods: __addr__, __raddr__, __contains__, startswith, endswith 
and index. index method has no start/end arguments since the behaviour 
would be different than a real unicode string :-/

I added an example of fixed os.listdir(): create Filename() object if 
we get bytes. Should we always create Filename objects? I don't think 
so.
msg71629 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-21 12:55
>  * bytes parent class "class Filename(bytes): ..." -> that's the
> current implementation

I don't think that makes sense (especially under Windows which has Unicode file
APIs). os.listdir() and friends should really return str or str-like objects,
not bytes-like objects with an additional __str__ method.

>  * str parent class "class Filename(str): ..." -> doesn't work because
> os functions uses the fake unicode filename before testing the bytes
> (real) filename

Well, of course, if we create a filename type, then all os functions must be
adapted to accept it rather than assume str.

All this is highly speculative of course, and if we really follow this course
(i.e. create a filename type) it should probably be postponed to 3.1: too many
changes with far-reaching consequences.
msg71647 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-21 14:59
Le Thursday 21 August 2008 14:55:43 Antoine Pitrou, vous avez écrit :
> >  * bytes parent class "class Filename(bytes): ..." -> that's the
> > current implementation
>
> I don't think that makes sense (especially under Windows which has Unicode
> file APIs). os.listdir() and friends should really return str or str-like
> objects, not bytes-like objects with an additional __str__ method.

In we use "class Filename(str): ...", we have to ensure that all operations 
takes care of the charset because the unicode version is invalid and not be 
used to access to the file system. Dummy example: Filename()+"/" should not 
return str but raise an error or create a new filename.

> Well, of course, if we create a filename type, then all os functions must
> be adapted to accept it rather than assume str.

If Filename has no parent class but is convertible to bytes(), os functions 
requires no change and so we can fix it before final 3.0 ;-)
msg71648 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-21 15:09
> If Filename has no parent class but is convertible to bytes(), os
> functions requires no change and so we can fix it before final 3.0 ;-)

This sounds highly optimistic.

Also, I think it's wrong to introduce a string-like class with implicit
conversion both to bytes and to str, while we have taken all measures to
make sure that bytes/str exchangeability doesn't exist any more in py3k.
msg71655 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-21 16:17
The proper work-around is for the app to pass bytes into os.listdir();
then it will return bytes.  It would be nice if open() etc. accepted
bytes (as well as strings of course), at least on Unix, but not
absolutely necessary -- the app could also just know the right encoding.

I see two reasonable alternatives for what os.listdir() should return
when the input is a string and one of the filenames can't be decoded:
either omit it from the output list; or use errors='replace' in the
encoding.  Failing the entire os.listdir() call is not acceptable, and
neither is returning a mixture of str and bytes instances.
msg71680 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-21 20:55
Le Thursday 21 August 2008 18:17:47 Guido van Rossum, vous avez écrit :
> The proper work-around is for the app to pass bytes into os.listdir();
> then it will return bytes.

In my case, I just would like to remove a directory with shutil.rmtree(). I 
don't know if it contains bytes or characters filenames :-)

> It would be nice if open() etc. accepted 
> bytes (as well as strings of course), at least on Unix, but not
> absolutely necessary -- the app could also just know the right encoding.

An invalid filename has no charset. It's just a "raw" byte string. So open(), 
unlink(), etc. have to accept byte string. Maybe not in the Python version 
with in low level (C version)?

> I see two reasonable alternatives for what os.listdir() should return
> when the input is a string and one of the filenames can't be decoded:
> either omit it from the output list;

It's not a good option: rmtree() will fails because the directory in not 
empty :-/

> or use errors='replace' in the encoding.

It will also fails because filenames will be invalid (valid unicode string but 
non existent file names :-/).

> Failing the entire os.listdir() call is not acceptable, and 
> neither is returning a mixture of str and bytes instances.

Ok, I have another suggestion:
 - *by default*, listdir() only returns str and raise an error (TypeError?) 
   on invalid filename
 - add an optional argument (a callback), eg. "fallback_encoder", to catch
   such errors (similar to "onerror" from shutils.rmtree())

Example of new listdir implementation (pseudo-code):

   charset = sys.getfilesystemcharset()
   dirobj = opendir(path)
   try:
      for bytesname in readdir(dirobj):
          try:
              name = str(bytesname, charset)
          exept UnicodeDecodeError:
              name = fallback_encoder(bytesname)
          yield name
   finally:
      closedir(dirobj)

The default fallback_encoder:

   def fallback_encoder(name):
      raise

Keep raw bytes string:

   def fallback_encoder(name):
      return name

Create my custom filename object:

   class Filename:
      ...

   def fallback_encoder(name):
      return Filename(name)

If a callback is overkill, we can just add an option, 
eg. "keep_invalid_filename=True", to ask listdir() to keep bytes string if 
the conversion to unicode fails.

In any case, open(), unlink(), etc. have to accept byte string to be accept to 
read, copy, remove invalid filenames. In a perfect world, all filenames would 
be valid UTF-8 strings, but in the real world (think to Matrix :-)), we have 
to support such strange cases...
msg71699 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-21 23:31
So shutil should be fixed to pass a bytes value to os.listdir().  But
then os.remove() should be fixed to accept bytes as well.  This is the
crux I believe: on Unix at least, syscall wrappers should accept bytes
for filenames.  And this would then have to be extended to things like
the functions in os.path, and we'd need bytes versions of os.sep and
os.altsep...  This sounds like a good project for 3.1.

I do not accept an os.listdir() that raises an error because one
filename cannot be decoded.  It sounds like using errors='replace' is
also wrong -- so the only solution is for os.listdir() to skip files it
cannot decode.  While this doesn't help for rmtree(), it is better than
errors='replace' for code that descends into the tree looking for files
matching a pattern or other property.  So I propose this as a patch for 3.0.

The callback variant is too complex; you could write it yourself by
using os.listdir() with a bytes argument.  This also applies to
proposals like passing optional encoding and errors arguments to
os.listdir().
msg71700 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-08-21 23:35
On Thu, Aug 21, 2008 at 6:31 PM, Guido van Rossum
<report@bugs.python.org> wrote:
>
> Guido van Rossum <guido@python.org> added the comment:
>
> So shutil should be fixed to pass a bytes value to os.listdir().  But
> then os.remove() should be fixed to accept bytes as well.  This is the
> crux I believe: on Unix at least, syscall wrappers should accept bytes
> for filenames.  And this would then have to be extended to things like
> the functions in os.path, and we'd need bytes versions of os.sep and
> os.altsep...  This sounds like a good project for 3.1.
>
> I do not accept an os.listdir() that raises an error because one
> filename cannot be decoded.  It sounds like using errors='replace' is
> also wrong -- so the only solution is for os.listdir() to skip files it
> cannot decode.  While this doesn't help for rmtree(), it is better than
> errors='replace' for code that descends into the tree looking for files
> matching a pattern or other property.  So I propose this as a patch for 3.0.

As much as this maybe the right idea, I don't like the idea of
silently losing the contents of a directory. That's asking for
difficult to discover bugs. Could Python emit a warning in this case?
>
> The callback variant is too complex; you could write it yourself by
> using os.listdir() with a bytes argument.  This also applies to
> proposals like passing optional encoding and errors arguments to
> os.listdir().
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue3187>
> _______________________________________
>
msg71705 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-21 23:59
>> I do not accept an os.listdir() that raises an error because one
>> filename cannot be decoded.  It sounds like using errors='replace' is
>> also wrong -- so the only solution is for os.listdir() to skip files it
>> cannot decode.  While this doesn't help for rmtree(), it is better than
>> errors='replace' for code that descends into the tree looking for files
>> matching a pattern or other property.  So I propose this as a patch for 3.0.
>
> As much as this maybe the right idea, I don't like the idea of
> silently losing the contents of a directory. That's asking for
> difficult to discover bugs.

Well, the other approaches also cause difficult to discover bugs (the
original bug report here was an example :-).

> Could Python emit a warning in this case?

This may be the best compromise yet. It would have to use the warnings
module so that you could disable it.
msg71748 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-22 14:21
I implemented the "invalid filename" class feature:
 - by default, os.listdir() raise an error (UnicodeDecodeError) on 
invalid filename. The previous behaviour was to return bytes object 
instead of str.
 - if invalid_filename=True: create an InvalidFilename class instance

InvalidFilename is not a bytes string, it's not a str string, it's a 
new class. It has three attributes:
 - bytes: the real filename
 - charset: charset (type str)
 - str: fake filename (type str) used by __str__() method

My patch also fixes os.path.join() to accept InvalidFilename: if at 
last one argument is an InvalidFilename, use InvalidFilename.join() 
(class method).

os.listdir() and os.unlink() are patched to accept InvalidFilename. 
unlink always accept InvalidFilename whereas listdir() only produces 
InvalidFilename is os.listdir(path, invalid_filename=True) is used.

I added an optional argument "invalid_filename" to shutil.rmtree(), 
default value is *True*.

To sum up, visible changes:
 - os.listdir() raise an error on invalid filename instead of return a 
mixed list of str and bytes
 - shutil.rmtree() manipulate str and InvalidFilename instead of str 
and bytes
msg71749 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-22 14:53
I'm not interested in the InvalidFilename class; it's an API
complification that might seem right for your situation but will hinder
most other people.  However I *am* interested in a patch that makes
os.unlink() (and as many other functions as you can think of) accept
bytes.  You'll have to think what encoding to use on Windows though,
since (AFAIK) the Windows filesystem APIs *do* use Unicode.
msg71751 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-22 15:19
@gvanrossum: os.unlink() and os.lstat() already accept byte filenames 
(but open() doesn't).

Ok, here is very small patch for posixpath.join() to accept bytes 
strings. This patch is enough to fix my initial problem (#3616).
msg71752 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-22 15:33
My last patch (posix_join_bytes.patch) is also enough to fix the 
initial reported problem: error in posixpath.join() called by 
os.walk(). I tried os.walk() on a directory with invalid filenames and 
invalid directory name and it works well.

So the last bug is open() which disallow opening a file with an 
invalid name. So here is another patch for that.
msg71756 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-22 16:53
To continue in the "accept bytes filenames" way, a new patch for 
fnmatch.filter(). Use sys.getfilesystemencoding() to convert the 
bytes. The patch contains a new unit test.
msg71757 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-08-22 16:54
Patch glob.glob() to accept directory with invalid filename (invalid 
in the filesystem charset): just ignore bytes => str conversion error.
msg71769 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-22 19:05
See http://codereview.appspot.com/3055 for a code review of Victor's
latest patches.
msg71991 - (view) Author: Dwayne Litzenberger (dlitz) Date: 2008-08-26 18:15
I think Guido already understands this, but I haven't seen it stated
very clearly here:

** Different systems use different "things" to identify files. **

On Linux/ext3, all filenames are *octet strings* (i.e. bytes), and
*only* the following caveats apply:
- a filename/pathname cannot contain the zero-octet (b"\x00").
- a filename/pathname cannot be empty.
- a filename cannot contain the slash (b"/"); In a pathname, the slash
is used to separate filenames.
- the filenames b"." and b".." have special meanings; They cannot be
created, deleted, or renamed.

All filenames that meet these criteria are valid, and calling them
"invalid" amounts to plugging one's ears and shouting "LA LA LA" while
imagining Unicode having pre-dated Unix.

It is sometimes convenient to imagine filenames on Linux/ext3 as
sequences of Unicode code points (where the encoding is specified by
LC_CTYPE---it's not necessarily UTF-8), but other times (e.g. in backup
tools that need to be robust in the face of mischievous users) it is an
unnecessary abstraction that introduces bugs.

On Windows/NTFS, the situation is entirely different: Filenames are
actually sequences of Unicode code points, and if you pretend they are
octet strings, Windows will happily invent phantom filenames for you
that will show up in the output of os.listdir(), but that will return
"File not found" if you try to open them for reading (if you open them
for writing, you risk clobbering other files that happens to have the
same names).

To avoid bugs, it should be possible to work exclusively with filenames
in the platform's native representation.  It was possible in Python 2
(though you had to be very careful).  Ideally, Python 3 would recognize
and enforce the difference instead of trying to guess the translations;
"Explicit is better than implicit" and all that.
msg72495 - (view) Author: Baptiste Carvello (zegreek) Date: 2008-09-04 12:03
If, as I understand, it is the application's job to call listdir with
bytes or unicode depending on the platform, it might be useful to have a
function in the os module telling whether the filesystem is bytes of
unicode-native. 

That way, the idiom for doing something to all files in a given
directory would read:

>>> directory="my_direcory"
... if not os.is_filesystem_unicode():
...     directory=directory.encode(sys.stdin.encoding)
... for name in os.listdir(directory):
...     f=open(name)
... # do something
...     f.close()

and it would work on all platforms, even if one of the filenames is not
in the locale's normal encoding.

If this idiom is correct, it could also be useful to include it in the
module documentation so that application authors know what they are
supposed to do.
msg73362 - (view) Author: Helmut Jarausch (HWJ) Date: 2008-09-18 07:39
Hi,
is this assumed to be fixed in 3.0rc1 ?

with SVN 66506  (3.0rc1+) 
for dirname, subdirs, files in os.walk(bytes(Top,'iso-8859-1')) :

still gives an error here:

    for dirname, subdirs, files in os.walk(bytes(Top,'iso-8859-1')) :
  File "/usr/local/lib/python3.0/os.py", line 268, in walk
    if isdir(join(top, name)):
  File "/usr/local/lib/python3.0/posixpath.py", line 64, in join
    if b.startswith('/'):
TypeError: expected an object with the buffer interface
msg73534 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-21 21:44
Here's a potential patch for listdir. It emits a UnicodeWarning (or
should that be a BytesWarning?) and skips the file when decoding fails.
What would be the best way to test this?
msg73535 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-21 22:05
I did not test the patch, but I have some remarks about it:
- %r does not seem to be handled by PyUnicode_FromFormat; %R maybe?
- In this case, PyObject_Repr(v) is not necessary - and this will avoid 
a reference leak.
- Does the warning warn multiple times? IIRC the default behaviour is to 
warn once.
msg73540 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-21 23:48
Here's two more patches. One is like the old one with Amaury's comments
observed. The other simply notes if there were decoding problems and
warns once at the end of the listdir call.

Making a warning happen more than once is tricky because it requires
messing with the warnings filter. This of course takes away some of the
user's control which is one of the main reasons for using the Python
warning system in the first place.

(I almost wish we could write another listdir that returned the names it
could decode and a list of those it couldn't.)
msg73678 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-23 23:05
Here's another patch. It simply propagates the UnicodeDecodeErrors. I
like this because it avoids silent ignoring problem, and people can get
bytes if they want by passing in a bytes path.
msg73680 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-24 00:01
Hmm... much of the os.path machinery (and os.walk) probably doesn't work
with bytes, and neither do fnmatch.py and glob.py, I expect.  Plus
io.open() refuses bytes for the filename, even though _fileio accepts
them.  The latter should be fixed regardless, and one of the attachments
here has a fix IIRC.

Gotta run, sorry.
msg73688 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-24 00:41
Guido compiled my patches here: http://codereview.appspot.com/3055

My patches allows bytes for fnmatch.filter(), glob.glob1(), 
os.path.join() and open().
msg73909 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-27 00:47
Ok. Here's another possibility. It adds another optional parameter to
listdir. If False, bytes strings can be returned. Otherwise, the
UnicodeDecodeError is reraised.
msg73910 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-27 01:15
On Fri, Sep 26, 2008 at 5:47 PM, Benjamin Peterson
<report@bugs.python.org> wrote:
> Ok. Here's another possibility. It adds another optional parameter to
> listdir. If False, bytes strings can be returned. Otherwise, the
> UnicodeDecodeError is reraised.

I don't see the advantage over the existing rule bytes in -> bytes out...
msg73911 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-27 01:23
Does that mean that the right thing to do is raise decoding errors when
unicode is given and fix the path modules so they can use bytes?
msg73925 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-27 14:50
getcwd() fails with "NOT FOUNT" (not foun*d*?) if the current 
directory filename can't be converted to unicode (str type). Here is a 
patch to fallback to bytes if creation of the unicode failed.
msg73926 - (view) Author: Dwayne Litzenberger (dlitz) Date: 2008-09-27 15:12
On Sat, Sep 27, 2008 at 01:15:46AM +0000, Guido van Rossum wrote:
> I don't see the advantage over the existing rule bytes in -> bytes out...

Guido,

I figure I should say something since I have some experience in this area.

I wrote some automatic backup software in Python 2 earlier this year.  It
had to work on ext3/Linux (where filenames are natively octet-strings) and
on NTFS/Win32 (where filenames are natively unicode-strings).  I had to be
ridiculously careful to always use unicode paths on Win32, and to always
use str paths on Linux, because otherwise Python would do the conversion
automatically---poorly.

It was particularly bad on Win32, where if you used os.listdir() with a
non-unicode path (Python 2.x str object) in a directory that contained
non-ascii filenames, Windows would invent filenames that looked similar but
couldn't actually be found when using open().  So, naive (Python 2) code
like this would break:

    for filename in os.listdir("."):
        f = open(filename, "rb")
        # ...

On Linux, it was bad too, since if you used unicode paths, the filenames
actually opened would depend on your LANG or LC_CTYPE or LC_ALL environment
variables, and those could vary from one system to another, or even from
one invocation of the program to another.

The simple fact of the matter is that pathnames on Linux are _not_ Unicode,
and pathnames on Windows are _not_ octet strings.  They're fundamentally
incompatible types that can only be reconciled when you make assumptions
(e.g. specifying a character encoding) that allow you to convert from one
to the other.

Ideally, io.open(), os.listdir(), os.path.*, etc. would accept _only_
pathnames in their native format, and it would be the job of a wrapper to
provide a portable-but-less-robust interface on top of that.  Perhaps the
built-in functions would use the wrapper (with reasonable defaults), but
the native-only interface should be there for module-writers who want
robust pathname handling.
msg73992 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-28 21:31
I'd like to propose yet another approach: make sure that conversion
according to the file system encoding always succeeds. If an
unconvertable byte is detected, map it into some private-use character.
To reduce the chance of conflict with other people's private-use
characters, we can use some of the plane 15 private-use characters, e.g.
map byte 0xPQ to U+F30PQ (in two-byte Unicode mode, this would result in
a surrogate pair).

This would make all file names accessible to all text processing
(including glob and friends); UI display would typically either report
an encoding error, or arrange for some replacement glyph to be shown.

There are certain variations of the approach possible, in case there is
objection to a specific detail.
msg73999 - (view) Author: Dwayne Litzenberger (dlitz) Date: 2008-09-29 00:54
Martin,

Consider this scenario.  On ext3/Linux, assume that UTF-8 is specified
in the system locale.  What would happen if you have two files, named
b"\xf3\xb3\x83\x80\x00" and b"\xc0\x00"?  Under your proposal, the first
file would decode successfully as "\U000f30c0\x00", and the second file
would decode unsuccessfully, so it would be mapped to
"\U000f30c0\x00"---the same thing!

Under your proposal, you could end up with multiple files having the
same filename (from Python's perspective). Python shouldn't break if
somebody deliberately created some weird filenames.  Your proposal would
make it impossible to write a robust remote backup tool in Python 3.

Pathnames on ext3/Linux *are not Unicode*.  Blindly pretending they're
Unicode is a leaky abstraction at best, and a security hole at worst.
msg74000 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-29 02:51
You can call it a leaky abstraction all you want, but most people think
of filenames as text strings most of the time, and we need to somehow
support this, at least for users who agree .  I agree we also need to
support bytes strings (at least on Unix) in order to support backup
routines, and support for bytes in -> bytes out in os.listdir() is meant
for this.  The open() function should also support a pure bytes filename
(and almost does so -- _fileio does, but io.py doesn't yet). 
os.getcwd() is a weird case and will probably need to be given a flag to
make it return bytes (I don't like that style of API much, but the
alternative is perhaps worse -- os.getcwd_bytes()).

Conclusion: I support patches that make the I/O library work with either
bytes or strings.  (It's OK if the bytes don't actually work on Windows,
where the native type is apparently strings -- though it has a bytes API
too, doesn't it?)
msg74006 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-29 04:45
> Consider this scenario.  On ext3/Linux, assume that UTF-8 is specified
> in the system locale.  What would happen if you have two files, named
> b"\xf3\xb3\x83\x80\x00" and b"\xc0\x00"?  Under your proposal, the first
> file would decode successfully as "\U000f30c0\x00", and the second file
> would decode unsuccessfully, so it would be mapped to
> "\U000f30c0\x00"---the same thing!

Correct.

> Under your proposal, you could end up with multiple files having the
> same filename (from Python's perspective). Python shouldn't break if
> somebody deliberately created some weird filenames.

I'm not so sure about that. Practicality beats purity.

> Your proposal would
> make it impossible to write a robust remote backup tool in Python 3.

There could be an option to set the file system encoding via an API
to some known safe value, such as Latin-1, or ASCII. If you set the
file system encoding to Latin-1, this escaping would never happen;
if you set it to ASCII, it would happen uniformly for all non-ASCII
bytes. The robust backup tool would have to know to set this option
on POSIX systems.

> Pathnames on ext3/Linux *are not Unicode*.  Blindly pretending they're
> Unicode is a leaky abstraction at best, and a security hole at worst.

I think most Linux users would disagree, and claim that file names are
indeed character strings (which is synonym to "being Unicode"). It is
technically true that it's possible to create file names which are not
text, but that's really a bug, not a feature - Unix and POSIX were never
intended to work this way. Also, in the overwhelming majority of Python
applications, consistent support for practically-existing systems
matters more than robustness against malicious users.
msg74007 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-29 04:47
> I agree we also need to
> support bytes strings (at least on Unix) in order to support backup
> routines

How about letting such applications set the file system encoding to
Latin-1?
msg74008 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-29 05:11
James Knight points out that UTF-8b can be used to give unambiguous
round-tripping of characters in a UTF-8 locale. So I would like to amend
my previous proposal:
- for a non-UTF-8 encoding, use private-use characters for roundtripping
- if the locale's charset is UTF-8, use UTF-8b as the file system encoding.
msg74027 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-29 12:51
About os.getcwd(), another solution is merge_os_getcwd_getcwdu.patch: 
os.getcwd() always return unicode string and raise an error on unicode 
decode error. Wheras os.getcwd(bytes=True) always return bytes. 

The old function os.getcwdu() is removed since os.getcwd() already 
return unicode string.

Note: current version of os.getcwd() uses the wrong encoding to 
conversion bytes to unicode: it uses PyUnicode_FromString() instead of 
PyUnicode_Decode(..., Py_FileSystemDefaultEncoding, "strict") (as does 
getcwdu()).
msg74032 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-29 15:58
As Steven Bethard proposed, here is a new version of my getcwd() 
patch: instead of adding a keyword argument "bytes", I created a 
function getcwdb():
 * os.getcwd() -> unicode
 * os.getcwdb() -> bytes

In Python2 it was:
 * os.getcwd() -> str (bytes)
 * os.getcwdu() -> unicode
msg74059 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-30 01:08
Patch python3_bytes_filename.patch:
 - open() support bytes
 - listdir(unicode) -> only unicode, *skip* invalid filenames 
   (as asked by Guido)
 - remove os.getcwdu()
 - create os.getcwdb() -> bytes
 - glob.glob() support bytes
 - fnmatch.filter() support bytes
 - posixpath.join() and posixpath.split() support bytes

Mixing bytes and str is invalid. Examples raising a TypeError:
 - posixpath.join(b'x', 'y')
 - fnmatch.filter([b'x', 'y'], '*')
 - fnmatch.filter([b'x', b'y'], '*')
 - glob.glob1('.', b'*')
 - glob.glob1(b'.', '*')

TODO:
 - review this patch :-)
 - support non-ASCII bytes in fnmatch.filter()
 - fix other functions, eg. posixpath.isabs() and 
fnmatch.fnmatchcase()
 - fix functions written in C: grep FileSystemDefaultEncoding
 - make sure that mixing bytes and str is rejected
msg74080 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-30 15:21
Here is a patch that solves the issue in a different way: it introduces
sys.setfilesystemencoding. If applications invoke
sys.setfilesystemencoding("iso-8859-1"), all file names can be
successfully converted into a character string.
msg74083 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-30 17:04
On Tue, Sep 30, 2008 at 8:21 AM, Martin v. Löwis <report@bugs.python.org> wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> Here is a patch that solves the issue in a different way: it introduces
> sys.setfilesystemencoding. If applications invoke
> sys.setfilesystemencoding("iso-8859-1"), all file names can be
> successfully converted into a character string.

I'm not opposed to this going in as well, but I don't think it's the
right approach, as it can cause severe cases of mojibake (which you
have strongly opposed in the past). It's quite orthogonal to Victor's
patch IMO.
msg74101 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-30 22:02
As I wrote, python3_bytes_filename.patch was just an initial support 
for bytes filename. So as asked by Guido, here is a new version of my 
patch.

Changes:
 - for all functions, support bytes as well as bytearray
 - os.readlink(unicode) -> unicode and raise an error if unicode 
conversion fails. Note: os.readlink(bytes)->bytes was already working.
 - many changes in posixpath to fix all functions: add many "if 
isinstance(...):" and repeat sep / curdir / parent / ... in bytes
 - current version of test_posixpath contains a duplicate to 
test_splitdrive() and test_normcase() calls normcase() twice which is 
wrong (fixed in my patch)
 - i used copy/paste + conversion to bytes to test posixpath with 
bytes arguments
 - i added some checks in posixpath tests to reject mixing bytes + str
 - fix quoting style
 - factorize pattern compilation in fnmatch
 - fnmatch.fnmatchcase() supports bytes
 - fix test_unicode_file: replace getcwdu() by getcwd(), and sometimes 
getcwd() by getcwdb()

Open issues:
 - pwd.getpwnam() and grp.getgrpnam() should accept bytes, and then
   expanduser() should use pwd with bytes. Now expanduser() 
   supposes that an username is an ASCII string and the user
   directory can be converted using getfilesystemencoding()
 - expandvars() doesn't support non-ASCII variable value:
   that's new problem. os.environ key should be str or bytes?
   And the value: str or bytes? It str is choosen, what is the
   charset to convert str to bytes?
msg74173 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-02 16:43
Martin, can you check in your changes to add sys.setfilesystemencoding()?

I will check in Victor's changes (with some edits).

Together this means that the various suggested higher-level solutions
(like returning path-like objects, or some kind of roudtripping
almost-but-not-quite-utf-8 encoding) can be implemented in pure Python.
msg74186 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-02 18:57
I have committed a somewhat modified version of Victor's latest patch as
r66743.

Assigning to MvL now for the sys.setfilesystemencoding() commit.  Then
you can hand it over to the doc folks; something derived from
http://wiki.python.org/moin/Python3UnicodeDecodeError should probably be
added to the Doc tree.
msg74192 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-02 19:35
> Martin, can you check in your changes to add sys.setfilesystemencoding()?

Will do tomorrow.
msg74222 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-03 01:45
Here is a patch for Windows:

The failing tests on buildbots now pass
(test_fnmatch test_posixpath test_unicode_file)

test_ntpath also runs functions with bytes.

I suppose macpath.py is broken as well.
msg74236 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-03 10:38
Le Friday 03 October 2008 03:45:44 Amaury Forgeot d'Arc, vous avez écrit :
> Here is a patch for Windows: (...)
> test_ntpath also runs functions with bytes.

Which charset is used when you use bytes filename? I read somewhere that it's 
the "current codepage". How can the user get this codepage in Python? I ask 
this to complete my document:
  http://wiki.python.org/moin/Python3UnicodeDecodeError

Don't hesitate to edit directly the document, which may also be moved to 
Python3 Doc/ directory.

You should also support bytearray() in ntpath:
   isinstance(path, (bytes, bytearray))

The unit tests might use pure unicode on Windows and bytes on Linux, 
especially getcwd() vs getcwdb().

I don't have Windows nor Mac to test bytes filenames on these systems.
msg74237 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-03 10:43
> Which charset is used when you use bytes filename?

It's the "ANSI" code page, which is a system-wide admin-modifiable
indirection to some real code page (changing it requires a reboot).
In the API, it's referred to as CP_ACP. It's also related to the
"multi-byte" API, which has caused Mark Hammond to call the codec
invoking it "mbcs" (IOW, "mbcs" is always the codec name for the
file system encoding). The specific code page that CP_ACP denotes
can be found with locale.getpreferredencoding(). Using that codec
name (which might be e.g. "cp1252") is different from using "mbcs",
as that goes through a regular (table-driven) Python codec. In
particular, the Python codec will report errors, whereas the "mbcs"
codec will find replacement characters.
msg74240 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-10-03 11:31
> You should also support bytearray() in ntpath:
>    isinstance(path, (bytes, bytearray))

The most generic way of allowing all bytes-alike objects is to write:
    path = bytes(path)

It raises a TypeError if `path` can't export a read-only buffer of
contiguous bytes; also, it is a no-op if `path` is already a bytes
object, so very cheap in the common case.
msg74241 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-03 11:43
> The most generic way of allowing all bytes-alike objects is to write:
>     path = bytes(path)

If you use that, any unicode may fails and the function will always return 
unicode. The goal is to get:
  func(bytes)->bytes
  func(bytearray)->bytes (or maybe bytearray, it doesn't matter)
  func(unicode)->unicode
msg74242 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-10-03 11:52
Le vendredi 03 octobre 2008 à 11:43 +0000, STINNER Victor a écrit :
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> > The most generic way of allowing all bytes-alike objects is to write:
> >     path = bytes(path)
> 
> If you use that, any unicode may fails and the function will always return 
> unicode. The goal is to get:
>   func(bytes)->bytes
>   func(bytearray)->bytes (or maybe bytearray, it doesn't matter)
>   func(unicode)->unicode

Then make it:

    path = path if isinstance(path, str) else bytes(path)
msg74246 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-03 12:23
path=path is useless most of the code (unicode path), this code is 
faster if both cases (bytes or unicode)!
   if not isinstance(path, str):
      path = bytes(path)

* a if b else c: unicode=0.756730079651; bytes=1.93071103096
* if test: path=...: unicode=0.681571006775; bytes=1.88843798637
msg74255 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-03 16:11
I've committed sys.setfilesystemencoding as r66769.

Declaring it as a documentation issue now. Not sure whether it should
remain a release blocker; IMO, the documentation can still be produced
after the release.
msg74256 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-03 16:34
Reducing priority to critical, it's just docs and tweaks from here.

 You should also support bytearray() in ntpath:
>    isinstance(path, (bytes, bytearray))

No, you shouldn't.  I changed my mind on this several times and in the
end figured it's good enough to just support bytes and str instances.

Amaury: I've reviewed your patch and ran test_ntpath.py on a Linux box.
 I get this traceback:

======================================================================
ERROR: test_relpath (__main__.TestNtpath)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_ntpath.py", line 188, in test_relpath
    tester('ntpath.relpath("a")', 'a')
  File "Lib/test/test_ntpath.py", line 22, in tester
    gotResult = eval(fn)
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
530, in relpath
    start_list = abspath(start).split(sep)
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
499, in abspath
    path = join(os.getcwd(), path)
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
137, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes
----------------------------------------------------------------------

The fix is to change the fallback abspath to this code:

    def abspath(path):
        """Return the absolute version of a path."""
        if not isabs(path):
            if isinstance(path, bytes):
                cwd = os.getcwdb()
            else:
                cwd = os.getcwd()
            path = join(cwd, path)
        return normpath(path)

Once you fix that please check it in!
msg74257 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-03 16:35
Assigning to Amaury for Windows fix first.
msg74266 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-03 18:40
Thanks for testing the non-Windows part of ntpath.
Committed patch in r66777.

Leaving the issue open: macpath.py should certainly be modified.
msg74267 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-03 18:51
Sorry Amaury, but there's another issue.

test_ntpath now fails when run with -bb:

======================================================================
ERROR: test_expandvars (__main__.TestNtpath)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_ntpath.py", line 151, in test_expandvars
    tester('ntpath.expandvars("$foo bar")', "bar bar")
  File "Lib/test/test_ntpath.py", line 10, in tester
    gotResult = eval(fn)
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
344, in expandvars
    if c in ('\'', b'\''):   # no expansion within single quotes
BytesWarning: Comparison between bytes and string

======================================================================
ERROR: test_normpath (__main__.TestNtpath)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_ntpath.py", line 120, in test_normpath
    tester("ntpath.normpath('A//////././//.//B')", r'A\B')
  File "Lib/test/test_ntpath.py", line 10, in tester
    gotResult = eval(fn)
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
465, in normpath
    if comps[i] in ('.', '', b'.', b''):
BytesWarning: Comparison between bytes and string

======================================================================
ERROR: test_relpath (__main__.TestNtpath)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_ntpath.py", line 188, in test_relpath
    tester('ntpath.relpath("a")', 'a')
  File "Lib/test/test_ntpath.py", line 10, in tester
    gotResult = eval(fn)
  File "<string>", line 1, in <module>
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
534, in relpath
    start_list = abspath(start).split(sep)
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
504, in abspath
    return normpath(path)
  File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line
465, in normpath
    if comps[i] in ('.', '', b'.', b''):
BytesWarning: Comparison between bytes and string
msg74268 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-03 18:54
FWIW, I don't see a need to change macpath.py -- it's only used for
MacOS 9 and the occasional legacy app.  OSX uses posixpath.py.
msg74270 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-03 20:35
Committed r66779: test_ntpath now passes with the -bb option.

It seems that the Windows buildbots do not set -bb.
msg74271 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-10-03 20:55
Thanks Amaury!

On to Georg for doc tweaks.  Summary:

- all the os.path functions now work on bytes as well, on all platforms
- only on Unix (but not OSX) do we recommend using bytes
- os.getcwdu() no longer exists
- os.getcwdb() returns bytes
- os.listdir(<str>) skips undecodable entries (previously it returned a
mixture of str and bytes instances)
- open() accepts bytes as filename

Stuff that didn't change but that you might want to mention:

- all the syscalls in os support bytes args; readlink() and listdir()
return bytes if the arg is bytes
- getcwd() may raise UnicodeDecodeError

Martin already documented sys.setfilesystemencoding().
msg74275 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-03 21:21
I have a patch for macpath.py nonetheless.
Tested on Windows (of course ;-) but all functions are pure text 
manipulation, except realpath(). It was much easier than ntpath.py.

I also added tests for three functions which were not exercised at all.
msg74276 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-03 21:50
Amaury, you're patch looks good.
msg74277 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-10-03 21:57
Committed macpath.py in r66781.
msg74409 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-06 23:03
Would it possible to close this issue since os.listdir() is fixed and 
many other related functions (posix, posixpath, ntpath, macpath, etc.) 
are also fixed? I propose to open new issues for new bugs since this 
issue becomes a little big long :)

Eg. see new issues #4035 and #4036!
msg74412 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-06 23:13
> Would it possible to close this issue since os.listdir() is fixed and 
> many other related functions (posix, posixpath, ntpath, macpath, etc.) 
> are also fixed?

IIUC, these fixes are still not complete: they lack documentation
changes. Of course, it would have been better if the original patches
already contained the necessary documentation and test suite changes.
See msg74271 for what Guido considers the lacking documentation;
you may find that other aspects also need documentation.

As for test cases: it seems that those got waived, in the hurry.
msg74414 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-06 23:34
Le Tuesday 07 October 2008 01:13:22 Martin v. Löwis, vous avez écrit :
> IIUC, these fixes are still not complete: they lack documentation
> changes. (...) Of course, it would have been better if the original patches 
> already contained the necessary documentation and test suite changes.

Most (or all) patches include new tests about bytes. Here is a patch for 
os.rst documentation about listdir(), getcwdb() and readlink().

> See msg74271 for what Guido considers the lacking documentation;
> you may find that other aspects also need documentation.

I wrote a long document about bytes for filenames but not only. I'm still 
waiting for some contributors or reviewers:
http://wiki.python.org/moin/Python3UnicodeDecodeError

> As for test cases: it seems that those got waived, in the hurry.

Can you be more precise? Which tests have to be improved/rewritten?
msg74426 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-07 07:12
> Most (or all) patches include new tests about bytes. Here is a patch for 
> os.rst documentation about listdir(), getcwdb() and readlink().

Thanks! Committed as r66829.

I've added additional documentation in r66830, which should complete
Guido's list of things to be documented. So the issue can be closed
now.

>> See msg74271 for what Guido considers the lacking documentation;
>> you may find that other aspects also need documentation.
> 
> I wrote a long document about bytes for filenames but not only. I'm still 
> waiting for some contributors or reviewers:
> http://wiki.python.org/moin/Python3UnicodeDecodeError

We should discuss that on python-dev, of course - the question is
whether additional documentation patches are needed in response to
this specific change.

>> As for test cases: it seems that those got waived, in the hurry.
> 
> Can you be more precise? Which tests have to be improved/rewritten?

I was probably looking at the wrong patches (such as getcwd_bytes.patch,
merge_os_getcwd_getcwdu.patch, etc); I now see that the final patch did
have tests. I recommend that patches that get superseded by other
patches are removed from the issue. The won't be deleted; it's still
possible to navigate to them through the History at the bottom of the
issue.
History
Date User Action Args
2008-10-07 07:12:42loewissetstatus: open -> closed
2008-10-07 07:12:22loewissetmessages: + msg74426
2008-10-06 23:34:57hayposetfiles: + library_os_doc.patch
messages: + msg74414
2008-10-06 23:13:21loewissetmessages: + msg74412
2008-10-06 23:03:18hayposetmessages: + msg74409
2008-10-03 21:57:37amaury.forgeotdarcsetmessages: + msg74277
2008-10-03 21:50:09benjamin.petersonsetmessages: + msg74276
2008-10-03 21:21:24amaury.forgeotdarcsetfiles: + macpath.patch
messages: + msg74275
2008-10-03 20:55:27gvanrossumsetassignee: amaury.forgeotdarc -> georg.brandl
messages: + msg74271
2008-10-03 20:35:15amaury.forgeotdarcsetmessages: + msg74270
2008-10-03 18:54:35gvanrossumsetresolution: accepted
messages: + msg74268
2008-10-03 18:52:02gvanrossumsetassignee: amaury.forgeotdarc
messages: + msg74267
2008-10-03 18:40:33amaury.forgeotdarcsetassignee: amaury.forgeotdarc -> (no value)
messages: + msg74266
2008-10-03 16:35:33gvanrossumsetassignee: georg.brandl -> amaury.forgeotdarc
messages: + msg74257
2008-10-03 16:34:57gvanrossumsetpriority: release blocker -> critical
messages: + msg74256
2008-10-03 16:11:02loewissetassignee: loewis -> georg.brandl
messages: + msg74255
components: + Documentation, - Library (Lib)
nosy: + georg.brandl
2008-10-03 12:23:29hayposetmessages: + msg74246
2008-10-03 11:52:56pitrousetmessages: + msg74242
2008-10-03 11:43:52hayposetmessages: + msg74241
2008-10-03 11:31:37pitrousetmessages: + msg74240
2008-10-03 10:43:51loewissetmessages: + msg74237
2008-10-03 10:38:10hayposetmessages: + msg74236
2008-10-03 01:45:43amaury.forgeotdarcsetfiles: + win32-bytes-filenames.patch
messages: + msg74222
2008-10-02 21:04:26bboissinsetnosy: + bboissin
2008-10-02 20:54:44djcsetnosy: + djc
2008-10-02 19:35:20loewissetmessages: + msg74192
2008-10-02 18:57:45gvanrossumsetassignee: gvanrossum -> loewis
messages: + msg74186
2008-10-02 17:10:54hayposetfiles: - python3_bytes_filename-2.patch
2008-10-02 17:10:49hayposetfiles: + python3_bytes_filename-3.patch
2008-10-02 16:43:48gvanrossumsetassignee: gvanrossum
messages: + msg74173
2008-10-02 12:56:28barrysetpriority: deferred blocker -> release blocker
2008-09-30 22:02:30hayposetfiles: + python3_bytes_filename-2.patch
messages: + msg74101
2008-09-30 17:04:27gvanrossumsetmessages: + msg74083
2008-09-30 15:21:16loewissetfiles: + setfsenc.diff
messages: + msg74080
2008-09-30 01:08:11hayposetfiles: + python3_bytes_filename.patch
messages: + msg74059
2008-09-29 15:58:46hayposetfiles: + os_getcwdb.patch
messages: + msg74032
2008-09-29 14:45:27draghuramsetnosy: + draghuram
2008-09-29 12:51:32hayposetfiles: + merge_os_getcwd_getcwdu.patch
messages: + msg74027
2008-09-29 05:11:59loewissetmessages: + msg74008
2008-09-29 04:47:17loewissetmessages: + msg74007
2008-09-29 04:45:30loewissetmessages: + msg74006
2008-09-29 02:51:47gvanrossumsetmessages: + msg74000
2008-09-29 00:54:36dlitzsetmessages: + msg73999
2008-09-28 21:31:29loewissetnosy: + loewis
messages: + msg73992
2008-09-27 15:12:05dlitzsetmessages: + msg73926
2008-09-27 14:50:29hayposetfiles: + getcwd_bytes.patch
messages: + msg73925
2008-09-27 11:49:42hayposetfiles: - invalid_filename.patch
2008-09-27 11:49:38hayposetfiles: - filename.py
2008-09-27 01:23:00benjamin.petersonsetmessages: + msg73911
2008-09-27 01:15:45gvanrossumsetmessages: + msg73910
2008-09-27 00:48:24benjamin.petersonsetfiles: + force_unicode.patch
2008-09-27 00:48:16benjamin.petersonsetfiles: - force_unicode.patch
2008-09-27 00:47:19benjamin.petersonsetfiles: + force_unicode.patch
messages: + msg73909
2008-09-26 22:20:09barrysetpriority: release blocker -> deferred blocker
2008-09-24 00:41:01hayposetmessages: + msg73688
2008-09-24 00:01:28gvanrossumsetmessages: + msg73680
2008-09-23 23:05:41benjamin.petersonsetfiles: + raise_decoding_errors.patch
messages: + msg73678
2008-09-21 23:48:43benjamin.petersonsetfiles: - oslistdir_string.patch
2008-09-21 23:48:36benjamin.petersonsetfiles: + warn_at_the_end.patch
2008-09-21 23:48:14benjamin.petersonsetfiles: + listdir_encoding_warning.patch
messages: + msg73540
2008-09-21 23:44:28benjamin.petersonsetfiles: - listdir_bytes_warning.patch
2008-09-21 22:05:40amaury.forgeotdarcsetmessages: + msg73535
2008-09-21 21:44:35benjamin.petersonsetfiles: + listdir_bytes_warning.patch
messages: + msg73534
2008-09-18 07:39:29HWJsetmessages: + msg73362
2008-09-18 05:43:10barrysetpriority: deferred blocker -> release blocker
2008-09-09 13:11:44barrysetpriority: release blocker -> deferred blocker
2008-09-04 12:03:25zegreeksetnosy: + zegreek
messages: + msg72495
2008-08-26 18:15:20dlitzsetmessages: + msg71991
2008-08-26 17:37:46dlitzsetnosy: + dlitz
2008-08-26 17:14:06amaury.forgeotdarclinkissue3688 superseder
2008-08-22 19:05:14gvanrossumsetmessages: + msg71769
2008-08-22 16:54:50hayposetfiles: + glob1_bytes.patch
messages: + msg71757
2008-08-22 16:53:49hayposetfiles: + fnmatch_bytes.patch
messages: + msg71756
2008-08-22 15:33:59hayposetfiles: + io_byte_filename.patch
messages: + msg71752
2008-08-22 15:19:41hayposetfiles: + posix_path_bytes.patch
messages: + msg71751
2008-08-22 14:53:29gvanrossumsetmessages: + msg71749
2008-08-22 14:21:43hayposetfiles: + invalid_filename.patch
messages: + msg71748
2008-08-21 23:59:09gvanrossumsetmessages: + msg71705
2008-08-21 23:35:51benjamin.petersonsetmessages: + msg71700
2008-08-21 23:31:58gvanrossumsetmessages: + msg71699
2008-08-21 20:55:33hayposetmessages: + msg71680
2008-08-21 16:17:46gvanrossumsetnosy: + gvanrossum
messages: + msg71655
2008-08-21 15:09:20pitrousetmessages: + msg71648
2008-08-21 14:59:54hayposetmessages: + msg71647
2008-08-21 12:55:42pitrousetmessages: + msg71629
2008-08-21 10:59:28hayposetfiles: + filename.py
messages: + msg71624
2008-08-21 08:51:11pitrousetmessages: + msg71615
2008-08-21 08:20:41hayposetmessages: + msg71612
2008-08-21 03:07:23barrysetpriority: deferred blocker -> release blocker
2008-08-20 09:37:30pitroulinkissue3616 superseder
2008-08-20 09:37:21pitrousetnosy: + haypo
messages: + msg71525
2008-08-10 00:17:16benjamin.petersonsetpriority: critical -> deferred blocker
messages: + msg70953
2008-08-09 17:35:42pitrousetpriority: critical
type: crash -> behavior
messages: + msg70943
nosy: + pitrou
2008-06-25 02:32:48benjamin.petersonsettitle: os.walk - strange bug -> os.listdir can return byte strings
2008-06-24 14:09:31HWJsetmessages: + msg68689
2008-06-24 13:37:48amaury.forgeotdarcsetmessages: + msg68688
2008-06-24 13:21:04benjamin.petersonsetfiles: + oslistdir_string.patch
keywords: + patch
messages: + msg68686
2008-06-24 13:12:52amaury.forgeotdarcsetstatus: closed -> open
resolution: not a bug -> (no value)
messages: + msg68685
2008-06-24 13:02:56benjamin.petersonsetstatus: open -> closed
resolution: not a bug
messages: + msg68684
nosy: + benjamin.peterson
2008-06-24 11:45:34amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg68679
2008-06-24 10:28:25HWJcreate