
Author vstinner
Recipients HWJ, amaury.forgeotdarc, benjamin.peterson, gvanrossum, pitrou, vstinner
Date 2008-08-21.20:55:31
Message-id <200808212255.18585.victor.stinner@haypocalc.com>
In-reply-to <1219335466.91.0.576198067043.issue3187@psf.upfronthosting.co.za>
Content
On Thursday 21 August 2008 18:17:47, Guido van Rossum wrote:
> The proper work-around is for the app to pass bytes into os.listdir();
> then it will return bytes.

In my case, I just want to remove a directory with shutil.rmtree(), and I 
don't know whether its filenames are bytes or characters :-)
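
For reference, the quoted work-around can be sketched as follows. This assumes a current Python 3 where os.fsencode() is available (it did not exist when this message was written); the helper name list_both is made up for illustration.

```python
import os
import tempfile


def list_both(path):
    """Return (str names, bytes names) for the same directory."""
    str_names = sorted(os.listdir(path))                 # str in, str out
    bytes_names = sorted(os.listdir(os.fsencode(path)))  # bytes in, bytes out
    return str_names, bytes_names


with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file so that there is something to list.
    open(os.path.join(tmp, "a.txt"), "w").close()
    str_names, bytes_names = list_both(tmp)
    print(str_names, bytes_names)
```

The caller still has to choose bytes or str up front, which is exactly the difficulty when the directory contents are unknown.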

> It would be nice if open() etc. accepted 
> bytes (as well as strings of course), at least on Unix, but not
> absolutely necessary -- the app could also just know the right encoding.

An invalid filename has no charset: it is just a "raw" byte string. So open(), 
unlink(), etc. have to accept byte strings. Maybe not in the Python-level 
versions, but at least in the low-level (C) ones?

> I see two reasonable alternatives for what os.listdir() should return
> when the input is a string and one of the filenames can't be decoded:
> either omit it from the output list;

That's not a good option: rmtree() will fail because the directory is not 
empty :-/

> or use errors='replace' in the encoding.

That will also fail, because the returned filenames will be wrong (valid 
unicode strings, but names of files that don't exist :-/).
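
A small sketch of why errors='replace' loses the real name. The sample bytes are an assumption: a Latin-1 encoded name that is not valid UTF-8.

```python
raw = b"caf\xe9"  # hypothetical on-disk name: Latin-1 bytes, not valid UTF-8

# The undecodable byte is replaced by U+FFFD (REPLACEMENT CHARACTER).
shown = raw.decode("utf-8", errors="replace")

# Encoding the result back does not give the original bytes, so the
# decoded string no longer names the file that is actually on disk.
round_trip = shown.encode("utf-8")

print(ascii(shown), round_trip == raw)
```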

> Failing the entire os.listdir() call is not acceptable, and 
> neither is returning a mixture of str and bytes instances.

Ok, I have another suggestion:
 - *by default*, listdir() only returns str and raises an error (TypeError?) 
   on an invalid filename
 - add an optional argument (a callback), e.g. "fallback_encoder", to catch
   such errors (similar to "onerror" in shutil.rmtree())

Example of new listdir implementation (pseudo-code):

   charset = sys.getfilesystemencoding()
   dirobj = opendir(path)    # opendir/readdir/closedir: the C-level calls
   try:
      for bytesname in readdir(dirobj):
          try:
              name = str(bytesname, charset)
          except UnicodeDecodeError:
              # let the callback decide what to return (or raise)
              name = fallback_encoder(bytesname)
          yield name
   finally:
      closedir(dirobj)

The default fallback_encoder:

   def fallback_encoder(name):
      raise   # re-raise the UnicodeDecodeError being handled

Keep the raw bytes string:

   def fallback_encoder(name):
      return name

Create my custom filename object:

   class Filename:
      ...

   def fallback_encoder(name):
      return Filename(name)
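
The proposal above can be tried out with a runnable sketch like the one below. It is only an illustration under stated assumptions: decode_names, listdir_fallback, keep_bytes and reraise are made-up names, os.fsencode() comes from later Python 3 releases, and decoding is done strictly on purpose so that the callback is actually reached.

```python
import os
import sys


def decode_names(raw_names, fallback_encoder, charset=None):
    """Decode raw directory entries, delegating failures to the callback."""
    charset = charset or sys.getfilesystemencoding()
    for bytesname in raw_names:
        try:
            name = str(bytesname, charset)  # strict decoding
        except UnicodeDecodeError:
            name = fallback_encoder(bytesname)
        yield name


def listdir_fallback(path, fallback_encoder):
    """Hypothetical listdir() variant: read entries as bytes, then decode."""
    return list(decode_names(os.listdir(os.fsencode(path)), fallback_encoder))


def reraise(name):
    raise  # re-raises the UnicodeDecodeError currently being handled


def keep_bytes(name):
    return name


class Filename:
    """Custom wrapper around an undecodable name."""

    def __init__(self, raw):
        self.raw = raw


# Sample entries (assumed): one valid name and one that is not valid UTF-8.
raw_names = [b"ok.txt", b"caf\xe9"]
mixed = list(decode_names(raw_names, keep_bytes, "utf-8"))
wrapped = list(decode_names(raw_names, Filename, "utf-8"))
```

Passing reraise instead makes the whole iteration fail with UnicodeDecodeError, which matches the proposed default.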

If a callback is overkill, we can just add an option, 
e.g. "keep_invalid_filename=True", to ask listdir() to keep the bytes string if 
the conversion to unicode fails.
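
A minimal sketch of the simpler option, under the same assumptions as before: the function name decode_names_option and the sample names are invented, and only the decoding step is shown, not a real listdir().

```python
def decode_names_option(raw_names, charset="utf-8", keep_invalid_filename=False):
    """Decode names; optionally keep undecodable ones as bytes."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(charset))
        except UnicodeDecodeError:
            if not keep_invalid_filename:
                raise  # default behaviour: report the invalid filename
            names.append(raw)  # opt-in: keep the raw bytes string
    return names


raw_names = [b"ok.txt", b"caf\xe9"]
kept = decode_names_option(raw_names, keep_invalid_filename=True)
```

With the option enabled the result mixes str and bytes, so it only makes sense as an explicit opt-in.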

In any case, open(), unlink(), etc. have to accept byte strings to be able to 
read, copy, or remove files with invalid names. In a perfect world, all 
filenames would be valid UTF-8 strings, but in the real world (think of 
The Matrix :-)), we have to support such strange cases...
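
As an illustration of what bytes-accepting calls look like, here is a sketch for a current Python 3, where open() and os.unlink() do take bytes paths. The file name is an arbitrary ASCII one so that the example runs on any filesystem; on POSIX filesystems that allow them, the same calls should also work for names that cannot be decoded.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build the whole path as bytes; no decoding is involved anywhere.
    raw_path = os.path.join(os.fsencode(tmp), b"data.bin")

    with open(raw_path, "wb") as f:   # open() accepts a bytes path
        f.write(b"payload")

    existed = os.path.exists(raw_path)
    os.unlink(raw_path)               # so does os.unlink()
    remaining = os.listdir(os.fsencode(tmp))
```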