classification
Title: os.listdir() returns unusable bytes result on Windows
Type: behavior
Stage: test needed
Components: Extension Modules, Unicode, Windows
Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, eric.araujo, ezio.melotti, jkloth, larry, loewis, pitrou, r.david.murray, serhiy.storchaka, techtonik, vstinner
Priority: normal
Keywords: patch

Created on 2012-12-10 11:41 by techtonik, last changed 2012-12-16 16:25 by pitrou. This issue is now closed.

Files
File name Uploaded Description
tests.py techtonik, 2012-12-12 11:47
python2.out.txt techtonik, 2012-12-12 11:47
python3.out.txt techtonik, 2012-12-12 11:47
test_unicode_fname.py serhiy.storchaka, 2012-12-12 16:19 Test listdir(), stat() and walk() on Unicode filenames
py27fname.log techtonik, 2012-12-13 22:26
py33fname.log techtonik, 2012-12-13 22:26
listdir_unicode-2.7.patch serhiy.storchaka, 2012-12-14 21:02
Messages (37)
msg177276 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-10 11:41
This critical bug is one of the reasons that non-English-speaking communities don't adopt Python as broadly as the English-speaking world does, compared to other technologies (PHP etc.).


# -*- coding: utf-8 -*-

import os

os.mkdir(u'Русское имя')
os.mkdir(u'English name')

for r, dirs, files in os.walk('.'):
  print dirs


This gives:
['English name']
[]


Windows Vista.
>dir /b
English name
test.py
Русское имя
msg177279 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-10 11:49
Is it reproducible on 3.x?
msg177281 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-10 12:28
No.
msg177282 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-10 12:30
Oops, clicked submit too soon.

It isn't likely to get fixed in 2.7, because 2.7's unicode support problems are the major reason python3 was developed.
msg177283 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-10 12:34
For that matter, it isn't reproduced in python2.7, either:

>>> for r, dirs, files in os.walk(u'.'):
...   print dirs
... 
[u'\u0420\u0443\u0441\u0441\u043a\u043e\u0435 \u0438\u043c\u044f']
[]
msg177284 - (view) Author: Jeremy Kloth (jkloth) * Date: 2012-12-10 12:34
The problem exhibited is not coming from the os.walk() implementation, but from the use of a byte-string as the argument to it.

The directories are created with unicode literals and therefore the argument must also be a unicode literal (u'.') for them to be shown.  See the note in the listdir() documentation.

As it stands, I suggest that this is closed as invalid, or at the minimum that it could be a documentation bug for walk() not also having a similar note as listdir().
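
To illustrate the point about argument types, here is a minimal Python 3 sketch (not the reporter's script). It assumes a scratch directory created with tempfile and reuses the directory names from the report; the type of the argument to os.walk() decides the type of the names it returns.

```python
import os
import tempfile

# Hypothetical scratch directory; the names mirror the report above.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'Русское имя'))
    os.mkdir(os.path.join(root, 'English name'))

    # Text argument -> text names.
    _, text_dirs, _ = next(os.walk(root))
    # Bytes argument -> bytes names.
    _, bytes_dirs, _ = next(os.walk(os.fsencode(root)))

    # ascii() keeps the output printable on any console codepage.
    print(ascii(sorted(text_dirs)))
    print(ascii(sorted(bytes_dirs)))
```

On Python 2.7 the equivalent distinction is between os.walk(u'.') and os.walk('.').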
msg177285 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-10 12:39
Works for me without the u'.', too, though less usefully:

>>> for r, dirs, files in os.walk('.'):
...   print dirs
... 
['\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xbe\xd0\xb5 \xd0\xb8\xd0\xbc\xd1\x8f']

Maybe that doesn't work on Windows, though.  I am, of course, assuming that python3 does the right thing on Windows, but I can't imagine Victor would have overlooked that.
msg177331 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-11 10:51
In Python 3 it fails with UnicodeEncodeError in "C:\Python33\lib\encodings\cp437.py", while Vista's 'dir' command shows everything correctly in the same console, so somebody definitely overlooked that aspect.

This bug is clearly an issue for developers who write products for international markets. It is neither out of date nor invalid. A note in red in the documentation is a must-have; a warning should also be issued in warning mode when os.walk() ignores international dirs. I doubt there are many people who are aware of this racist behavior and want it to be the default.
msg177333 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-12-11 11:09
- Do you have a full traceback of the failing os.walk() in Python3.3?
- What's the result of os.listdir(u'.') ?
msg177335 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-11 11:24
What are the results of os.listdir(b'.') and os.listdir(u'.') on Python 2.7 and Python 3.3+?

What are the results of os.stat(b'Русское имя') and os.stat(u'Русское имя') on Python 2.7 and Python 3.3+?

What are the results of sys.getdefaultencoding(), sys.getfilesystemencoding(), locale.getpreferredencoding(False) and locale.getpreferredencoding(True) on Python 2.7 and Python 3.3+?

If any of those calls fail, please provide a full traceback.
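
A small sketch that gathers the encoding settings asked about above in one run. The sys.stdout.encoding entry is an addition that was not part of the request; it is included on the assumption that the console encoding is relevant to the print() failures.

```python
import locale
import sys

# Collect the encoding settings in one place so they can be pasted together.
settings = {
    'sys.getdefaultencoding()': sys.getdefaultencoding(),
    'sys.getfilesystemencoding()': sys.getfilesystemencoding(),
    'locale.getpreferredencoding(False)': locale.getpreferredencoding(False),
    'locale.getpreferredencoding(True)': locale.getpreferredencoding(True),
    # Not requested above; added as an assumption that it matters for print().
    'sys.stdout.encoding': getattr(sys.stdout, 'encoding', None),
}
for name, value in settings.items():
    print('%s = %r' % (name, value))
```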
msg177339 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-11 14:35
My guess is that your unicode issue is issue 1602, which is non-trivial to solve.
msg177343 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-11 16:56
> My guess is that your unicode issue is issue 1602, which is non-trivial to solve.

In that case the output would be something like:

['English name', '']
[]
[]
msg177370 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-12 11:04
>
> - Do you have a full traceback of the failing os.walk() in Python3.3?
>

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(dirs)
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
18-24: character maps to <undefined>

> - What's the result of os.listdir(u'.') ?
>

>python3 -c "import os; print(os.listdir(u'.'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
41-47: character maps to <undefined>

>python2 -c "import os; print(os.listdir(u'.'))"
[u'English name', u'test.py', u'test2.py',
u'\u0420\u0443\u0441\u0441\u043a\u043e\u0435 \u0438\u043c\u044f']

>python2 -c "import os; print(os.listdir('.'))"
['English name', 'test.py', 'test2.py', '??????? ???']
msg177373 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-12 11:46
I've attached the tests.py file used to run the tests. The results are in python2.out.txt and python3.out.txt, also attached.

> What are the results of os.stat(b'Русское имя') and os.stat(u'Русское имя') on Python 2.7 and Python 3.3+?

b'Русское имя' is not a valid syntax construct in Python 3 even though I have a correct 'coding: utf-8' header and expect the characters to be UTF-8 bytes. Therefore I skipped this test for Python 3.
> python test.py
  File "tests.py", line 23
    print(os.stat(b'\u0420\u0443\u0441\u0441\u043a\u043e\u0435 \u0438\u043c\u044f'))
                   ^
SyntaxError: bytes can only contain ASCII literal characters.
msg177374 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-12 11:54
Thank you for the report, Anatoly. I'll try to investigate this issue.
msg177375 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-12-12 12:28
So, it seems that os.walk() and os.listdir() work correctly with Python3.3, but print(u'Русское имя') fails because the terminal encoding is cp437.

See issue1602 for the print issue.
As a quick workaround, try setting PYTHONIOENCODING=cp437:backslashreplace as suggested in http://wiki.python.org/moin/PrintFails

If nothing is wrong with os.walk() and os.listdir(), this issue should be closed.
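
For illustration, a minimal Python 3 sketch of what the backslashreplace error handler does. It uses str.encode() directly rather than the PYTHONIOENCODING variable, on the assumption that the handler behaves the same way when applied to stdout.

```python
name = 'Русское имя'

# cp437 has no Cyrillic letters, so a strict encode fails just as print() did.
try:
    name.encode('cp437')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With backslashreplace the same text is written out as \uXXXX escapes.
escaped = name.encode('cp437', errors='backslashreplace')
print(strict_failed, escaped)
```

The output is not pretty, but the script no longer stops with a traceback.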
msg177377 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-12 13:19
Anatoly, can you please run the attached test?
msg177378 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-12 13:22
Based on the pasted results I'm pretty sure there's nothing wrong with walk and listdir.  But it sounds like Serhiy will check to make sure, so we'll wait for his report.

The byte string vs the coding cookie is an interesting observation, but is a separate issue and should probably be raised on python-ideas, since I'm guessing the current behavior was a conscious design choice.
msg177416 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-12-13 16:45
Anatoly
> b'Русское имя' is not a valid syntax construct in Python 3 even though I have a
> correct 'coding: utf-8' header and expect the characters to be UTF-8 bytes.

David
> The byte string vs the coding cookie is an interesting observation, but is a separate
> issue and should probably be raised on python-ideas, since I'm guessing the
> current behavior was a conscious design choice.

Yes, it works as designed: the coding cookie is used to decode bytes to characters in unicode literals (e.g. if I have u'Éric' in my source file, not a \u escape); bytes literals are independent of the coding cookie and should always contain only bytes, not characters (including \u escapes), e.g. '\xc3\x89ric' for UTF-8 bytes.
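
A short Python 3 sketch of the distinction described above: a bytes literal with non-ASCII characters is rejected at compile time, while the UTF-8 bytes can be obtained by encoding text explicitly. The '<example>' filename passed to compile() is arbitrary.

```python
# A bytes literal may only contain ASCII characters, so compiling this fails
# regardless of any coding cookie.
try:
    compile("b'Русское имя'", '<example>', 'eval')
    literal_ok = True
except SyntaxError:
    literal_ok = False

# The bytes are obtained by encoding text explicitly, or by writing \x escapes.
utf8_name = 'Русское имя'.encode('utf-8')
eric = 'Éric'.encode('utf-8')

print(literal_ok)
print(utf8_name)
print(eric == b'\xc3\x89ric')
```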
msg177438 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-13 22:25
There is one more problem - when I redirect the output with:

> py test_unicode_fname.py > test.log 2>&1

In Python 2.7 the traceback is at the end of the file; in Python 3.3 it is at the beginning. Therefore I just copied the data from the screen, where it appears in the correct order.

(current mood: Python debugging is a mess)
msg177439 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-12-13 23:10
Anatoly, please file another issue for the 2>&1 mess.
msg177468 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 16:52
Thanks, Anatoly. I see an actual bug. FindFirstFile and FindNextFile return a broken name if the file's Unicode name can't be represented in the current codepage.

I don't know what the perfect solution for this issue is.

On 2.7 we can decode the listdir() argument to unicode and then encode the resulting names to str with sys.getfilesystemencoding() only if that is possible. Therefore listdir() with a str argument will return unicode for non-encodable names. This should not create many new problems in addition to those which 2.7 already has with Unicode.

But on 3.x listdir() with a bytes argument can return only bytes objects. I don't know what to do with non-encodable names in such a case. Perhaps an exception should be raised. Fortunately listdir() with a bytes argument is rarely used on 3.x.
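
The 2.7 idea above can be sketched roughly as follows. This is an illustration in Python 3 syntax, not the attached patch: names_for_bytes_caller is a hypothetical helper, and 'cp1252' merely stands in for whatever sys.getfilesystemencoding() would report on a Western Windows installation.

```python
def names_for_bytes_caller(unicode_names, encoding):
    """Hypothetical helper: encode each name, keep text for unencodable ones."""
    result = []
    for name in unicode_names:
        try:
            result.append(name.encode(encoding))
        except UnicodeEncodeError:
            # The name cannot be represented in this codepage: hand back the
            # text form instead of a lossy '?'-filled byte string.
            result.append(name)
    return result


# 'cp1252' is an assumed stand-in for the ANSI codepage.
mixed = names_for_bytes_caller(['English name', 'Русское имя'], 'cp1252')
print(ascii(mixed))
```

The caller gets a mixed list, which is the price of never returning a name that cannot be used to reach the file again.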
msg177472 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-14 17:11
That's what surrogateescape is for, on linux.  I thought Victor dealt with this a different way in Windows.  Maybe by deprecating the bytes interface :)
msg177478 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 17:48
Surrogateescape is for non-decodable names. Here we have a problem with non-encodable names.

I know that the naive approach of using only the Unicode API internally does not work, because Windows uses complex logic for filename encoding (for example, dropping diacritics). Perhaps Martin has more to say.
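
A minimal Python 3 sketch of the difference between the two cases. The byte string and the 'cp1252' codepage are assumptions chosen for the example.

```python
# Non-decodable bytes: 0xff is never valid UTF-8, but surrogateescape keeps it.
raw = b'caf\xff'
text = raw.decode('utf-8', errors='surrogateescape')
round_trip = text.encode('utf-8', errors='surrogateescape')

# Non-encodable text: a Cyrillic name has no representation in cp1252 at all,
# and surrogateescape does not help in this direction.
try:
    'Русское имя'.encode('cp1252', errors='surrogateescape')
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(ascii(text), round_trip == raw, encodable)
```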
msg177492 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-14 18:46
Ah, I misunderstood your comment.

So, listdir is returning the "correct" filename, it's just that we can't encode it to the console encoding.  So, it is working as expected within the current windows console limitations, if not in a particularly useful fashion.

(That is, listdir/os.walk are *not* ignoring the international dirs.)
msg177494 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 19:00
> Ah, I misunderstood your comment.

Ah, you misunderstood my comment right now.

> So, listdir is returning the "correct" filename, it's just that we can't encode it to the console encoding.

listdir() returns an already irremediably broken filename (all Cyrillic
letters replaced with '?'). My test script outputs only ASCII data; you
see literally what you get, and there are no output encoding issues.
msg177496 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-12-14 19:13
Oh, I remember Victor complaining about that behavior of Windows.  I'm pretty sure it is the windows API and not python that is doing that mangling.  But Victor would know for sure.
msg177503 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2012-12-14 20:14
I'm a little confused.  FindFirstFile is an ANSI API, so we get a narrow string back.  We call PyBytes_FromString(), which expects a narrow string and returns a bytes object.  Who's trying (and failing) to encode the filename?
msg177508 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 21:02
> Who's trying (and failing) to encode the filename?

Windows. The file is created using the Unicode API and stored UTF-16 encoded in
NTFS. Windows fails to represent this filename using the ANSI API.

Here is a patch against 2.7 which always uses the Unicode API in listdir()
and tries to encode filenames to str if a str argument is used.
msg177509 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 21:05
I can't test (or even compile) the patch as I don't have Windows;
please test it for me.
msg177511 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-12-14 21:21
On Windows with Python 2, unencodable characters are replaced with "?". It is the default behaviour of WideCharToMultiByte() and so all ANSI functions have this behaviour. Python doesn't try to behave differently; it just exposes system functions as Python functions.

So for example, os.listdir(bytes) returns filename with "?" if some characters are not encodable to the ANSI codepage. It's a choice in the design of Windows.
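
The effect can be approximated in pure Python 3 with the 'replace' error handler. This is only an approximation under stated assumptions: 'cp1252' stands in for the ANSI codepage, and the real Windows conversion may additionally apply best-fit mappings that this sketch does not model.

```python
name = 'Русское имя'

# errors='replace' substitutes '?' for every character the codepage lacks,
# which is roughly what the ANSI API does by default.
lossy = name.encode('cp1252', errors='replace')
print(lossy)

# The substitution cannot be undone: decoding gives only question marks.
recovered = lossy.decode('cp1252')
print(recovered == name)
```

A name mangled this way no longer identifies the file, so passing it to os.stat() or os.chdir() cannot reach the original directory.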

> This critical bug is one of the reasons that non-English-speaking
> communities don't adopt Python as broadly as the English-speaking
> world does, compared to other technologies (PHP etc.).

I don't understand this point.

PHP doesn't have a Unicode type; I'm quite sure that PHP has exactly the same issue. And this issue is only solved in Python 3... except if you explicitly use a bytes filename (for os.listdir/os.walk), but the bytes filename API has been deprecated in Python 3.3.

In Python 2, you can use Unicode filenames to work around this issue. But it doesn't work as well as in Python 3: on UNIX, you will get a similar issue with undecodable filenames (which is the opposite of unencodable filenames).

Read my book for more information: https://github.com/haypo/unicode_book/wiki

--

About listdir_unicode-2.7.patch: Python chose to work as Windows with unencodable characters. If you want to change the behaviour, you must change *all* calls to the Windows ANSI API (which is not trivial). Anyway, as I wrote, the bytes API is deprecated for filenames in Python 3.3. I prefer to not change anything in Python 2, because it may break existing applications. For example, os.listdir(bytes) doesn't fail in Python 2.7 with unencodable names, whereas it fails with your patch.

Nothing interesting in this issue, I'm closing it. If you consider the redirection issue important, please open a new issue.
msg177512 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-12-14 21:23
> And this issue is only solved in Python 3...

Ooops, I mean: this issue is *already* solved in Python 3
msg177514 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-14 21:33
> For example, os.listdir(bytes) doesn't fail in Python 2.7 with unencodable names, whereas it fails with your patch.

No. The purpose of this patch is that it doesn't fail and should return
a usable result.
msg177526 - (view) Author: anatoly techtonik (techtonik) Date: 2012-12-15 01:14
haypo didn't understand the main user story for this ticket and closed it, so I am reopening it with a simplified user story.

"""As a developer, I want Python's os.listdir('.') function to return all directories in the current directory on Windows, no matter how international they are, in the same way as other Windows applications return them.

I want the returned name to be a normal string, but reusable in subsequent functions, so that I can query stats for this dir or CD to it. It doesn't matter how this name is represented in binary. It can be quoted, in UTF-8 byte code or whatever - I don't care as long as I can access the same name from the same session."""
msg177534 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-12-15 12:45
Anatoly,
- in Python2.7, try this:
    print repr(os.listdir(u'.'))
- in Python3, try this:
    print(ascii(os.listdir('.')))
Do the commands above work correctly?
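
For reference, a self-contained Python 3 sketch of the check suggested above. It assumes a scratch directory created with tempfile instead of the reporter's working directory, and it also confirms that the returned names can be reused in later calls.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'Русское имя'))

    names = os.listdir(root)
    # ascii() escapes non-ASCII characters, so printing cannot hit a
    # UnicodeEncodeError whatever the console codepage is.
    listing = ascii(names)
    print(listing)

    # The returned text names are directly reusable in later calls.
    reusable = all(os.path.isdir(os.path.join(root, n)) for n in names)
    print(reusable)
```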
msg177609 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-16 16:15
This looks like a duplicate of issue13247, and Victor submitted a patch with an approach similar to mine (except that my patch does not raise an exception, but returns unicode for unencodable names).

It looks like a long-standing design bug and perhaps really should be closed as "won't fix".
msg177610 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-12-16 16:25
Indeed, os.listdir() should always be called with a unicode argument under Windows (which is the natural thing to do in 3.x, anyway). The other issue (with print() unable to display some symbols depending on the codepage) is unrelated.
History
Date User Action Args
2012-12-16 16:25:00 pitrou set status: open -> closed; nosy: + pitrou; messages: + msg177610; resolution: wont fix
2012-12-16 16:15:45 serhiy.storchaka set messages: + msg177609
2012-12-15 12:45:19 amaury.forgeotdarc set messages: + msg177534
2012-12-15 01:14:53 techtonik set status: closed -> open; resolution: wont fix -> (no value); messages: + msg177526
2012-12-14 21:33:34 serhiy.storchaka set messages: + msg177514
2012-12-14 21:23:59 vstinner set messages: + msg177512
2012-12-14 21:21:55 vstinner set resolution: wont fix
2012-12-14 21:21:47 vstinner set status: open -> closed; messages: + msg177511
2012-12-14 21:05:22 serhiy.storchaka set messages: + msg177509
2012-12-14 21:02:07 serhiy.storchaka set files: + listdir_unicode-2.7.patch; keywords: + patch; messages: + msg177508
2012-12-14 20:14:54 larry set messages: + msg177503
2012-12-14 19:13:41 r.david.murray set messages: + msg177496
2012-12-14 19:03:57 serhiy.storchaka set title: os.walk ignores international dirs on Windows -> os.listdir() returns unusable bytes result on Windows
2012-12-14 19:00:58 serhiy.storchaka set messages: + msg177494
2012-12-14 18:46:52 r.david.murray set messages: + msg177492
2012-12-14 17:48:36 serhiy.storchaka set messages: + msg177478
2012-12-14 17:11:36 r.david.murray set nosy: + vstinner; messages: + msg177472
2012-12-14 16:52:00 serhiy.storchaka set nosy: + larry, ezio.melotti, loewis; messages: + msg177468; components: + Extension Modules, Unicode, Windows, - Library (Lib)
2012-12-13 23:10:54 amaury.forgeotdarc set messages: + msg177439
2012-12-13 22:26:09 techtonik set files: + py33fname.log
2012-12-13 22:26:01 techtonik set files: + py27fname.log
2012-12-13 22:25:44 techtonik set messages: + msg177438
2012-12-13 16:45:16 eric.araujo set nosy: + eric.araujo; messages: + msg177416
2012-12-12 16:19:40 serhiy.storchaka set files: - test_unicode_fname.py
2012-12-12 16:19:06 serhiy.storchaka set files: + test_unicode_fname.py
2012-12-12 13:22:18 r.david.murray set messages: + msg177378
2012-12-12 13:19:26 serhiy.storchaka set files: + test_unicode_fname.py; messages: + msg177377
2012-12-12 12:28:20 amaury.forgeotdarc set messages: + msg177375
2012-12-12 11:54:56 serhiy.storchaka set messages: + msg177374
2012-12-12 11:47:31 techtonik set files: + python3.out.txt
2012-12-12 11:47:22 techtonik set files: + python2.out.txt
2012-12-12 11:47:10 techtonik set files: + tests.py
2012-12-12 11:46:56 techtonik set messages: + msg177373
2012-12-12 11:04:15 techtonik set messages: + msg177370
2012-12-11 16:56:45 serhiy.storchaka set messages: + msg177343
2012-12-11 14:35:12 r.david.murray set messages: + msg177339
2012-12-11 11:24:34 serhiy.storchaka set messages: + msg177335; stage: resolved -> test needed
2012-12-11 11:09:25 amaury.forgeotdarc set status: pending -> open; nosy: + amaury.forgeotdarc; messages: + msg177333
2012-12-11 10:51:17 techtonik set status: closed -> pending; resolution: not a bug -> (no value); messages: + msg177331
2012-12-10 12:39:13 r.david.murray set messages: + msg177285
2012-12-10 12:34:55 jkloth set nosy: + jkloth; messages: + msg177284
2012-12-10 12:34:17 r.david.murray set resolution: out of date -> not a bug; messages: + msg177283
2012-12-10 12:30:33 r.david.murray set messages: + msg177282; stage: resolved
2012-12-10 12:28:56 r.david.murray set status: open -> closed; nosy: + r.david.murray; messages: + msg177281; resolution: out of date
2012-12-10 11:49:29 serhiy.storchaka set messages: + msg177279
2012-12-10 11:49:13 serhiy.storchaka set messages: - msg177278
2012-12-10 11:48:49 serhiy.storchaka set versions: - Python 3.1; nosy: + serhiy.storchaka; messages: + msg177278; type: behavior
2012-12-10 11:41:44 techtonik create