classification
Title: UnicodeDecodeError in ntpath.py when home dir contains non-ascii signs
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Jarek.Śmiejczak, Lin.Wei, docs@python, eryksun, jhonglei, ncoghlan, rbcollins, serhiy.storchaka, steve.dower, vinay.sajip, vstinner
Priority: normal Keywords: easy

Created on 2014-01-06 10:43 by Jarek.Śmiejczak, last changed 2016-09-13 02:02 by eryksun.

Messages (12)
msg207424 - Author: Jarek Śmiejczak (Jarek.Śmiejczak) Date: 2014-01-06 10:43
Full traceback:
https://gist.github.com/jarekps/2729ee1917ea372e6642

The error starts in pip, but after investigating the traceback it looks like a Python issue (version 2.7.5).
Windows version: 8.1 Enterprise x64 with Polish language pack.
Feel free to ask if any additional information is necessary.
msg207425 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-06 10:47
> https://gist.github.com/jarekps/2729ee1917ea372e6642

Copy of the output:
---
C:\Users\Jarosław>pip
Traceback (most recent call last):
File "c:\python27\Scripts\pip-script.py", line 9, in <module>
load_entry_point('pip==1.5', 'console_scripts', 'pip')()
File "c:\python27\lib\site-packages\distribute-0.6.49-py2.7.egg\pkg_resources.py", line 345, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "c:\python27\lib\site-packages\distribute-0.6.49-py2.7.egg\pkg_resources.py", line 2381, in load_entry_point
return ep.load()
File "c:\python27\lib\site-packages\distribute-0.6.49-py2.7.egg\pkg_resources.py", line 2087, in load
entry = __import__(self.module_name, globals(),globals(), ['__name__'])
File "c:\python27\lib\site-packages\pip\__init__.py", line 11, in <module>
from pip.vcs import git, mercurial, subversion, bazaar # noqa
File "c:\python27\lib\site-packages\pip\vcs\subversion.py", line 4, in <module>
from pip.index import Link
File "c:\python27\lib\site-packages\pip\index.py", line 16, in <module>
from pip.wheel import Wheel, wheel_ext, wheel_setuptools_support
File "c:\python27\lib\site-packages\pip\wheel.py", line 23, in <module>
from pip._vendor.distlib.scripts import ScriptMaker
File "c:\python27\lib\site-packages\pip\_vendor\distlib\scripts.py", line 15, in <module>
from .resources import finder
File "c:\python27\lib\site-packages\pip\_vendor\distlib\resources.py", line 105, in <module>
cache = Cache()
File "c:\python27\lib\site-packages\pip\_vendor\distlib\resources.py", line 40, in __init__
base = os.path.join(get_cache_base(), 'resource-cache')
File "c:\python27\lib\ntpath.py", line 108, in join
path += "\\" + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb3 in position 14: ordinal not in range(128)
---

It looks like a bug in distlib.resources, not in Python.

os.path.join() works correctly if all arguments are byte strings (str type). It should also work if all arguments are Unicode strings containing only ASCII characters. (I don't know if it works if all arguments are Unicode strings.)

In your case, it looks like os.path.join() is called with a unicode and a bytes string.
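
The failing step can be reproduced in isolation. The sketch below is only an illustration: it assumes the home directory reached Python as a byte string in the cp1250 ANSI code page (an assumption about the reporter's system), and it performs explicitly the ASCII decode that Python 2 performs implicitly when a byte string meets a unicode string. Written this way it also runs on Python 3.

```python
# Byte-string form of the home directory. The cp1250 code page is an
# assumption; \u0142 is the letter "l with stroke" from the user name.
home_bytes = u"C:\\Users\\Jaros\u0142aw".encode("cp1250")


def decode_as_ascii(raw):
    """Do explicitly what Python 2 does implicitly for str + unicode."""
    return raw.decode("ascii")


failure = None
try:
    decode_as_ascii(home_bytes)
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)
```

Under that assumption the offending byte and its offset line up with the traceback: the non-ASCII letter sits at index 14 of the path and encodes to 0xb3.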
msg207428 - Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2014-01-06 11:07
It's not failing specifically because of distlib or os.path.join functionality: it's failing because, given a Unicode path C:\Users\Jarosław\..., Python is attempting to decode it using the default, ASCII codec. I'll certainly look at updating distlib to handle this case, but the same problem could bite the user in other areas.
msg207429 - Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2014-01-06 12:11
Jarek: I can't easily test this in my environment; perhaps you can help. Could you change, in the file

c:\python27\lib\site-packages\pip\_vendor\distlib\resources.py,

line 40 from

            base = os.path.join(get_cache_base(), 'resource-cache')
to
            base = os.path.join(get_cache_base(), str('resource-cache'))

to see if that resolves the problem? Currently, 'resource-cache' is a Unicode string (because of "from __future__ import unicode_literals" in the containing module) and that causes Python to try and convert the get_cache_base() result to Unicode using ASCII, which leads to the failure.
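
Why the str() wrapper helps can be sketched with a small emulation of Python 2's mixing rule. Everything here is hypothetical: py2_style_concat is an illustrative helper, not part of any library, and cache_base is an assumed value, since the real return value of get_cache_base() is not shown in this report.

```python
def py2_style_concat(left, right):
    """Emulate Python 2 concatenation of str (bytes) and unicode operands.

    Hypothetical helper for illustration only.
    """
    if isinstance(left, bytes) and isinstance(right, bytes):
        # Two byte strings: plain concatenation, nothing is decoded.
        return left + right
    # Otherwise any byte-string operand is promoted to text using ASCII.
    if isinstance(left, bytes):
        left = left.decode("ascii")
    if isinstance(right, bytes):
        right = right.decode("ascii")
    return left + right


# Assumed byte-string result of get_cache_base() (cp1250 code page).
cache_base = u"C:\\Users\\Jaros\u0142aw\\AppData".encode("cp1250")

# With str('resource-cache') both operands are byte strings.
ok = py2_style_concat(cache_base, b"\\resource-cache")

# With unicode_literals the literal is text, which forces an ASCII decode.
try:
    py2_style_concat(cache_base, u"\\resource-cache")
    mixed_failed = False
except UnicodeDecodeError:
    mixed_failed = True
```

In this emulation the all-bytes call succeeds and the mixed call raises UnicodeDecodeError, which is the difference the one-line change relies on.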
msg207603 - Author: Jarek Śmiejczak (Jarek.Śmiejczak) Date: 2014-01-07 21:15
@Vinay.Sajip
After making the change you suggested, I'm getting a different error:
---
C:\Users\Jarosław>pip install virtualenv
Downloading/unpacking virtualenv
  Running setup.py (path:c:\users\jarosa~1\appdata\local\temp\pip_build_Jaros│aw\virtualenv\setup.py) egg_info for package virtualenv

    warning: no previously-included files matching '*' found under directory 'docs\_templates'
    warning: no previously-included files matching '*' found under directory 'docs\_build'
Cleaning up...
Exception:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\pip\basecommand.py", line 122, in main
    status = self.run(options, args)
  File "c:\python27\lib\site-packages\pip\commands\install.py", line 270, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "c:\python27\lib\site-packages\pip\req.py", line 1211, in prepare_files
    req_to_install.assert_source_matches_version()
  File "c:\python27\lib\site-packages\pip\req.py", line 451, in assert_source_matches_version
    % (display_path(self.source_dir), version, self))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb3 in position 62: ordinal not in range(128)

Traceback (most recent call last):
  File "c:\python27\Scripts\pip-script.py", line 9, in <module>
    load_entry_point('pip==1.5', 'console_scripts', 'pip')()
  File "c:\python27\lib\site-packages\pip\__init__.py", line 185, in main
    return command.main(cmd_args)
  File "c:\python27\lib\site-packages\pip\basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb3 in position 77: ordinal not in range(128)

C:\Users\Jarosław>
---

It looks like pip needs a few more changes to solve this issue.
What's strange: in Windows 8.1, the home directory name is the first name saved in your Microsoft profile (if you log in with that profile, of course), so this should be a pretty common issue (I think).

Thanks for your fast reaction and support.
msg219232 - Author: honglei jiang (jhonglei) Date: 2014-05-27 17:16
Python:canopy-1.3.0.1715.win-x86_64\
OS:Win8.1 64

>>> directory
'F:\\Flask\\EmberJS\\\xd6\xd0\xce\xc4\\Prj\\static'
>>> os.path.isdir(directory)
True
>>> filename
u'todomvc/architecture-examples/angularjs/index.html'
>>> os.path.join(directory, filename)
Traceback (most recent call last):
  File "c:\Users\honglei\AppData\Local\Enthought\Canopy\User\Lib\site-packages\flask\helpers.py", line 1, in <module>
    # -*- coding: utf-8 -*-
  File "C:\Users\honglei\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.3.0.1715.win-x86_64\Lib\ntpath.py", line 108, in join
    path += "\\" + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 17: ordinal not in range(128)

>>> f = os.path.join(directory.decode(sys.getfilesystemencoding()), filename)
>>> os.path.isfile(f)
True
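
The workaround in the last two lines, decoding the byte-string directory before joining, can be sketched in a form that runs on any platform. The encoding is an assumption: the session used sys.getfilesystemencoding(), and "gbk" is assumed here only because the bytes in the directory name decode cleanly with it. ntpath is used directly so that Windows join semantics apply regardless of the host OS.

```python
import ntpath

# Values copied from the session above.
raw_directory = b"F:\\Flask\\EmberJS\\\xd6\xd0\xce\xc4\\Prj\\static"
filename = u"todomvc/architecture-examples/angularjs/index.html"

# Assumption: the file system encoding on the reporter's machine was GBK.
ASSUMED_FS_ENCODING = "gbk"

# Decode once at the boundary so that every component is text, then join.
directory = raw_directory.decode(ASSUMED_FS_ENCODING)
full_path = ntpath.join(directory, filename)
```

Once both components are text, no implicit ASCII decode is needed and the join cannot raise UnicodeDecodeError.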
msg227696 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-27 16:30
This looks to me like a documentation issue. Unfortunately it is not explicitly documented that os.path.join() shouldn't mix str and unicode components (except ASCII-only str, such as '.').

There is a relevant note in the 3.x documentation. It should be adapted to 2.7.
msg234010 - Author: Lin Wei (Lin.Wei) Date: 2015-01-14 07:07
The patch (http://bugs.python.org/issue9291#msg206938) for #9291 actually helps with this issue, at least for me.

By the way, @Serhiy do you mean that the problem is merely documentation, while the implementation is alright?
msg236077 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-15 22:53
Yes, the implementation of os.path is alright. There is a bug in distlib.resources, and the os.path documentation is lacking.
msg236079 - Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2015-02-15 23:51
> There is a bug in distlib.resources.

As far as I know, this is no longer the case - a change was made in distlib.resources to get around the problem:

https://bitbucket.org/vinay.sajip/distlib/src/471427909ebbba2f4fa9f4cbc34f17bd2d31b8e3/distlib/resources.py?at=default#cl-31
msg276134 - Author: Robert Collins (rbcollins) * (Python committer) Date: 2016-09-12 23:28
Given two (or more) parameters where one is unicode and one is not, upcasting will occur multiple times in path.join on Windows:
 - '\\' is str and will cast up safely in all codecs
 - the other str (or bytes) parameter will be upcast using the default encoding (sys.getdefaultencoding()), which is usually ASCII on Windows

This will then fail when the str parameter is not valid ASCII.

From this we can conclude that this is a failure to use path.join correctly: if all the parameters passed in were unicode, no error would occur as only '\\' would be getting coerced to unicode.

The interesting question is why there was a str parameter that wasn't valid ASCII; and that lies with path.expanduser() which is returning a str for the non-ascii home directory.

Changing that to return unicode rather than a no-encoding specified str when HOME or HOMEPATH etc etc contain non-ascii characters is a change that would worry me - specifically that we'd encounter code that assumes it is always str, e.g. by calling path.join(expanduser('~fred'), '\xe1\xbd\x84D') which will then blow up.

Worth noting too is that 

 expanduser(u'~user/\u14ffd')

will also blow up in the same way in the same situation - as it ends up decoding the user home path when it concatenates userhome and path[i:].

So, what to do:
 - It might be worth testing a patch that changes expanduser to decode the environment variables - I'm not sure whether we'd want the filesystemencoding or the defaultencoding for handling these environment variables. Steve Dower probably knows :).
 - Or we say 'sorry, too hard in 2.7' and move on: join *itself* is fine here, given the limits of 2.7.
msg276147 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-13 02:02
> It might be worth testing a patch that changes expanduser to
> decode the environment variables

If expanduser() is passed a unicode path, it can use _winreg.ExpandEnvironmentStrings(u'%USERPROFILE%') instead of decoding os.environ['USERPROFILE']. In 2.7, os.environ is a lossy ANSI encoding of the native Unicode environment block.
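
A rough sketch of that idea follows. It is not a proposed patch: user_profile_text is a hypothetical helper name, and the fallback to os.path.expanduser() on non-Windows systems is an assumption added only so the example runs anywhere. The module is named _winreg on Python 2.7 and winreg on Python 3; the sketch imports the Python 3 name.

```python
import os

try:
    import winreg  # named _winreg on Python 2.7
except ImportError:
    # Not on Windows: the registry module is unavailable.
    winreg = None


def user_profile_text():
    """Return the user's profile directory as text (hypothetical helper)."""
    if winreg is not None:
        # Expands %USERPROFILE% through the native Unicode API instead of
        # going through a byte-string copy of the environment.
        return winreg.ExpandEnvironmentStrings(u"%USERPROFILE%")
    # Assumed fallback for other platforms, for illustration only.
    return os.path.expanduser(u"~")


profile = user_profile_text()
```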
History
Date User Action Args
2016-09-13 02:02:25 eryksun set nosy: + eryksun
messages: + msg276147
2016-09-12 23:28:56 rbcollins set nosy: + rbcollins, steve.dower
messages: + msg276134
2015-02-15 23:51:18 vinay.sajip set messages: + msg236079
2015-02-15 22:53:08 serhiy.storchaka set messages: + msg236077
2015-01-14 07:07:25 Lin.Wei set nosy: + Lin.Wei
messages: + msg234010
2014-09-27 16:30:33 serhiy.storchaka set assignee: docs@python
type: crash -> behavior
components: + Documentation, - Windows
keywords: + easy
nosy: + serhiy.storchaka, docs@python
messages: + msg227696
stage: needs patch
2014-05-27 17:16:23 jhonglei set nosy: + jhonglei
messages: + msg219232
2014-01-07 21:15:45 Jarek.Śmiejczak set messages: + msg207603
2014-01-06 12:11:00 vinay.sajip set messages: + msg207429
2014-01-06 11:07:58 vinay.sajip set messages: + msg207428
2014-01-06 10:47:29 vstinner set nosy: + vinay.sajip, vstinner, ncoghlan
messages: + msg207425
2014-01-06 10:43:51 Jarek.Śmiejczak create