classification
Title: Cannot access to customized paths within .pth file
Type: behavior Stage:
Components: Windows Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Valentin Zhao, Windson Yang, brett.cannon, jason.coombs, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords: easy

Created on 2018-11-01 09:56 by Valentin Zhao, last changed 2018-11-29 14:57 by vstinner.

Files
File name Uploaded Description
IMG_20181101_173328_[B@ae031df.jpg Valentin Zhao, 2018-11-01 09:56
Messages (11)
msg329050 - (view) Author: Valentin Zhao (Valentin Zhao) Date: 2018-11-01 09:56
I want to keep all the packages I install in one place, so every time I add a package I pass "--target" and the package is downloaded there. I then wrote that directory into a .pth file located in "/Python36/Lib/site-packages" so that I can still access all the packages even though they are not inside the "Python36" folder.

However, my current Windows user name is Chinese, which means the customized path mentioned above contains Chinese characters, so the .pth file will also be encoded with 'gbk'. Every time I try to import these packages, I get "UnicodeDecodeError: 'gbk' can't decode byte xxx...".

Fortunately, I have found the cause and worked around the problem: Python reads .pth files without specifying any encoding. The code is located in "Python36/Lib/site.py":

def addpackage(sitedir, name, known_paths):
    if known_paths is None:
        known_paths = _init_pathinfo()
        reset = True
    else:
        reset = False
    fullname = os.path.join(sitedir, name)
    try:
        # this call should also pass encoding='utf-8'
        f = open(fullname, "r")
    except OSError:
        return
    # rest of the function

After I made this change, everything works.
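For illustration, here is a minimal sketch of the change described above: reading a .pth file with an explicit encoding instead of the platform default. The helper name `read_pth_entries`, the file name `custom.pth`, and the example path are hypothetical, and this is not the actual code in site.py.

```python
import os
import tempfile


def read_pth_entries(fullname, encoding="utf-8"):
    """Return the entries of a .pth file, decoded with an explicit encoding."""
    entries = []
    # Passing encoding= keeps the result independent of the locale (e.g. 'gbk').
    with open(fullname, "r", encoding=encoding) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comment lines.
            if line and not line.startswith("#"):
                entries.append(line)
    return entries


def demo():
    # Hypothetical target directory containing Chinese characters.
    target = "C:\\Users\\中文\\pylibs"
    with tempfile.TemporaryDirectory() as tmp:
        pth = os.path.join(tmp, "custom.pth")
        with open(pth, "w", encoding="utf-8") as f:
            f.write("# custom package directory\n" + target + "\n")
        return read_pth_entries(pth)
```

This only works if the .pth file really is saved as UTF-8; a file saved in the locale encoding would need that encoding passed instead.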
msg329172 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-11-02 23:40
Can you save your file in gbk encoding? That will be an immediate fix.

I don't know that we can/should change the encoding we read without checking with everyone who writes out .pth files. (+Jason as a start here, but I suspect there are more tools that write them.)

We could add a handler for UnicodeDecodeError that falls back on utf-8? I think that's reasonable.
msg329173 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-11-02 23:40
I'll mark this easy as well, since adding that handler is straightforward. Unless someone knows a reason we shouldn't do that either.
msg329178 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-03 03:47
Hello Valentin Zhao, do you have time to fix it? If not, I can create a PR.
msg329198 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2018-11-03 14:05
I'm only aware of one tool that writes .pth files, and that's setuptools, and it always writes ASCII (assuming package names are ASCII), so any encoding handling should be fine there.

> We could add a handler for UnicodeDecodeError that falls back on utf-8?

Yes, reasonable, but maybe we should instead consider _preferring_ UTF-8 and falling back to default encodings. That would be my preference.
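A rough sketch of the "prefer UTF-8, then fall back" order could look like the following. The function name `read_pth_text` and its `fallback` parameter are made up for illustration, and using `locale.getpreferredencoding(False)` as the "default encoding" is an assumption.

```python
import locale


def read_pth_text(path, fallback=None):
    """Decode a .pth file as UTF-8 first, then as the fallback encoding."""
    if fallback is None:
        # Assumption: the "default encoding" is the locale's preferred one.
        fallback = locale.getpreferredencoding(False)
    # Read raw bytes so the same data can be decoded more than once.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: retry with the local encoding.
        return raw.decode(fallback)
```

Swapping the two decode attempts gives the other order discussed here (default encoding first, UTF-8 as the fallback).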
msg329199 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2018-11-03 14:12
Also, I would argue that this is an enhancement request and not a bug - that the prior expectation was that the .pth file is encoded in whatever encoding the system expects by default, and that adding support for a standardized encoding for .pth files is a new feature.

As another aside: Valentin, the technique you're using to manage packages is likely to run into issues with certain packages - in particular any packages that rely on their own `.pth` files to invoke behavior, such as future_fstrings (https://pypi.org/project/future-fstrings/). I learned about this issue in (https://github.com/jaraco/rwt/issues/29), which is why the rwt project adds a `sitecustomize.py` to the target directory that ensures .pth files are run. Just FYI.
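As a hedged illustration of the `sitecustomize.py` idea (not necessarily what rwt does), the standard library's `site.addsitedir()` both adds a directory to `sys.path` and processes the .pth files inside it. The helper name `activate_target` is hypothetical, and the sketch assumes the target directory is already importable, for example via PYTHONPATH or a .pth entry.

```python
import os
import site


def activate_target(target):
    """Put target on sys.path and process the .pth files inside it."""
    # site.addsitedir() appends the directory to sys.path and then handles
    # every *.pth file it finds there, unlike a plain sys.path.append().
    site.addsitedir(target)


# In a sitecustomize.py placed inside the target directory, the call could be:
# activate_target(os.path.dirname(os.path.abspath(__file__)))
```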
msg329497 - (view) Author: Valentin Zhao (Valentin Zhao) Date: 2018-11-09 06:42
I would rather just wait for you guys to fix it, since it is not urgent.
msg329498 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-09 06:58
I tried to create a PR for it. However, I don't know how to handle the code at https://github.com/python/cpython/blob/d4c76d960b/Lib/site.py#L159

How can we detect a UnicodeDecodeError when we only open the file? I used readlines(), but it may use more memory than before (I'm not sure whether that matters in this case).

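    try:
        # Decoding happens while reading, so read inside the try block.
        with open(fullname, "r") as f:
            data = f.readlines()
    except UnicodeDecodeError:
        # The default encoding failed: retry the whole file as UTF-8.
        with open(fullname, "r", encoding="utf-8") as f:
            data = f.readlines()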
msg330058 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2018-11-18 18:42
The problem you've encountered is that previously the file was assumed to be one encoding and would fail if it was not that encoding... so it was possible to lazy-load the file and process each line.

In the new model, where you need to evaluate the viability of the file in one of two candidate encodings, you'll necessarily need to read the entire file once before processing its contents.

Therefore, I recommend one of these options:

1. Always read the file in binary mode, ascertain the "best" encoding, then rewind the file and wrap it in a TextIOWrapper for that encoding. Presumably this logic is common--perhaps there's already a routine that does just that.
2. In a try/except block, read the entire content, decoded, into another iterable ... and then have the logic below rely on that content. i.e. `f = list(f)`.
3. Always assume UTF-8 instead of the system encoding. This change would be backward incompatible, so probably isn't acceptable without at least an interim release with a deprecation warning.

I recommend a combination of (1) and then (3) in the future. That is:

def determine_best_encoding(f, encodings=('utf-8', sys.getdefaultencoding())):
    """
    Attempt to read and decode all of stream f using the encodings
    and return the first one that succeeds. Rewinds the file.
    """


f = open(..., 'rb')
encoding = determine_best_encoding(f)
if encoding != 'utf-8':
    warnings.warn("Detected pth file with unsupported encoding", DeprecationWarning)
f = io.TextIOWrapper(f, encoding)


Then, in a future version, dropping support for local encodings, all of that code can be replaced with `f = open(..., encoding='utf-8')`.
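The outline above could be filled in roughly as follows. This is a sketch under stated assumptions, not a proposed patch: `open_pth` is a hypothetical wrapper, and because `sys.getdefaultencoding()` always returns 'utf-8' on Python 3, the sketch assumes `locale.getpreferredencoding(False)` is what is meant by the local encoding.

```python
import io
import locale
import warnings


def determine_best_encoding(f, encodings=None):
    """Return the first encoding that decodes all of binary stream f.

    Rewinds the stream before returning.
    """
    if encodings is None:
        # Assumption: the second candidate is the locale's preferred encoding.
        encodings = ("utf-8", locale.getpreferredencoding(False))
    data = f.read()
    f.seek(0)
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise UnicodeError("none of %r can decode the stream" % (encodings,))


def open_pth(fullname, encodings=None):
    """Open a .pth file as text, warning when it is not UTF-8."""
    f = open(fullname, "rb")
    try:
        encoding = determine_best_encoding(f, encodings)
    except Exception:
        f.close()
        raise
    if encoding != "utf-8":
        warnings.warn("Detected pth file with unsupported encoding",
                      DeprecationWarning)
    # Wrap the rewound binary stream so callers can iterate over text lines.
    return io.TextIOWrapper(f, encoding)
```

Reading the whole file up front is the cost of trying more than one encoding; .pth files are normally small, so that is unlikely to matter in practice.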
msg330113 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2018-11-19 19:52
There is no "find best encoding" code, which is why so much code out there uses chardet. :)

This might also tie into issue #33944 and the idea of rethinking .pth files.
msg330201 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-21 13:23
I will fix this issue once we have consensus on the future of .pth files in #33944.
History
Date                 User           Action  Args
2018-11-29 14:57:52  vstinner       set     nosy: + vstinner
2018-11-21 13:23:17  Windson Yang   set     messages: + msg330201
2018-11-19 19:52:53  brett.cannon   set     nosy: + brett.cannon; messages: + msg330113
2018-11-18 18:42:29  jason.coombs   set     messages: + msg330058
2018-11-09 06:58:59  Windson Yang   set     messages: + msg329498
2018-11-09 06:42:13  Valentin Zhao  set     messages: + msg329497
2018-11-03 14:12:33  jason.coombs   set     messages: + msg329199
2018-11-03 14:05:00  jason.coombs   set     messages: + msg329198
2018-11-03 03:47:09  Windson Yang   set     nosy: + Windson Yang; messages: + msg329178
2018-11-02 23:40:58  steve.dower    set     keywords: + easy; messages: + msg329173; versions: + Python 3.7, Python 3.8
2018-11-02 23:40:16  steve.dower    set     nosy: + jason.coombs; messages: + msg329172
2018-11-02 20:09:11  ned.deily      set     nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows, - Library (Lib)
2018-11-01 09:56:37  Valentin Zhao  create