Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Py_Initialize() and Py_Main() should not enable C locale coercion #78770

Closed
vstinner opened this issue Sep 5, 2018 · 38 comments
Closed

Py_Initialize() and Py_Main() should not enable C locale coercion #78770

vstinner opened this issue Sep 5, 2018 · 38 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Sep 5, 2018

BPO 34589
Nosy @ncoghlan, @vstinner, @ned-deily, @koobs
PRs
  • bpo-34589: C locale coercion off by default #9073
  • WIP bpo-34589: Move locale coercion back to the CLI app #9257
  • bpo-34589: Make _PyCoreConfig.coerce_c_locale private #9371
  • bpo-34589: Add -X coerce_c_locale command line option #9378
  • [3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default #9379
  • bpo-34589: Add test_PYTHONCOERCECLOCALE_ignored() #9382
  • [3.7] Revert "[3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default (GH-9379)" #9416
  • Revert "bpo-34589: Add -X coerce_c_locale command line option (GH-9378)" #9430
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-12-19.10:48:18.727>
    created_at = <Date 2018-09-05.19:36:35.051>
    labels = ['interpreter-core', '3.8', 'type-bug', '3.7']
    title = 'Py_Initialize() and Py_Main() should not enable C locale coercion'
    updated_at = <Date 2019-04-07.11:58:09.393>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-04-07.11:58:09.393>
    actor = 'ncoghlan'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-12-19.10:48:18.727>
    closer = 'ncoghlan'
    components = ['Interpreter Core']
    creation = <Date 2018-09-05.19:36:35.051>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34589
    keywords = ['patch']
    message_count = 38.0
    messages = ['324649', '325008', '325009', '325010', '325093', '325097', '325098', '325242', '325246', '325256', '325282', '325493', '325554', '325588', '325597', '325600', '325605', '325606', '325608', '325709', '325710', '325712', '325713', '325714', '325723', '325748', '325783', '325806', '325865', '325866', '325948', '325984', '329351', '330378', '330380', '332123', '339476', '339574']
    nosy_count = 4.0
    nosy_names = ['ncoghlan', 'vstinner', 'ned.deily', 'koobs']
    pr_nums = ['9073', '9257', '9371', '9378', '9379', '9382', '9416', '9430']
    priority = 'normal'
    resolution = 'postponed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34589'
    versions = ['Python 3.7', 'Python 3.8']

    @vstinner
    Copy link
    Member Author

    vstinner commented Sep 5, 2018

    According to Nick Coghlan, author of the PEP-538, the C locale coercion should not be enabled by Py_Initialize() and Py_Main(), only by the python3 program.

    Attached PR fix this issue.

    @vstinner vstinner added 3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Sep 5, 2018
    @ncoghlan
    Copy link
    Contributor

    The entire change affecting the PEP-538 implementation in 9454060#diff-8c018c3ada66d06c8e101e47a313c2c7 needs to be reverted: the locale should be coerced before *ANY* calls are made to Py_DecodeLocale, as the whole point of the architecture of the PEP implementation was to ensure that *nothing ever gets decoded incorrectly in the first place*.

    I just initially missed that in Victor's enthusiasm for fixing the incorrect decodings that can arise in the absence of locale coercion he'd *introduced* incorrect decoding into the locale coercion case.

    @ncoghlan
    Copy link
    Contributor

    (The one exception to "nothing gets decoded incorrectly" is that PYTHONCOERCECLOCALE itself is always interpreted as an ASCII field: the values that it checks for are actually ASCII byte sequences, not Unicode code points.

    The documentation could definitely be much clearer on that point though, as even in the PEP it's only implied by the final paragraph in https://www.python.org/dev/peps/pep-0538/#legacy-c-locale-coercion-in-the-standalone-python-interpreter-binary which is mostly talking about the fact that this particular environment variable is still checked, even if you pass the -I or -E command line options.

    @ncoghlan
    Copy link
    Contributor

    FWIW, this is also why I don't want PEP-432 to expose any wstr fields in the configuration settings: it means embedding applications have to figure out which encoding Python is expecting to be used for those fields.

    Everything should be much cleaner if the public API requires those settings to be provided as Py_Unicode objects, and the core configuration step takes care of getting the C API to the point where the embedding application can create Py_Unicode objects for passing to later steps.

    The fact that CPython itself currently uses wstr to configure these settings should be considered a legacy implementation detail that we do our best to hide from the public API.

    @vstinner
    Copy link
    Member Author

    (...) the locale should be coerced before *ANY* calls are made to Py_DecodeLocale, as the whole point of the architecture of the PEP implementation was to ensure that *nothing ever gets decoded incorrectly in the first place*.

    See my comment:
    #9073 (comment)

    @vstinner
    Copy link
    Member Author

    (...) The documentation could definitely be much clearer on that point though, as even in the PEP it's only implied by the final paragraph in https://www.python.org/dev/peps/pep-0538/#legacy-c-locale-coercion-in-the-standalone-python-interpreter-binary which is mostly talking about the fact that this particular environment variable is still checked, even if you pass the -I or -E command line options.

    That's a separated issue: I created bpo-34639.

    @vstinner
    Copy link
    Member Author

    Example of C locale coercion with Python 3.7:

    $ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' ''
    'h\udcc3\udca9'
    
    $ env -i PYTHONCOERCECLOCALE=1 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' ''
    'h\xe9'

    It works as expected no? What's wrong with the current implementation according to you, Nick?

    @ncoghlan
    Copy link
    Contributor

    Reviewing the way things are now, I don't think it's worth rearranging this yet again for 3.7 - instead, I'll just put it back the way I intended it to be (and the way the PEP describes) for 3.8.

    The issue is that Victor had good reasons for moving the handling of the "PYTHONCOERCECLOCALE=warn" case to be after the low level C stdio streams were properly initialised, and retaining that aspect of the changes while still moving the C locale coercion back to the beginning of the program means changing the signature of _Py_CoerceLegacyLocale to return a stateless static string to be printed to stderr later.

    @ncoghlan
    Copy link
    Contributor

    The actual functional error is that the following will currently give different outputs on Fedora and CentOS 7, whereas in the original PEP-538 implementation it would always print "C", even if locale coercion would otherwise normally work on your current platform:

        LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E -c 'import os; print(os.env["LC_CTYPE"])'

    As per the comment that was deleted in 9454060#diff-8c018c3ada66d06c8e101e47a313c2c7, that was an intentional behaviour to allow developers testing scripts that are invoked using -E and -I to pretend that their local C.UTF-8 locale didn't exist when trying to reproducing behaviour otherwise seen only when locale coercion fails due to the lack of a suitable coercion target.

    I'm working on an alternate PR that restores that aspect of the original behaviour, while preserving all the other improvements you made (such as ensuring that emitting the warning gets delayed until after the C level stderr is properly initialised)

    @ncoghlan
    Copy link
    Contributor

    Alternate PR is up at #9257

    There's one setting that remains in CoreConfig: "warn_on_c_locale", which _Py_UnixMain also uses to decide whether or not to actually emit the passed in locale coercion warning.

    The coercion warning information itself moves out of CoreConfig, and into the private _PyMain helper struct.

    The signature of _Py_CoerceLegacyLocale changed to both return an integer indicating whether or not coercion took place, as well as supporting two output variables that allow the display of the coercion warning message to be deferred (or skipped entirely).

    @vstinner
    Copy link
    Member Author

    LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E ...

    The UTF-8 Mode has the same behavior: PYTHONUTF8 env var is ignored when using -E, but enabled by the C locale: you can use -X utf8=0 in that case to ensure that the UTF-8 Mode is disabled.

    I wanted to propose you to add a new -X option, ex: -X coerce_c_locale=0, which would have the priority over the environment variable. I'm not a big fan of env vars, it's not always convenient to use them. I prefer an env var, but both are even better :-)

    @ncoghlan
    Copy link
    Contributor

    I'd strongly prefer to just go back to the PEP-538 design. It's much simpler to implement, we don't actually want anyone turning off locale coercion except for debugging purposes (unlike UTF-8 mode), and the only argument against doing this the way PEP-538 describes is a purist one, not a practical one (which was already resolved in favour of practicality when PEP-538 was accepted).

    However, if you were willing to do the updates on my PR branch to implement it, then I'd accept an alternative that added a pass through the command line arguments that checks for at least one of the two following cases after a legacy locale has been detected:

    • the new option "-X coerce_c_locale=0" (ASCII) is present in the arg list
    • neither "-E" nor "-I" (ASCII) are present in the arg list, and PYTHONCOERCECLOCALE is not set to zero

    The code that sets warn_on_c_locale in the core config would then look for -X coerce_c_locale=warn in addition to looking for PYTHONCOERCECLOCALE=warn.

    That's quite a bit of code to add for the sake of a flag we don't really want anyone to ever use, though. (If it hadn't been for the debugging-CentOS-7-problems-on-Fedora issue, I doubt I would have included the off switch in PEP-538 at all)

    @vstinner
    Copy link
    Member Author

    I'd strongly prefer to just go back to the PEP-538 design. It's much simpler to implement, we don't actually want anyone turning off locale coercion except for debugging purposes (unlike UTF-8 mode), and the only argument against doing this the way PEP-538 describes is a purist one, not a practical one (which was already resolved in favour of practicality when PEP-538 was accepted).
    (...)
    That's quite a bit of code to add for the sake of a flag we don't really want anyone to ever use, though. (If it hadn't been for the debugging-CentOS-7-problems-on-Fedora issue, I doubt I would have included the off switch in PEP-538 at all)

    Why did you add PYTHONCOERCECLOCALE=warn if you don't want that people use it?

    The actual functional error is that the following will currently give different outputs on Fedora and CentOS 7, whereas in the original PEP-538 implementation it would always print "C", even if locale coercion would otherwise normally work on your current platform:

    LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E -c 'import os; print(os.env["LC_CTYPE"])'

    You wrote that we need PYTHONCOERCECLOCALE=0 and then that we don't need it, I don't understand :-)

    IMHO there are use cases where PYTHONCOERCECLOCALE=0 is important, that's why I propose to add -X coerce_c_locale=0.

    There are also cases where you want the warning, so -X coerce_c_locale=warn would help as well, when -E or -I is needed.

    @vstinner
    Copy link
    Member Author

    New changeset 188ebfa by Victor Stinner in branch 'master':
    bpo-34589: Make _PyCoreConfig.coerce_c_locale private (GH-9371)
    188ebfa

    @vstinner
    Copy link
    Member Author

    New changeset 7a0791b by Victor Stinner in branch 'master':
    bpo-34589: C locale coercion off by default (GH-9073)
    7a0791b

    @vstinner
    Copy link
    Member Author

    New changeset dbdee00 by Victor Stinner in branch 'master':
    bpo-34589: Add -X coerce_c_locale command line option (GH-9378)
    dbdee00

    @vstinner
    Copy link
    Member Author

    New changeset 144f1e2 by Victor Stinner in branch '3.7':
    [3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default (GH-9379)
    144f1e2

    @ned-deily
    Copy link
    Member

    I'm marking this as a 3.7.1 "release blocker" until Nick has had a chance to review the merged PR and he and Victor are in agreement that this is a satisfactory approach for 3.7.1.

    @vstinner
    Copy link
    Member Author

    test_c_locale_coercion didn't test the -E option: I wrote PR 9382 to test it.

    @ned-deily ned-deily added the 3.7 (EOL) end of life label Sep 19, 2018
    @ncoghlan
    Copy link
    Contributor

    Victor, no - you've completely broken PEP-538 now.

    Please just give up, and implement the PEP as written, and stop trying to use your copious amounts of available development time to railroad me.

    @ncoghlan
    Copy link
    Contributor

    3.7.1 should not ship until PEP-538 is once more implemented as documented, without Victor's personal editorialising and feature additions to a maintenance release.

    @ncoghlan
    Copy link
    Contributor

    The only reason this got through my original review was because PEP-540 was implemented when I was going through a personal situation that lead to me quitting my job cold without a new one to go to, so my review of Victors changes to the PEP-538 implementation was pretty cursory.

    Now that I've finally reviewed it properly, I can see those changes were fundamentally misguided, and completely missed the intent and purpose of the PEP - it's deliberately designed to work completely differently from PEP-540, but Victor's now turned it into a pale shadow of the latter. (Victor's patch even had to change the locale coercion tests because it broke the isolated mode behaviour described in the PEP)

    I was trying to be nice about it, but given that Victor is starting to just go ahead and change things further, I'm going to go with the hardline veto option: make PEP-538 work the it is described in the PEP and stop trying to change it on an arbitrary whim without writing a new PEP to overrule the original one.

    @ncoghlan
    Copy link
    Contributor

    (Note: I won't have time to work on this myself until this weekend at the earliest)

    @ncoghlan
    Copy link
    Contributor

    test_c_locale_coercion *did* test the -E and -I options, by running everything in isolated mode. This was only broken by Victor's changes to the test suite (which broke locale coercion in isolated mode, and hence broke the tests).

    As for why PYTHONCLOCALECOERCION=0 and PYTHONCLOCALECOERCION=warn exist even though I don't actually want people to use them, it's because they're a potentially necessary debugging tool if you're seeing an issue that you suspect may be due to locale coercion working on one platform (e.g Fedora), and not working on another (e.g. CentOS 7). You can't easily hide the C.UTF-8 locale from the Python interpreter on Fedora, but you *can* tell the interpreter not to use it.

    UTF-8 mode is different, as folks may want to opt in to that regardless of their current locale setting.

    @ncoghlan
    Copy link
    Contributor

    (The other reason this change should be reverted is because it added a new feature to Python 3.7.1, even though in the earlier discussion we had agreed to leave the as-shipped implementation in 3.7.0 alone for the rest of the Python 3.7.x releases, and only discuss how we wanted it to work in Python 3.8, precisely so that we could have the discussion on a timeline that was amenable to me, without impacting the 3.7.1 release schedule)

    @vstinner
    Copy link
    Member Author

    The issue is now discussed on python-dev. My latest email:
    https://mail.python.org/pipermail/python-dev/2018-September/155242.html

    @vstinner
    Copy link
    Member Author

    New changeset 95cc3ee by Victor Stinner in branch '3.7':
    Revert "[3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default (GH-9379)" (GH-9416)
    95cc3ee

    @vstinner
    Copy link
    Member Author

    New changeset 06e7608 by Victor Stinner in branch 'master':
    Revert "bpo-34589: Add -X coerce_c_locale command line option (GH-9378)" (GH-9430)
    06e7608

    @ncoghlan
    Copy link
    Contributor

    #9257 should be complete now, although I may have gone overboard on the documentation updates (as it seems to be unclear, at least to Victor, why PYTHONCOERCECLOCALE exists in the first place)

    @ncoghlan
    Copy link
    Contributor

    (Also, the patch is currently still written on the assumption that it won't be backported to 3.7.1. I haven't checked if it would apply cleanly to 3.7.x, but I suspect not, given the magnitude of the startup changes targeting 3.8.0 since 3.7.0 was released)

    @ned-deily
    Copy link
    Member

    Thanks, Victor and Nick, for the work. With the revert merged (GH-9430), this issue is no longer a release blocker for 3.7.1, correct?

    @ncoghlan
    Copy link
    Contributor

    Correct - it won't change anything from 3.7.0, and even the original discrepancies relative to PEP-538 only affect applications that:

    1. Are embedding a CPython runtime
    2. *Aren't* already ensuring that they're running in a locale other than the C locale
    3. Aren't able to set PYTHONCOERCECLOCALE=0 or LC_ALL=C in the environment prior to calling Py_Initialize() or Py_Main() in order to disable the locale coercion

    (Because of the high quality of Victor's work on the UTF-8 mode implementation, it's practically impossible to externally distinguish between the behaviour of late coercion and the intended early coercion in the CPython CLI itself)

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 6, 2018

    The discussion with Victor on https://bugs.python.org/issue34914 highlighted the fact that it's OK to use 8-bit string comparisons to check for "-E", "-I", and a "-X coerce_legacy_c_locale=0" due to the fact that all encodings used as locale encodings are sufficiently ASCII compatible for that to work as desired.

    So before proceeding with merging #9257, I'm going to review that possibility, and see how much code it would actually add to support an extra internal helper API like:

        int _Py_LegacyLocaleCoercionEnabled(int argv, char* argv[]);

    That would then inspect the unprocessed 8-bit command line arguments, as well as the process environment, to determine whether or not locale coercion should be attempted, making the complete dance:

    _Py_SetLocaleFromEnv(LC_CTYPE);
    if (_Py_LegacyLocaleDetected() && _Py_LegacyLocaleCoercionEnabled(argc, argv))
    {
        _Py_CoerceLegacyLocale(&pymain.report_locale_coercion);
    }
    

    If we wanted to officially expose this for embedding apps, the API would need a bit more thought, so it probably makes sense to wait and see if the nominally-private-but-exposed-to-the-linker approach is good enough in practice.

    @ncoghlan
    Copy link
    Contributor

    If I've understood https://bugs.python.org/issue35290 and its resolution in f6e323c correctly, that's a concrete example of why I consider the early coercion approach that avoids any kind of redecoding dance significantly more robust against issues arising from future changes in the CPython startup sequence.

    I'm away for AWS Re:Invent until Dec 2, but after I get back, I'll finally sit down and take one last pass over this to see if I can come up with a version that Victor is actually willing to review and approve.

    @vstinner
    Copy link
    Member Author

    C locale coercion is just one parameter used to select the Python
    filesystem enconding. There are many others which are read later during
    Python initialization, and they are inter-dependent.

    /* Python filesystem encoding and error handler:

    I don't see which problem you are trying to solve here.

    For me, PYTHONCOERCECLOCALE environment variable must be ignored when -I or
    -E is used, for security reasons and to get more deterministic behavior. So
    I am opposed to read/enable C locale coercion earlier.

    That's also why PYTHONMALLOC only change the memory allocator late in the
    Python initialisation, whereas it would be simpler to read/apply the option
    earlier.

    For the FreeBSD CURRENT bug, I recently modified Python to use ASCII if
    "force ASCII" mode is enabled. I made this change for... HP-UX! On FreeBSD,
    the filesystem encoding was already ASCII in practice (FreeBSD uses Latin1
    but announes ASCII!?). I am not surprised to see bugs on specific
    configuration with special option (disable UTF-8 mode) :-)

    I am fixing Unicode bugs on weird cased before 3.0 and we are still not
    done! Recently I fixed a bug in localeconv() when LC_MONETARY uses a
    different encoding than LC_CTYPE... Such config sounds like a bad choice,
    but well, it "works" on Python2...

    @ncoghlan
    Copy link
    Contributor

    I still think the current Python 3.7 behaviour makes the CPython runtime a badly behaved C library, since we may end up changing the active libc locale even when it isn't the main application, and a change hasn't been explicitly requested by the embedding app.

    However, the only way to encounter that misbehaviour is to be running a Unicode-unfriendly embedding application that leaves the C locale in place on a Unicode-unfriendly operating system that doesn't set a more sensible locale in the first place.

    I've also gone through a few iterations on #9257 now, and respecting the -E and -I options *without* deferring locale coercion until the same point where UTF-8 mode is checked for turns out to require duplication of an irritatingly large amount of argument processing code (as even if all you're checking for is two options that don't take arguments, you still need to correctly skip over all the other permitted arguments, stop at the arguments that terminate the python option list, and find the options of interest even when they're in a combined option string like "-vI").

    Accordingly, I no longer think it's worth pursuing this change purely to assuage my sense of responsibility to developers of applications embedding the Python runtime - instead, we can leave the Python 3.7.0 solution in place for 3.8 as well, until such time as we have embedding application authors actually reporting bug reports against the Python 3.7 behaviour.

    To be completely honest, I expect the odds of that actually happening in practice to be incredibly low, and even if it does happen, our answer may well be to point to the in-development multi-phase configuration API rather than allowing it to be disabled when using Py_Initialize() and/or Py_Main().

    @ncoghlan ncoghlan added the type-bug An unexpected behavior, bug, or error label Dec 19, 2018
    @vstinner
    Copy link
    Member Author

    vstinner commented Apr 5, 2019

    I fixed the issue in Python 3.8 with bpo-36443:

    commit d929f18
    Author: Victor Stinner <vstinner@redhat.com>
    Date: Wed Mar 27 18:28:46 2019 +0100

    bpo-36443: Disable C locale coercion and UTF-8 Mode by default (GH-12589)
    

    I'm not sure if it would be accceptable to the behavior in Python 3.7.x, especially without PEP-587 which targets Python 3.8.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Apr 7, 2019

    I'm inclined to leave 3.7 alone unless/until we get an actual bug report for the current behaviour - if an embedder finds that it just straight up doesn't work for them, then they can either resort to tinkering with the private APIs (since they won't change until Python 3.8, and at that point we expect to have a public API available), or else stick with Python 3.6 until later this year when Python 3.8 is available.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants