classification
Title: Py_Initialize() and Py_Main() should not enable C locale coercion
Type: Stage: patch review
Components: Interpreter Core Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ncoghlan, ned.deily
Priority: normal Keywords: patch

Created on 2018-09-05 19:36 by vstinner, last changed 2018-11-06 11:07 by ncoghlan.

Pull Requests
URL Status Linked Edit
PR 9073 merged vstinner, 2018-09-05 19:40
PR 9257 open ncoghlan, 2018-09-13 15:16
PR 9371 merged vstinner, 2018-09-17 21:17
PR 9378 merged vstinner, 2018-09-17 23:45
PR 9379 merged vstinner, 2018-09-18 00:42
PR 9382 closed vstinner, 2018-09-18 02:05
PR 9416 merged vstinner, 2018-09-19 11:08
PR 9430 merged vstinner, 2018-09-19 21:23
Messages (33)
msg324649 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-05 19:36
According to Nick Coghlan, author of the PEP 538, the C locale coercion should not be enabled by Py_Initialize() and Py_Main(), only by the python3 program.

Attached PR fix this issue.
msg325008 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-11 14:20
The entire change affecting the PEP 538 implementation in https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687#diff-8c018c3ada66d06c8e101e47a313c2c7 needs to be reverted: the locale should be coerced before *ANY* calls are made to Py_DecodeLocale, as the whole point of the architecture of the PEP implementation was to ensure that *nothing ever gets decoded incorrectly in the first place*.

I just initially missed that in Victor's enthusiasm for fixing the incorrect decodings that can arise in the absence of locale coercion he'd *introduced* incorrect decoding into the locale coercion case.
msg325009 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-11 14:28
(The one exception to "nothing gets decoded incorrectly" is that PYTHONCOERCECLOCALE itself is always interpreted as an ASCII field: the values that it checks for are actually ASCII byte sequences, not Unicode code points.

The documentation could definitely be much clearer on that point though, as even in the PEP it's only implied by the final paragraph in https://www.python.org/dev/peps/pep-0538/#legacy-c-locale-coercion-in-the-standalone-python-interpreter-binary which is mostly talking about the fact that this particular environment variable is still checked, even if you pass the -I or -E command line options.
msg325010 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-11 14:38
FWIW, this is also why I don't want PEP 432 to expose any wstr fields in the configuration settings: it means embedding applications have to figure out which encoding Python is expecting to be used for those fields.

Everything should be much cleaner if the public API requires those settings to be provided as Py_Unicode objects, and the core configuration step takes care of getting the C API to the point where the embedding application can create Py_Unicode objects for passing to later steps.

The fact that CPython itself currently uses wstr to configure these settings should be considered a legacy implementation detail that we do our best to hide from the public API.
msg325093 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-11 23:37
> (...) the locale should be coerced before *ANY* calls are made to Py_DecodeLocale, as the whole point of the architecture of the PEP implementation was to ensure that *nothing ever gets decoded incorrectly in the first place*.

See my comment:
https://github.com/python/cpython/pull/9073#issuecomment-420461537
msg325097 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-11 23:48
> (...) The documentation could definitely be much clearer on that point though, as even in the PEP it's only implied by the final paragraph in https://www.python.org/dev/peps/pep-0538/#legacy-c-locale-coercion-in-the-standalone-python-interpreter-binary which is mostly talking about the fact that this particular environment variable is still checked, even if you pass the -I or -E command line options.

That's a separated issue: I created bpo-34639.
msg325098 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-11 23:51
Example of C locale coercion with Python 3.7:

$ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' 'hé'
'h\udcc3\udca9'

$ env -i PYTHONCOERCECLOCALE=1 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' 'hé'
'h\xe9'

It works as expected no? What's wrong with the current implementation according to you, Nick?
msg325242 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-13 12:46
Reviewing the way things are now, I don't think it's worth rearranging this yet again for 3.7 - instead, I'll just put it back the way I intended it to be (and the way the PEP describes) for 3.8.

The issue is that Victor had good reasons for moving the handling of the "PYTHONCOERCECLOCALE=warn" case to be after the low level C stdio streams were properly initialised, and retaining that aspect of the changes while still moving the C locale coercion back to the beginning of the program means changing the signature of _Py_CoerceLegacyLocale to return a stateless static string to be printed to stderr later.
msg325246 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-13 14:15
The actual functional error is that the following will currently give different outputs on Fedora and CentOS 7, whereas in the original PEP 538 implementation it would always print "C", even if locale coercion would otherwise normally work on your current platform:

    LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E -c 'import os; print(os.env["LC_CTYPE"])'


As per the comment that was deleted in https://github.com/python/cpython/commit/9454060e84a669dde63824d9e2fcaf295e34f687#diff-8c018c3ada66d06c8e101e47a313c2c7, that was an intentional behaviour to allow developers testing scripts that are invoked using -E and -I to pretend that their local C.UTF-8 locale didn't exist when trying to reproducing behaviour otherwise seen only when locale coercion fails due to the lack of a suitable coercion target.

I'm working on an alternate PR that restores that aspect of the original behaviour, while preserving all the other improvements you made (such as ensuring that emitting the warning gets delayed until after the C level stderr is properly initialised)
msg325256 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-13 15:22
Alternate PR is up at https://github.com/python/cpython/pull/9257

There's one setting that remains in CoreConfig: "warn_on_c_locale", which _Py_UnixMain also uses to decide whether or not to actually emit the passed in locale coercion warning.

The coercion warning information itself moves out of CoreConfig, and into the private _PyMain helper struct.

The signature of _Py_CoerceLegacyLocale changed to both return an integer indicating whether or not coercion took place, as well as supporting two output variables that allow the display of the coercion warning message to be deferred (or skipped entirely).
msg325282 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-13 18:50
> LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E ...

The UTF-8 Mode has the same behavior: PYTHONUTF8 env var is ignored when using -E, but enabled by the C locale: you can use -X utf8=0 in that case to ensure that the UTF-8 Mode is disabled.

I wanted to propose you to add a new -X option, ex: -X coerce_c_locale=0, which would have the priority over the environment variable. I'm not a big fan of env vars, it's not always convenient to use them. I prefer an env var, but both are even better :-)
msg325493 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-16 15:02
I'd strongly prefer to just go back to the PEP 538 design. It's much simpler to implement, we don't actually want anyone turning off locale coercion except for debugging purposes (unlike UTF-8 mode), and the only argument against doing this the way PEP 538 describes is a purist one, not a practical one (which was already resolved in favour of practicality when PEP 538 was accepted).

However, if you were willing to do the updates on my PR branch to implement it, then I'd accept an alternative that added a pass through the command line arguments that checks for at least one of the two following cases after a legacy locale has been detected:

- the new option "-X coerce_c_locale=0" (ASCII) is present in the arg list
- neither "-E" nor "-I" (ASCII) are present in the arg list, and PYTHONCOERCECLOCALE is not set to zero

The code that sets warn_on_c_locale in the core config would then look for `-X coerce_c_locale=warn` in addition to looking for `PYTHONCOERCECLOCALE=warn`.

That's quite a bit of code to add for the sake of a flag we don't really want anyone to ever use, though. (If it hadn't been for the debugging-CentOS-7-problems-on-Fedora issue, I doubt I would have included the off switch in PEP 538 at all)
msg325554 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-17 16:55
> I'd strongly prefer to just go back to the PEP 538 design. It's much simpler to implement, we don't actually want anyone turning off locale coercion except for debugging purposes (unlike UTF-8 mode), and the only argument against doing this the way PEP 538 describes is a purist one, not a practical one (which was already resolved in favour of practicality when PEP 538 was accepted).
> (...)
> That's quite a bit of code to add for the sake of a flag we don't really want anyone to ever use, though. (If it hadn't been for the debugging-CentOS-7-problems-on-Fedora issue, I doubt I would have included the off switch in PEP 538 at all)

Why did you add PYTHONCOERCECLOCALE=warn if you don't want that people use it?

> The actual functional error is that the following will currently give different outputs on Fedora and CentOS 7, whereas in the original PEP 538 implementation it would always print "C", even if locale coercion would otherwise normally work on your current platform:
>
>    LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E -c 'import os; print(os.env["LC_CTYPE"])'

You wrote that we need PYTHONCOERCECLOCALE=0 and then that we don't need it, I don't understand :-)

IMHO there are use cases where PYTHONCOERCECLOCALE=0 is important, that's why I propose to add -X coerce_c_locale=0.

There are also cases where you want the warning, so -X coerce_c_locale=warn would help as well, when -E or -I is needed.
msg325588 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-17 22:13
New changeset 188ebfa475a6f6aa2d0ea14ca8e1fbe7865b6d27 by Victor Stinner in branch 'master':
bpo-34589: Make _PyCoreConfig.coerce_c_locale private (GH-9371)
https://github.com/python/cpython/commit/188ebfa475a6f6aa2d0ea14ca8e1fbe7865b6d27
msg325597 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-17 23:22
New changeset 7a0791b6992d420dc52536257f2f093851ed7215 by Victor Stinner in branch 'master':
bpo-34589: C locale coercion off by default (GH-9073)
https://github.com/python/cpython/commit/7a0791b6992d420dc52536257f2f093851ed7215
msg325600 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 00:19
New changeset dbdee0073cf0b88fe541980ace1f650900f455cc by Victor Stinner in branch 'master':
bpo-34589: Add -X coerce_c_locale command line option (GH-9378)
https://github.com/python/cpython/commit/dbdee0073cf0b88fe541980ace1f650900f455cc
msg325605 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 01:01
New changeset 144f1e2c6f4a24bd288c045986842c65cc289684 by Victor Stinner in branch '3.7':
[3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default (GH-9379)
https://github.com/python/cpython/commit/144f1e2c6f4a24bd288c045986842c65cc289684
msg325606 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-09-18 01:45
I'm marking this as a 3.7.1 "release blocker" until Nick has had a chance to review the merged PR and he and Victor are in agreement that this is a satisfactory approach for 3.7.1.
msg325608 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 02:09
test_c_locale_coercion didn't test the -E option: I wrote PR 9382 to test it.
msg325709 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 06:53
Victor, no - you've completely broken PEP 538 now.

Please just give up, and implement the PEP as written, and stop trying to use your copious amounts of available development time to railroad me.
msg325710 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 06:54
3.7.1 should not ship until PEP 538 is once more implemented as documented, without Victor's personal editorialising and feature additions to a maintenance release.
msg325712 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 06:59
The only reason this got through my original review was because PEP 540 was implemented when I was going through a personal situation that lead to me quitting my job cold without a new one to go to, so my review of Victors changes to the PEP 538 implementation was pretty cursory.

Now that I've finally reviewed it properly, I can see those changes were fundamentally misguided, and completely missed the intent and purpose of the PEP - it's deliberately designed to work completely differently from PEP 540, but Victor's now turned it into a pale shadow of the latter. (Victor's patch even had to change the locale coercion tests because it broke the isolated mode behaviour described in the PEP)

I was trying to be nice about it, but given that Victor is starting to just go ahead and change things further, I'm going to go with the hardline veto option: make PEP 538 work the it is described in the PEP and stop trying to change it on an arbitrary whim without writing a new PEP to overrule the original one.
msg325713 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 07:00
(Note: I won't have time to work on this myself until this weekend at the earliest)
msg325714 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 07:04
test_c_locale_coercion *did* test the -E and -I options, by running everything in isolated mode. This was only broken by Victor's changes to the test suite (which broke locale coercion in isolated mode, and hence broke the tests).

As for why PYTHONCLOCALECOERCION=0 and PYTHONCLOCALECOERCION=warn exist even though I don't actually want people to use them, it's because they're a potentially necessary debugging tool if you're seeing an issue that you suspect may be due to locale coercion working on one platform (e.g Fedora), and not working on another (e.g. CentOS 7). You can't easily hide the C.UTF-8 locale from the Python interpreter on Fedora, but you *can* tell the interpreter not to use it.

UTF-8 mode is different, as folks may want to opt in to that regardless of their current locale setting.
msg325723 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 07:30
(The other reason this change should be reverted is because it added a new feature to Python 3.7.1, even though in the earlier discussion we had agreed to leave the as-shipped implementation in 3.7.0 alone for the rest of the Python 3.7.x releases, and only discuss how we wanted it to work in Python 3.8, precisely so that we could have the discussion on a timeline that was amenable to me, without impacting the 3.7.1 release schedule)
msg325748 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-19 11:48
The issue is now discussed on python-dev. My latest email:
https://mail.python.org/pipermail/python-dev/2018-September/155242.html
msg325783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-19 19:01
New changeset 95cc3ee00cfa079751ae2bb9a8d3387053b50489 by Victor Stinner in branch '3.7':
Revert "[3.7] bpo-34589: Add -X coerce_c_locale option; C locale coercion off by default (GH-9379)" (GH-9416)
https://github.com/python/cpython/commit/95cc3ee00cfa079751ae2bb9a8d3387053b50489
msg325806 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-19 21:56
New changeset 06e7608207daab9fb82d13ccf2d3664535442f11 by Victor Stinner in branch 'master':
Revert "bpo-34589: Add -X coerce_c_locale command line option (GH-9378)" (GH-9430)
https://github.com/python/cpython/commit/06e7608207daab9fb82d13ccf2d3664535442f11
msg325865 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-20 12:48
https://github.com/python/cpython/pull/9257 should be complete now, although I may have gone overboard on the documentation updates (as it seems to be unclear, at least to Victor, why PYTHONCOERCECLOCALE exists in the first place)
msg325866 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-20 12:50
(Also, the patch is currently still written on the assumption that it won't be backported to 3.7.1. I haven't checked if it would apply cleanly to 3.7.x, but I suspect not, given the magnitude of the startup changes targeting 3.8.0 since 3.7.0 was released)
msg325948 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-09-21 04:37
Thanks, Victor and Nick, for the work.  With the revert merged (GH-9430), this issue is no longer a release blocker for 3.7.1, correct?
msg325984 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-21 10:52
Correct - it won't change anything from 3.7.0, and even the original discrepancies relative to PEP 538 only affect applications that:

1. Are embedding a CPython runtime
2. *Aren't* already ensuring that they're running in a locale other than the C locale
3. Aren't able to set PYTHONCOERCECLOCALE=0 or LC_ALL=C in the environment prior to calling Py_Initialize() or Py_Main() in order to disable the locale coercion

(Because of the high quality of Victor's work on the UTF-8 mode implementation, it's practically impossible to externally distinguish between the behaviour of late coercion and the intended early coercion in the CPython CLI itself)
msg329351 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-11-06 11:07
The discussion with Victor on https://bugs.python.org/issue34914 highlighted the fact that it's OK to use 8-bit string comparisons to check for "-E", "-I", and a "-X coerce_legacy_c_locale=0" due to the fact that all encodings used as locale encodings are sufficiently ASCII compatible for that to work as desired.

So before proceeding with merging https://github.com/python/cpython/pull/9257, I'm going to review that possibility, and see how much code it would actually add to support an extra internal helper API like:

    int _Py_LegacyLocaleCoercionEnabled(int argv, char* argv[]);

That would then inspect the unprocessed 8-bit command line arguments, as well as the process environment, to determine whether or not locale coercion should be attempted, making the complete dance:

    _Py_SetLocaleFromEnv(LC_CTYPE);
    if (_Py_LegacyLocaleDetected() && _Py_LegacyLocaleCoercionEnabled(argc, argv))
    {
        _Py_CoerceLegacyLocale(&pymain.report_locale_coercion);
    }

If we wanted to officially expose this for embedding apps, the API would need a bit more thought, so it probably makes sense to wait and see if the nominally-private-but-exposed-to-the-linker approach is good enough in practice.
History
Date User Action Args
2018-11-06 11:07:28ncoghlansetmessages: + msg329351
2018-09-30 07:14:00ncoghlanlinkissue30672 dependencies
2018-09-21 14:22:05vstinnersetnosy: - vstinner
2018-09-21 10:52:49ncoghlansetpriority: release blocker -> normal

messages: + msg325984
2018-09-21 04:37:22ned.deilysetmessages: + msg325948
2018-09-20 12:50:14ncoghlansetmessages: + msg325866
2018-09-20 12:48:32ncoghlansetmessages: + msg325865
2018-09-19 23:05:10vstinnerlinkissue34639 superseder
2018-09-19 21:56:40vstinnersetmessages: + msg325806
2018-09-19 21:23:45vstinnersetpull_requests: + pull_request8847
2018-09-19 19:01:56vstinnersetmessages: + msg325783
2018-09-19 11:48:22vstinnersetmessages: + msg325748
2018-09-19 11:08:14vstinnersetpull_requests: + pull_request8836
2018-09-19 07:30:54ncoghlansetmessages: + msg325723
2018-09-19 07:04:42ncoghlansetmessages: + msg325714
2018-09-19 07:00:35ncoghlansetmessages: + msg325713
2018-09-19 06:59:12ncoghlansetmessages: + msg325712
2018-09-19 06:54:02ncoghlansetmessages: + msg325710
2018-09-19 06:53:14ncoghlansetmessages: + msg325709
2018-09-19 01:51:28ned.deilysetversions: + Python 3.7
2018-09-18 02:09:37vstinnersetmessages: + msg325608
2018-09-18 02:05:11vstinnersetpull_requests: + pull_request8806
2018-09-18 01:45:29ned.deilysetmessages: + msg325606
2018-09-18 01:42:57ned.deilysetpriority: normal -> release blocker
2018-09-18 01:01:44vstinnersetmessages: + msg325605
2018-09-18 00:42:19vstinnersetpull_requests: + pull_request8803
2018-09-18 00:19:32vstinnersetmessages: + msg325600
2018-09-17 23:45:56vstinnersetpull_requests: + pull_request8802
2018-09-17 23:22:33vstinnersetmessages: + msg325597
2018-09-17 22:13:21vstinnersetmessages: + msg325588
2018-09-17 21:17:49vstinnersetpull_requests: + pull_request8794
2018-09-17 16:55:04vstinnersetmessages: + msg325554
2018-09-16 15:02:07ncoghlansetmessages: + msg325493
2018-09-13 18:50:29vstinnersetmessages: + msg325282
2018-09-13 15:22:04ncoghlansetmessages: + msg325256
2018-09-13 15:16:20ncoghlansetpull_requests: + pull_request8689
2018-09-13 14:15:32ncoghlansetmessages: + msg325246
2018-09-13 12:46:30ncoghlansetpriority: release blocker -> normal

messages: + msg325242
versions: - Python 3.7
2018-09-11 23:51:13vstinnersetmessages: + msg325098
2018-09-11 23:48:06vstinnersetmessages: + msg325097
2018-09-11 23:37:18vstinnersetmessages: + msg325093
2018-09-11 14:38:51ncoghlansetmessages: + msg325010
2018-09-11 14:28:27ncoghlansetmessages: + msg325009
2018-09-11 14:20:56ncoghlansetpriority: normal -> release blocker
nosy: + ned.deily, ncoghlan
messages: + msg325008

2018-09-05 19:40:26vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request8532
2018-09-05 19:36:35vstinnercreate