New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Py_Initialize() and Py_Main() should not enable C locale coercion #78770
Comments
According to Nick Coghlan, author of the PEP-538, the C locale coercion should not be enabled by Py_Initialize() and Py_Main(), only by the python3 program. Attached PR fix this issue. |
The entire change affecting the PEP-538 implementation in 9454060#diff-8c018c3ada66d06c8e101e47a313c2c7 needs to be reverted: the locale should be coerced before *ANY* calls are made to Py_DecodeLocale, as the whole point of the architecture of the PEP implementation was to ensure that *nothing ever gets decoded incorrectly in the first place*. I just initially missed that in Victor's enthusiasm for fixing the incorrect decodings that can arise in the absence of locale coercion he'd *introduced* incorrect decoding into the locale coercion case. |
(The one exception to "nothing gets decoded incorrectly" is that PYTHONCOERCECLOCALE itself is always interpreted as an ASCII field: the values that it checks for are actually ASCII byte sequences, not Unicode code points. The documentation could definitely be much clearer on that point though, as even in the PEP it's only implied by the final paragraph in https://www.python.org/dev/peps/pep-0538/#legacy-c-locale-coercion-in-the-standalone-python-interpreter-binary which is mostly talking about the fact that this particular environment variable is still checked, even if you pass the -I or -E command line options. |
FWIW, this is also why I don't want PEP-432 to expose any wstr fields in the configuration settings: it means embedding applications have to figure out which encoding Python is expecting to be used for those fields. Everything should be much cleaner if the public API requires those settings to be provided as Py_Unicode objects, and the core configuration step takes care of getting the C API to the point where the embedding application can create Py_Unicode objects for passing to later steps. The fact that CPython itself currently uses wstr to configure these settings should be considered a legacy implementation detail that we do our best to hide from the public API. |
See my comment: |
That's a separated issue: I created bpo-34639. |
Example of C locale coercion with Python 3.7: $ env -i PYTHONCOERCECLOCALE=0 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' 'hé'
'h\udcc3\udca9'
$ env -i PYTHONCOERCECLOCALE=1 ./python -X utf8=0 -c 'import sys; print(ascii(sys.argv[1]))' 'hé'
'h\xe9' It works as expected no? What's wrong with the current implementation according to you, Nick? |
Reviewing the way things are now, I don't think it's worth rearranging this yet again for 3.7 - instead, I'll just put it back the way I intended it to be (and the way the PEP describes) for 3.8. The issue is that Victor had good reasons for moving the handling of the "PYTHONCOERCECLOCALE=warn" case to be after the low level C stdio streams were properly initialised, and retaining that aspect of the changes while still moving the C locale coercion back to the beginning of the program means changing the signature of _Py_CoerceLegacyLocale to return a stateless static string to be printed to stderr later. |
The actual functional error is that the following will currently give different outputs on Fedora and CentOS 7, whereas in the original PEP-538 implementation it would always print "C", even if locale coercion would otherwise normally work on your current platform: LC_CTYPE=C PYTHONCOERCECLOCALE=0 python3 -E -c 'import os; print(os.env["LC_CTYPE"])' As per the comment that was deleted in 9454060#diff-8c018c3ada66d06c8e101e47a313c2c7, that was an intentional behaviour to allow developers testing scripts that are invoked using -E and -I to pretend that their local C.UTF-8 locale didn't exist when trying to reproducing behaviour otherwise seen only when locale coercion fails due to the lack of a suitable coercion target. I'm working on an alternate PR that restores that aspect of the original behaviour, while preserving all the other improvements you made (such as ensuring that emitting the warning gets delayed until after the C level stderr is properly initialised) |
Alternate PR is up at #9257 There's one setting that remains in CoreConfig: "warn_on_c_locale", which _Py_UnixMain also uses to decide whether or not to actually emit the passed in locale coercion warning. The coercion warning information itself moves out of CoreConfig, and into the private _PyMain helper struct. The signature of _Py_CoerceLegacyLocale changed to both return an integer indicating whether or not coercion took place, as well as supporting two output variables that allow the display of the coercion warning message to be deferred (or skipped entirely). |
The UTF-8 Mode has the same behavior: PYTHONUTF8 env var is ignored when using -E, but enabled by the C locale: you can use -X utf8=0 in that case to ensure that the UTF-8 Mode is disabled. I wanted to propose you to add a new -X option, ex: -X coerce_c_locale=0, which would have the priority over the environment variable. I'm not a big fan of env vars, it's not always convenient to use them. I prefer an env var, but both are even better :-) |
I'd strongly prefer to just go back to the PEP-538 design. It's much simpler to implement, we don't actually want anyone turning off locale coercion except for debugging purposes (unlike UTF-8 mode), and the only argument against doing this the way PEP-538 describes is a purist one, not a practical one (which was already resolved in favour of practicality when PEP-538 was accepted). However, if you were willing to do the updates on my PR branch to implement it, then I'd accept an alternative that added a pass through the command line arguments that checks for at least one of the two following cases after a legacy locale has been detected:
The code that sets warn_on_c_locale in the core config would then look for That's quite a bit of code to add for the sake of a flag we don't really want anyone to ever use, though. (If it hadn't been for the debugging-CentOS-7-problems-on-Fedora issue, I doubt I would have included the off switch in PEP-538 at all) |
Why did you add PYTHONCOERCECLOCALE=warn if you don't want that people use it?
You wrote that we need PYTHONCOERCECLOCALE=0 and then that we don't need it, I don't understand :-) IMHO there are use cases where PYTHONCOERCECLOCALE=0 is important, that's why I propose to add -X coerce_c_locale=0. There are also cases where you want the warning, so -X coerce_c_locale=warn would help as well, when -E or -I is needed. |
I'm marking this as a 3.7.1 "release blocker" until Nick has had a chance to review the merged PR and he and Victor are in agreement that this is a satisfactory approach for 3.7.1. |
test_c_locale_coercion didn't test the -E option: I wrote PR 9382 to test it. |
Victor, no - you've completely broken PEP-538 now. Please just give up, and implement the PEP as written, and stop trying to use your copious amounts of available development time to railroad me. |
3.7.1 should not ship until PEP-538 is once more implemented as documented, without Victor's personal editorialising and feature additions to a maintenance release. |
The only reason this got through my original review was because PEP-540 was implemented when I was going through a personal situation that lead to me quitting my job cold without a new one to go to, so my review of Victors changes to the PEP-538 implementation was pretty cursory. Now that I've finally reviewed it properly, I can see those changes were fundamentally misguided, and completely missed the intent and purpose of the PEP - it's deliberately designed to work completely differently from PEP-540, but Victor's now turned it into a pale shadow of the latter. (Victor's patch even had to change the locale coercion tests because it broke the isolated mode behaviour described in the PEP) I was trying to be nice about it, but given that Victor is starting to just go ahead and change things further, I'm going to go with the hardline veto option: make PEP-538 work the it is described in the PEP and stop trying to change it on an arbitrary whim without writing a new PEP to overrule the original one. |
(Note: I won't have time to work on this myself until this weekend at the earliest) |
test_c_locale_coercion *did* test the -E and -I options, by running everything in isolated mode. This was only broken by Victor's changes to the test suite (which broke locale coercion in isolated mode, and hence broke the tests). As for why PYTHONCLOCALECOERCION=0 and PYTHONCLOCALECOERCION=warn exist even though I don't actually want people to use them, it's because they're a potentially necessary debugging tool if you're seeing an issue that you suspect may be due to locale coercion working on one platform (e.g Fedora), and not working on another (e.g. CentOS 7). You can't easily hide the C.UTF-8 locale from the Python interpreter on Fedora, but you *can* tell the interpreter not to use it. UTF-8 mode is different, as folks may want to opt in to that regardless of their current locale setting. |
(The other reason this change should be reverted is because it added a new feature to Python 3.7.1, even though in the earlier discussion we had agreed to leave the as-shipped implementation in 3.7.0 alone for the rest of the Python 3.7.x releases, and only discuss how we wanted it to work in Python 3.8, precisely so that we could have the discussion on a timeline that was amenable to me, without impacting the 3.7.1 release schedule) |
The issue is now discussed on python-dev. My latest email: |
#9257 should be complete now, although I may have gone overboard on the documentation updates (as it seems to be unclear, at least to Victor, why PYTHONCOERCECLOCALE exists in the first place) |
(Also, the patch is currently still written on the assumption that it won't be backported to 3.7.1. I haven't checked if it would apply cleanly to 3.7.x, but I suspect not, given the magnitude of the startup changes targeting 3.8.0 since 3.7.0 was released) |
Thanks, Victor and Nick, for the work. With the revert merged (GH-9430), this issue is no longer a release blocker for 3.7.1, correct? |
Correct - it won't change anything from 3.7.0, and even the original discrepancies relative to PEP-538 only affect applications that:
(Because of the high quality of Victor's work on the UTF-8 mode implementation, it's practically impossible to externally distinguish between the behaviour of late coercion and the intended early coercion in the CPython CLI itself) |
The discussion with Victor on https://bugs.python.org/issue34914 highlighted the fact that it's OK to use 8-bit string comparisons to check for "-E", "-I", and a "-X coerce_legacy_c_locale=0" due to the fact that all encodings used as locale encodings are sufficiently ASCII compatible for that to work as desired. So before proceeding with merging #9257, I'm going to review that possibility, and see how much code it would actually add to support an extra internal helper API like: int _Py_LegacyLocaleCoercionEnabled(int argv, char* argv[]); That would then inspect the unprocessed 8-bit command line arguments, as well as the process environment, to determine whether or not locale coercion should be attempted, making the complete dance:
If we wanted to officially expose this for embedding apps, the API would need a bit more thought, so it probably makes sense to wait and see if the nominally-private-but-exposed-to-the-linker approach is good enough in practice. |
If I've understood https://bugs.python.org/issue35290 and its resolution in f6e323c correctly, that's a concrete example of why I consider the early coercion approach that avoids any kind of redecoding dance significantly more robust against issues arising from future changes in the CPython startup sequence. I'm away for AWS Re:Invent until Dec 2, but after I get back, I'll finally sit down and take one last pass over this to see if I can come up with a version that Victor is actually willing to review and approve. |
C locale coercion is just one parameter used to select the Python Line 69 in 7f4ba4a
I don't see which problem you are trying to solve here. For me, PYTHONCOERCECLOCALE environment variable must be ignored when -I or That's also why PYTHONMALLOC only change the memory allocator late in the For the FreeBSD CURRENT bug, I recently modified Python to use ASCII if I am fixing Unicode bugs on weird cased before 3.0 and we are still not |
I still think the current Python 3.7 behaviour makes the CPython runtime a badly behaved C library, since we may end up changing the active libc locale even when it isn't the main application, and a change hasn't been explicitly requested by the embedding app. However, the only way to encounter that misbehaviour is to be running a Unicode-unfriendly embedding application that leaves the C locale in place on a Unicode-unfriendly operating system that doesn't set a more sensible locale in the first place. I've also gone through a few iterations on #9257 now, and respecting the -E and -I options *without* deferring locale coercion until the same point where UTF-8 mode is checked for turns out to require duplication of an irritatingly large amount of argument processing code (as even if all you're checking for is two options that don't take arguments, you still need to correctly skip over all the other permitted arguments, stop at the arguments that terminate the python option list, and find the options of interest even when they're in a combined option string like "-vI"). Accordingly, I no longer think it's worth pursuing this change purely to assuage my sense of responsibility to developers of applications embedding the Python runtime - instead, we can leave the Python 3.7.0 solution in place for 3.8 as well, until such time as we have embedding application authors actually reporting bug reports against the Python 3.7 behaviour. To be completely honest, I expect the odds of that actually happening in practice to be incredibly low, and even if it does happen, our answer may well be to point to the in-development multi-phase configuration API rather than allowing it to be disabled when using Py_Initialize() and/or Py_Main(). |
I fixed the issue in Python 3.8 with bpo-36443: commit d929f18
I'm not sure if it would be accceptable to the behavior in Python 3.7.x, especially without PEP-587 which targets Python 3.8. |
I'm inclined to leave 3.7 alone unless/until we get an actual bug report for the current behaviour - if an embedder finds that it just straight up doesn't work for them, then they can either resort to tinkering with the private APIs (since they won't change until Python 3.8, and at that point we expect to have a public API available), or else stick with Python 3.6 until later this year when Python 3.8 is available. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: