Py_DecodeLocale() fails if used before the runtime is initialized. #76277
Comments
(see the python-dev thread [1]) When I consolidated the global runtime state into a single global, _PyRuntime, calls to Py_DecodeLocale() started to break if the runtime hadn't been initialized yet. This is because that function relies on PyMem_RawMalloc() and PyMem_RawFree(), which rely on the raw allocator having been initialized as part of the runtime (it used to be initialized statically). The documentation for various "Process-wide parameters" [2] explicitly directs users to call Py_DecodeLocale() where necessary. The docs for Py_DecodeLocale(), in turn, explicitly refer to calling PyMem_RawFree(). So changing the pre-runtime-init behavior of Py_DecodeLocale() and PyMem_RawFree() is a regression that should be fixed. [1] https://mail.python.org/pipermail/python-dev/2017-November/150605.html |
Interesting, on 3.6.3 on my embedded program it seems to work just fine. Did anything change in it since then? https://github.com/AraHaan/Els_kom_new/blob/master/PC/komextract_new.c |
Duplicate of bpo-32086. |
While they're definitely closely related, I don't think this and bpo-32086 are actually duplicates: this issue is "fix the current Py_DecodeLocale regression in 3.7 by reverting back to the 3.6 behaviour", while bpo-32086 is "Should we deprecate our implied support for calling Py_DecodeLocale() before calling Py_Initialize()?". The latter is certainly a valid question, but we should restore the 3.6 behaviour while we're considering it. |
I see at least 3 ways to sort this out:
I considered implementing #3 instead, but wasn't sure about the performance impact. It would add a pointer comparison to NULL and a branch on each PyMem_RawMalloc() and PyMem_RawFree() call. I'm not convinced it would make much of a difference. Furthermore, I consider #3 to be the simplest solution, both to implement and to maintain, so I'll probably try it out. |
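The NULL check described for option #3 can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not CPython's code: `raw_allocator`, `runtime_raw`, `runtime_init()`, `raw_malloc()`, `raw_free()` and the counting allocator are all hypothetical names standing in for the runtime's allocator table and for PyMem_RawMalloc()/PyMem_RawFree().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the raw allocator table owned by the runtime. */
typedef struct {
    void *(*alloc_fn)(size_t size);
    void (*free_fn)(void *ptr);
} raw_allocator;

/* Zero-initialized at program start, so both pointers are NULL until
   runtime_init() fills them in. */
static raw_allocator runtime_raw;

/* Counts calls that went through the runtime-installed allocator. */
static size_t runtime_alloc_calls = 0;

static void *counting_alloc(size_t size)
{
    runtime_alloc_calls++;
    return malloc(size);
}

static void counting_free(void *ptr)
{
    free(ptr);
}

size_t runtime_alloc_count(void)
{
    return runtime_alloc_calls;
}

/* Stand-in for the part of runtime initialization that installs the
   allocator. */
void runtime_init(void)
{
    assert(runtime_raw.alloc_fn == NULL);  /* install only once */
    runtime_raw.alloc_fn = counting_alloc;
    runtime_raw.free_fn = counting_free;
}

void *raw_malloc(size_t size)
{
    if (size == 0) {
        size = 1;  /* always request at least one byte */
    }
    if (runtime_raw.alloc_fn == NULL) {
        return malloc(size);  /* hard-coded default before initialization */
    }
    return runtime_raw.alloc_fn(size);
}

void raw_free(void *ptr)
{
    if (runtime_raw.free_fn == NULL) {
        free(ptr);  /* hard-coded default before initialization */
        return;
    }
    runtime_raw.free_fn(ptr);
}
```

The extra cost on each call is the single pointer comparison and branch mentioned above. In this sketch a block allocated before `runtime_init()` can be released afterwards only because both paths end up in the C library's malloc()/free(); an installed allocator with a different block layout would not have that property.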
I thought issue bpo-32086 was about documentation (which is worth having a separate issue for), not about a fix to the regression. |
"3. use hard-coded defaults in PyMem_RawMalloc() and PyMem_RawFree() if the runtime has not been initialized yet" I dislike this option since it can have a negative impact on performance. PEP-445 already added a new level of indirection and so made memory allocations a little bit slower. |
I marked bpo-32096 as a duplicate of this one. I don't want to discuss the same issue in 3 places (2 bpo and python-dev). |
Nick: "Should we deprecate our implied support for calling Py_DecodeLocale() before calling Py_Initialize()?" Please don't do that. Py_DecodeLocale() is the best available function to decode paths and environment variables to initialize Python. The implementation of Py_DecodeLocale() is very complex, but we need this complexity to decode "correctly" OS data. |
As I explained, the code to initialize PyMem_Raw allocator is complex and I would really prefer to only initialize it "partially" to prevent bad surprises. |
IMHO for the long term, the best would be to have a structure for pre-Py_Initialize configuration, maybe _PyCoreConfig, and pass it to functions that can be called before Py_Initialize(). For example, I'm working on a variant of Py_GetPath() which accepts a _PyCoreConfig to be able to pass the value of the PYTHONPATH environment variable. That's a more complex solution, so I proposed to first revert, to have more time to design the "correct API". |
+1 As an alternative to that, we could also deprecate using any of those functions before initializing the runtime. Instead of calling them, you would set the relevant info on the runtime "config" struct that you pass to the replacement for Py_Initialize(). At that point we would not need some of those functions any longer and we could remove them (eventually, once backward-compatibility is resolved). Given that the community of CPython embedders is relatively small, we're still in a position to iron this out. Regardless, I see where you're coming from. I'm okay with reverting the Objects/obmalloc.c parts, but, like I said, I'd rather avoid it if possible. Solution #2 (that I listed above), AKA PR bpo-4481, is focused and effective. Unfortunately, it feels a bit like a hack to me, though it is a well-contained hack. So I'm not convinced it's the best solution. However, I like it as much as I like reverting the allocators. Solution #3, AKA PR bpo-4495, is nice and clean, but potentially adds a little overhead to all PyMem_RawMalloc() and PyMem_RawFree() calls. All the other PyMem_* functions are unaffected, so perhaps the overall impact is not significant enough to worry about.
The surprises would only be pre-initialization, right? After the runtime is initialized, the allocators are in the proper fully-initialized state. So it mostly boils down to what parts of the C-API embedders can use before initialization and how those functions interact with the raw memory allocator. Those constraints narrow down the scope of potential problems to a manageable size (I won't say small, but it feels that way). Ultimately, I favor solution #3 if we can see that it does not impact performance. If we can't come to an agreement in a timely fashion then I'll go along with #1 (revert), so that we don't leave the embedding story broken. If we go that route, do you think we could resolve the initialization issues within the 3.7 timeframe? |
Currently, _PyCoreConfig is not complete: you cannot pass PYTHONPATH or PYTHONHOME. I'm working on patches to implement that. Moreover, there is the question how to decode a bytes path (for PYTHONPATH) into a wchar_t* string. Disallow calling Py_DecodeLocale() before Py_Initialize(): ok, but which alternative do you propose to decode OS data? |
Victor, please stop trying to conflate the two questions of "How should we fix the current Py_DecodeLocale regression before 3.7.0a3?" and "What do we want to do long term?". They're far from being the same question, and answering the second one properly is going to be much harder and more involved than just doing the bare minimum needed to make the previously supported embedding logic work again (even if it means postponing some of the dynamic allocator changes we'd like to make). Omitting PYTHONHOME and PYTHONPATH from the core config is deliberate, as the interpreter doesn't support external imports yet when just the core has been initialized - only builtin and frozen ones. Anything related to external imports should ultimately end up in the main interpreter configuration: https://www.python.org/dev/peps/pep-0432/#supported-configuration-settings Longer term, I also want to rewrite getpath.c in Python (or at least primarily using Python lists and strings via the C API instead of relying on C arrays and C string manipulation functions). However, our work on refactoring Py_Main has also shown me that we're going to need some additional structs to hold the raw(ish) command line arguments and environment variables in order to easily pass them around to other internal configuration APIs. Modules/main.c already defines _Py_CommandLineDetails for the command line settings: https://github.com/python/cpython/blob/master/Modules/main.c#L382 We *don't* currently have anything like that for environment variables, not even the ones which are "read once at startup, then never read them again". |
Speaking of surprises with static initialization of the runtime allocations: both PRs are failing in CI, suggesting that the changes that Py_Initialize makes to the allocator settings aren't being reverted in Py_Finalize, so there's a mismatch between the allocation function and the deallocation function. |
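The kind of mismatch suspected here can be illustrated with a small self-contained sketch. It is not how CPython detects the problem: the `block_header` tag, `set_allocator()`, `tagged_malloc()` and `tagged_free()` are hypothetical names, and the two allocator ids merely stand in for "the allocator active before Py_Initialize" and "the allocator left active after Py_Finalize".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-block header recording which allocator produced the
   block. The union with max_align_t keeps the payload suitably aligned. */
typedef union {
    int owner;
    max_align_t align;
} block_header;

enum { ALLOC_DEFAULT = 1, ALLOC_DEBUG = 2 };

/* The allocator currently in effect; initialization may switch it. */
static int active_allocator = ALLOC_DEFAULT;

void set_allocator(int id)
{
    assert(id == ALLOC_DEFAULT || id == ALLOC_DEBUG);
    active_allocator = id;
}

/* Allocate a block and tag it with the allocator that is active now. */
void *tagged_malloc(size_t size)
{
    block_header *header = malloc(sizeof(block_header) + size);
    if (header == NULL) {
        return NULL;
    }
    header->owner = active_allocator;
    return header + 1;  /* payload starts right after the header */
}

/* Release a block. Returns true if the active allocator is the one that
   created it, false on a mismatch (the memory is released either way). */
bool tagged_free(void *ptr)
{
    if (ptr == NULL) {
        return true;
    }
    block_header *header = (block_header *)ptr - 1;
    bool matched = (header->owner == active_allocator);
    free(header);
    return matched;
}
```

If initialization calls `set_allocator(ALLOC_DEBUG)` and finalization does not switch back, a block obtained before initialization is reported as mismatched when it is freed afterwards, which mirrors the Py_DecodeLocale() / PyMem_RawFree() pairing around Py_Initialize() and Py_Finalize().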
I changed Py_Main() in bpo-32030. Now multiple environment variables are read once at startup and put into _PyCoreConfig: Lines 1365 to 1413 in 803ddd8
I added new fields to _PyCoreConfig: Lines 27 to 39 in 803ddd8
I suggest to continue to add more fields to _PyCoreConfig to move all code to configure Python before Py_Initialize(), and later to let users embedding Python to configure Python as they want, without losing features. For example, to enable the new "development mode" (-X dev), now you "just" have to set _PyCoreConfig.dev_mode to 1. |
I created a new PR adding a new _PyCoreConfig.pythonpath field. Once it is merged, I will work on a new PR for PYTHONHOME (add a new _PyCoreConfig.pythonhome field). |
Victor, I think you're fundamentally misunderstanding the goals of PEP-432. The entire point is to let people have a *mostly working Python runtime* during CPython startup. Moving everything that Py_Initialize needs to instead have to happen before Py_InitializeRuntime (aka _Py_CoreInitialize) defeats that point. CoreConfig should thus contain *as little as possible*, with most of the environmental querying work moving into Py_ReadMainInterpreterConfig. So could you please move everything you've added to CoreConfig (that isn't genuinely required from the moment the runtime starts doing anything) out again, and either put it into the main interpreter config as Python objects (as described in PEP-432), or else into a new intermediate configuration struct? |
Also, the basic rules of thumb I use for deciding whether or not a setting belongs in CoreConfig:
Everything else goes in MainInterpreterConfig as a real Python object. We may need other structs internally to help manage the way Py_Main populates MainInterpreterConfig, but those shouldn't be made a required part of the future public initialization API (although we may decide to expose them as "use them if you want to better emulate CPython's default behaviour" helper APIs). |
Follow up: this also came up on https://bugs.python.org/issue32030#msg306763, and I think Victor and I are on the same page now :) Since MainInterpreterConfig is currently still a private struct, we can store the existing C level config state directly in there for now, and then look at upgrading to Python types on a case by case basis. Once they're all both consolidated *and* upgraded, then we can consider making the new incremental configuration API public. |
"Victor, I think you're fundamentally misunderstanding the goals of PEP-432. The entire point is to let people have a *mostly working Python runtime* during CPython startup. (...)" While PEP-432 is nice, all changes are currently done in private APIs, symbols starting with _Py. I would prefer that nobody uses these new APIs before the conversion is complete. And from what I saw, I can say that the conversion has just started, and there are still a lot of changes that should be done. While having _PyRuntime.mem is nice to have in the long term, it doesn't add any value *right now*, except for making the existing C API harder to use. I would prefer to do things in this order:
Maybe we can complete these 3 steps before Python 3.7, but I'm not sure about that. |
Even the public implementation of PEP-432 is going to be bound by the requirement to keep existing embedding logic working, and that's encapsulated in the new test Eric added in his PR:

wchar_t *program = Py_DecodeLocale("spam", NULL);
Py_SetProgramName(program);
Py_Initialize();
Py_Finalize();
PyMem_RawFree(program);

So even if we were to revert the _PyRuntime.mem change in 3.7, we'd still face the same problem in 3.8, because we'd still be exposing the traditional configuration API - the new multi-step configuration API would be *optional* for folks that wanted to override the default settings more easily, rather than a backwards compatibility break with the previously supported way of doing things. As a result, my preferred option is now to make exactly the promises we need to ensure that the above code works correctly, and then have Py_Initialize and Py_Finalize enforce those constraints:
|
The newly added test failed on AMD64 Debian root 3.x: http://buildbot.python.org/all/#/builders/27/builds/226
======================================================================
Traceback (most recent call last):
File "/root/buildarea/3.x.angelico-debian-amd64/build/Lib/test/test_capi.py", line 602, in test_pre_initialization_api
out, err = self.run_embedded_interpreter("pre_initialization_api", env=env)
File "/root/buildarea/3.x.angelico-debian-amd64/build/Lib/test/test_capi.py", line 464, in run_embedded_interpreter
(p.returncode, err))
AssertionError: -6 != 0 : bad returncode -6, stderr is "Could not find platform independent libraries <prefix>\nCould not find platform dependent libraries <exec_prefix>\nConsider setting $PYTHONHOME to <prefix>[:<exec_prefix>]\nFatal Python error: initfsencoding: Unable to get the locale encoding\nModuleNotFoundError: No module named 'encodings'\n\nCurrent thread 0x00007f6456f8c700 (most recent call first):\n" |
We now check that Py_DecodeLocale() can be called before Py_Initialize(). IMHO we need to document this property in the documentation: I opened bpo-32124 and wrote a PR for that. |
The test also failed on x86 Tiger 3.x: http://buildbot.python.org/all/#/builders/30/builds/212
======================================================================
Traceback (most recent call last):
File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_capi.py", line 602, in test_pre_initialization_api
out, err = self.run_embedded_interpreter("pre_initialization_api", env=env)
File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_capi.py", line 464, in run_embedded_interpreter
(p.returncode, err))
AssertionError: -6 != 0 : bad returncode -6, stderr is "Could not find platform independent libraries <prefix>\nCould not find platform dependent libraries <exec_prefix>\nConsider setting $PYTHONHOME to <prefix>[:<exec_prefix>]\nFatal Python error: initfsencoding: unable to load the file system codec\nModuleNotFoundError: No module named 'encodings'\n\nCurrent thread 0xa000d000 (most recent call first):\n" |
Huh, those crashes are interesting - I'd guess that it means we have a platform-dependent dependency from Py_DecodeLocale on to Py_SetPythonHome in order to locate the encodings module. If I'm right, that dependency would then mean that embedding applications can only rely on Py_DecodeLocale to do "char *" to "wchar_t *" conversions if they can also rely on the locale encoding always being a builtin one that bypasses the search for the encodings module. Perhaps we should be recommending temporarily doing 'setenv("PYTHONHOME", home)' (and then reverting that after calling Py_Initialize so it doesn't get inherited by subprocesses) as the preferred approach to handling platforms with "char *" based native filesystem APIs, and adding such a setting to that particular |
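The "set PYTHONHOME temporarily, then revert it" idea floated in this comment could look roughly like the sketch below on a POSIX system. Everything here is an assumption made for illustration: `with_temp_env()` is a hypothetical helper, and `record_home()` is only a stand-in for the initialization call, which a real embedder would replace with Py_Initialize().

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run `init` with the environment variable `name`
   set to `value`, then restore the previous state so the setting is not
   inherited by subprocesses. Returns 0 on success, -1 on error. */
int with_temp_env(const char *name, const char *value, void (*init)(void))
{
    assert(name != NULL && value != NULL && init != NULL);

    /* Copy the old value: setenv() may invalidate the getenv() pointer. */
    const char *old = getenv(name);
    char *saved = NULL;
    if (old != NULL) {
        saved = malloc(strlen(old) + 1);
        if (saved == NULL) {
            return -1;
        }
        strcpy(saved, old);
    }

    if (setenv(name, value, 1) != 0) {
        free(saved);
        return -1;
    }

    init();  /* a real embedder would call Py_Initialize() here */

    /* Put the variable back exactly as it was before. */
    int rc = (saved != NULL) ? setenv(name, saved, 1) : unsetenv(name);
    free(saved);
    return rc;
}

/* Stand-in for initialization: records what PYTHONHOME looked like. */
char seen_home[256];

void record_home(void)
{
    const char *home = getenv("PYTHONHOME");
    strncpy(seen_home, home != NULL ? home : "", sizeof seen_home - 1);
    seen_home[sizeof seen_home - 1] = '\0';
}
```

A call such as `with_temp_env("PYTHONHOME", home, record_home)` leaves the process environment as it found it once the callback returns.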
The test calls Py_SetProgramName(). IMHO the bug is that the program name I don't think that the bug is triggered by Py_DecodeLocale(). |
Ah, you're right - I forgot about this little hack in the other embedding tests: https://github.com/vstinner/cpython/blob/3fda852ba4d4040657a1b616a1ef60ad437b7845/Programs/_testembed.c#L11 I'll add "./" to the program name in the new test as well, and see if that's enough to make the failing build bots happy. |
Looking more closely at the code, I've realised Victor's right - there's no way for Py_DecodeLocale() to accidentally trigger an attempt to import the "encodings" module. Instead, the error is likely coming from the init_sys_streams step towards the end of the initialization process. The way the embedded test cases are currently being run unfortunately truncated that traceback. Rather than trying to improve the test case error reporting under the scope of this issue, I've instead filed https://bugs.python.org/issue32136 to cover factoring the runtime embedding tests out to their own test file. |
Successful test run on the Debian machine that failed above: And for the macOS Tiger machine: So I think we can call the regression fixed, which is where we wanted to get to before the next alpha release. |