logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

gpshead · 2019-04-05T08:15:51Z

BPO	36533
Nosy	@rhettinger, @gpshead, @ned-deily, @ambv, @cagney, @kevans91, @koubaa
PRs	bpo-36533: Reinit logging.Handler locks on fork(). #12704 [3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) #13170 bpo-1635741 port _curses_panel to multi-phase init (PEP 489) #21986 Update all Python Cookbook links after migration to GitHub #22205 bpo-40423: Optimization: use close_range(2) if available #22651

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/gpshead'
closed_at = <Date 2019-05-07.20:32:05.698>
created_at = <Date 2019-04-05.08:15:51.465>
labels = ['3.7', '3.8', 'type-bug', 'library', 'release-blocker']
title = 'logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential)'
updated_at = <Date 2020-10-11.20:42:06.792>
user = 'https://github.com/gpshead'

bugs.python.org fields:

activity = <Date 2020-10-11.20:42:06.792>
actor = 'kevans'
assignee = 'gregory.p.smith'
closed = True
closed_date = <Date 2019-05-07.20:32:05.698>
closer = 'gregory.p.smith'
components = ['Library (Lib)']
creation = <Date 2019-04-05.08:15:51.465>
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 36533
keywords = ['patch', '3.7regression']
message_count = 19.0
messages = ['339472', '339474', '339475', '339480', '339481', '339503', '339505', '339529', '339650', '339688', '339760', '339825', '339863', '339864', '339882', '341543', '341768', '341815', '341816']
nosy_count = 9.0
nosy_names = ['rhettinger', 'gregory.p.smith', 'ned.deily', 'lukasz.langa', 'cagney', 'kevans', 'hugh', 'ebizzinfotech', 'koubaa']
pr_nums = ['12704', '13170', '21986', '22205', '22651']
priority = 'release blocker'
resolution = 'fixed'
stage = 'commit review'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue36533'
versions = ['Python 3.7', 'Python 3.8']

gpshead · 2019-04-05T08:15:51Z

I'm spawning a dicussion buried in the way too long thread of https://bugs.python.org/issue6721 over here into its own specific issue to treat as a 3.7 release blocker for a rollback or repair decision before 3.7.4.

3b69993

I believe that was released in 3.7.1 is leading to a behavior regression for an application (the Fedora installer's libreswan kvmrunner?). Full details can be found in the messages of the other issue starting with:
https://bugs.python.org/issue6721#msg329474

TL;DR - logging.Handler instances each have their own threading.Rlock.
libreswan implements at least one logging.Handler subclass. That subclass's custom emit() implementation directly calls potentially many other sub-handlers emit() methods. Some of those emit() methods (such as logging.StreamHandler) call flush() which acquires the handler's lock.

So they've got a dependency between these two locks, the first's must be acquired before the second.

But the logging module APIs have no concept of sub-handlers and lock ordering.

I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork) but this still caused a behavior regression in the stable 3.7 release.

(more comments coming as followups to avoid a wall of text with too many topics)

gpshead · 2019-04-05T08:23:10Z

That custom DebugHandler's emit() implementation that calls into one or more sub-handlers suggests that libreswan _might_ be able to fix it in the custom DebugHandler by implementing custom acquire() and release() methods... BUT that is a fundamentally flawed problem: It requires acquiring multiple locks in a single atomic operation OR guaranteeing that they will always be acquired in a single specific order.

Given the logging module's fork-time code maintains no order, even implementing a custom acquire() and release() method in your custom DebugHandler would not be sufficient as sub-handler locks could be acquired first during fork. also, that would be assuming your sub-handlers are not also used directly elsewhere within all possible coexisting logger configurations.

An implementable _gross workaround_ that libreswan or anything else in this boat could implement satisfying that constraint would be to force all potentially related handlers to share the same single RLock instance. :(

gpshead · 2019-04-05T08:41:40Z

Within the logging module we could go beyond using a WeakSet and maintain an ordering. But we'd need to allow a way for Handler subclasses to indicate what order matters; that'd require a new API (not suitable for 3.7)

An ordering itself may be insufficient as code can implement arbitrary Handler code with cyclic graphs of handlers calling directly into other handlers such that no possible order could ever satisfy them all.

This pulls us back into "stop having multiple locks" for any code that creates relationships between handlers. Which is good advice for Handler authors and something we should document if it isn't already.

Side note: We should also document that Handler.emit() as an API MUST be called with the Handler's lock acquired via Handler.acquire() first. That is the undocumented contracted way logging.Logger.handle()'s code always calls into handler emit().... The libreswan code calling other handler's emit() methods without doing that smells wrong. But that has no relation to their current problem: Even if it did, it'd still deadlock, just sooner rather than within the other handler's emit()->flush()->acquire() call chain.

Logging library users are better off never directly calling another handler. Put message onto queues with a dedicated thread processing those into the relevant handlers. Only block waiting for those queued items to have been processed _after_ releasing your own lock if you want to maintain the synchronicity of logging API calls returning only after the message has been handled.

gpshead · 2019-04-05T09:02:35Z

Now to back up: Why was this application using fork() in a threaded application at all? That is a fundamental flaw. Should we be doing work to support things that so far merely _happen_ to work on such broken designs?

Another alternative for the DebugHandler implementation in whatever code implemented it is for it to de-register itself from the logging library's private WeakSet of handler locks to acquire at fork time from its own constructor. We don't have a public API for this so the workaround doing that on 3.7.1 - 3.7.3 would look like:

   def __init__(self, ...):
       super().__init__(...)
       if hasattr(logging, '_at_fork_acquire_release_weakset'):
           logging._at_fork_acquire_release_weakset.discard(self)

This means it'd still have the bug the code already had on all prior to versions of Python before this logging library "acquire the locks before fork" fix went in: Namely if the forked child tries to log it could deadlock due to forking while the inherited logging handler lock held.

gpshead · 2019-04-05T09:02:41Z

A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process. Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls. (ex: to avoid multiple processes writing to the same file or fd at once)

cagney · 2019-04-05T15:11:07Z

On Fri, 5 Apr 2019 at 04:15, Gregory P. Smith <report@bugs.python.org> wrote:

New submission from Gregory P. Smith <greg@krypto.org>:

I'm spawning a dicussion buried in the way too long thread of https://bugs.python.org/issue6721 over here into its own specific issue to treat as a 3.7 release blocker for a rollback or repair decision before 3.7.4.

3b69993

I believe that was released in 3.7.1 is leading to a behavior regression for an application (the Fedora installer's libreswan kvmrunner?). Full details can be found in the messages of the other issue starting with:
https://bugs.python.org/issue6721#msg329474

Two separate applications have tripped up on this:

Fedora's installer aka anaconda

I don't know any further details.

libreswan's test framework aka kvmrunner

kvmrunner uses pexpect
pexpect uses ptyprocess
ptyprocess uses pty.fork() + os.exec*(); but, hey, who knew! I didn't
until this week.
this seems to be a long standing concern:
pexpect/ptyprocess#43

TL;DR - logging.Handler instances each have their own threading.Rlock.
libreswan implements at least one logging.Handler subclass. That subclass's custom emit() implementation directly calls potentially many other sub-handlers emit() methods. Some of those emit() methods (such as logging.StreamHandler) call flush() which acquires the handler's lock.

# Implement log-level inversion.

# Ref: https://docs.python.org/2/howto/logging.html#logging-flow

# By default, a parent (root) logger, regardless of its log-level,
# will log all the records logged by a child. For instance, if a
# child logger is logging at DEBUG-level, then the parent will log
# (display on the console) those DEBUG-level records even when it has
# been configured to log only INFO-level records. This is because the
# default log-level ("Logger enabled for level of call?") check is
# only applied once at record-creation.

# This code allows DEBUG-level logging to a file, while simultaneously
# (the inversion) restricting console log records to just INFO-level
# say.

The logging.Handler's lock ensures that the two sub-streams are kept
in ordered. It prevents threads interleaving their writes.

So they've got a dependency between these two locks, the first's must be acquired before the second.

But the logging module APIs have no concept of sub-handlers and lock ordering.

I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork) but this still caused a behavior regression in the stable 3.7 release.

Always first understand the problem.

(more comments coming as followups to avoid a wall of text with too many topics)

----------
assignee: gregory.p.smith
components: Library (Lib)
keywords: 3.7regression
messages: 339472
nosy: cagney, gregory.p.smith, ned.deily, vstinner
priority: release blocker
severity: normal
status: open
title: logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential)
type: behavior
versions: Python 3.7

Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue36533\>

cagney · 2019-04-05T16:05:36Z

On Fri, 5 Apr 2019 at 05:02, Gregory P. Smith <report@bugs.python.org> wrote:

Gregory P. Smith <greg@krypto.org> added the comment:

A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process. Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls. (ex: to avoid multiple processes writing to the same file or fd at once)

FYI, below is a simpler, but related, test.
_at_fork_acquire_release_weakset doesn't seem to be locked:

Ignoring exception from logging atfork <Handler (NOTSET)> release
method: cannot release un-acquired lock

Exception ignored in: <function _after_at_fork_weak_calls at 0x7f7307550378>
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line
269, in _after_at_fork_weak_calls
    _at_fork_weak_calls('release')
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line
254, in _at_fork_weak_calls
    for instance in _at_fork_acquire_release_weakset:
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917, in
_bootstrap_inner
    self.run()
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "./btc.py", line 11, in lockie
    h = logging.Handler()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line
824, in __init__
    self.createLock()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line
847, in createLock
    _register_at_fork_acquire_release(self)
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line
250, in _register_at_fork_acquire_release
    _at_fork_acquire_release_weakset.add(instance)
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 83, in add
    self._commit_removals()
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 56, in
_commit_removals
    discard(l.pop())
IndexError: pop from empty list

import logging
import os
import sys
import threading

def lockie():
    while True:
        # this adds handle to _at_fork_acquire_release_weakset
        #sys.stdout.write(".")
        #sys.stdout.flush()
        h = logging.Handler()

threading.Thread(target=lockie).start()

for n in range(0,100000):
    if n % 100 == 0:
        sys.stdout.write("%d" % n)
        sys.stdout.flush()
    pid = os.fork()
    if pid != 0:
        os.wait()
    else:
        os._exit(0)

gpshead · 2019-04-06T08:03:15Z

Here's a PR that goes the re-initialization route rather than attempting to acquire/release anything other than the single module _lock.

cagney · 2019-04-08T15:32:23Z

I think the only pragmatic solution here is to add an optional parameter to logging.basicConfig() that specifies that the logger should use a single global lock; and then start documenting that thread locks and fork() don't work well together.

And note that this solution is pragmatic, not correct (@dhr, when we discussed this off line, pointed out that since python's using threads then a "correct" solution would be to use some sort of inter-process lock).

For instance, if I were to implement emit() as something contrived like:

    with lock:
        s = record.to-string()
        for i in 1 .. s'length:
            b[i] = s[i]
        for i in 1 .. b'length:
            stdout.write(b[i]).flush()
        b = []

then when fork() breaks 'lock' the guarantee that the code is atomic is also broken:

when the child enters the code 'b' is undefined
the guarantee that log records don't interleave is lost

while a global lock would help mitigate the first case it really isn't a "correct" fix.

gpshead · 2019-04-09T00:44:39Z

The logging library has never guaranteed that it wouldn't interleave or duplicate buffered io output when fork() in involved. We should not start trying to claim that it does. Too complicated and fragile. It can't. Users who want that should implement their own single logging handler that is well behaved in whatever regards they need. there are a ton of options for them.

A basicConfig setting to switch the mode of operation to a single lock instead of per-handler locks could only happen in 3.8 (new feature). I'll ponder if it is even feasible to add that after the re-init change goes in. Even with such a feature, library user authored Handler subclasses would always be free to violate that. We can't protect people from themselves.

cagney · 2019-04-09T15:19:02Z

I pointed out two issues with breaking the locks.

Your response addressed the less important issue:

the guarantee that log records don't interleave is lost

Since the documentation seems to be silent the guarantee (or expectation) is implied - logging systems emit records atomically - if this isn't true across fork() then the exception should be documented.

but it didn't address the more fundamental problem:

when the child enters the code 'b' is undefined

i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).

For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang.

gpshead · 2019-04-10T05:44:21Z

"""
Since the documentation seems to be silent the guarantee (or expectation) is implied - logging systems emit records atomically - if this isn't true across fork() then the exception should be documented.
"""

We can add a note to the docs. As a general rule: documentation being silent means there are zero guarantees. Behavior of the existing implementations over time is the only source of truth.

"""
but it didn't address the more fundamental problem:

when the child enters the code 'b' is undefined

i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).
"""

nope. the contrived emit() pseudocode from msg339650 never defined b in the first place. that code, if massaged into python syntax would raise an UnboundLocalError no matter who executed it. a fork() from another thread would not change that. There is no concept of undefined/uninitialized state from the Python perspective.

The motivation for the original fix was entirely deadlock avoidance in code that uses fork. It failed and introduced an alternate form of deadlock in code that had been lucky up until that point. Thus my new PR proposing to fix the regression by making that not fail yet still preventing locks from starting out held in the child process.

cagney · 2019-04-10T13:45:47Z

nope. the contrived emit() pseudocode from msg339650 never defined b in the first place. that code, if massaged into python syntax would raise an UnboundLocalError no matter who executed it. a fork() from another thread would not change that. There is no concept of undefined/uninitialized state from the Python perspective.

Wow! It is pseudo code, right? (I figured using Ada like attributes was a bit of a giveaway). Nitpicking it achieves nothing (if you feel that your language needs 'b' pre-defined then, hey!, pre-define it; I'm pretty sure that won't fix the problem of operations such b[i] = 2 not being atomic yet alone thread-safe in Python).

Lets instead focus on my point that breaking the locks won't fix the problem. For instance, above. Or "For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang."

cagney · 2019-04-10T13:55:40Z

BTW, non-visible change might be to use a global readers-writer lock where fork() is the writer.

https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock

gpshead · 2019-04-10T19:08:47Z

The problem i am fixing is merely making Python's logging library not the source of this deadlock because it was not the source in the past. I am solving a regression in CPython behavior between 3.7.0 and 3.7.1 that led to a logging.Handler lock related deadlock in two identified problematic applications.

Breaking the logging.Handler owned locks does solve that unless someone has concrete proof from these specific applications that it does otherwise.

It would be helpful if someone *willing to actually be constructive* would test them with my PR's patch applied to their 3.7.3 interpreter as confirmation.

ambv · 2019-05-06T15:47:10Z

This deadlock is a release blocker for 3.8.0. FTR, I don't consider it blocking alphas and betas.

gpshead · 2019-05-07T16:18:26Z

New changeset 64aa6d2 by Gregory P. Smith in branch 'master':
bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704)
64aa6d2

gpshead · 2019-05-07T20:29:47Z

New changeset 3b4b28e by Gregory P. Smith in branch '3.7':
[3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) (GH-13170)
3b4b28e

gpshead · 2019-05-07T20:32:05Z

The regression should be fixed. It'd be helpful if owners of applications that were running into this could test their applications with this specific change.

gpshead added 3.7 (EOL) end of life release-blocker labels Apr 5, 2019

gpshead self-assigned this Apr 5, 2019

gpshead added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 5, 2019

gpshead added the 3.8 only security fixes label Apr 6, 2019

gpshead removed the 3.8 only security fixes label May 7, 2019

gpshead added the 3.8 only security fixes label May 7, 2019

gpshead closed this as completed May 7, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

cagney mannequin commented Apr 5, 2019

cagney mannequin commented Apr 5, 2019

gpshead commented Apr 6, 2019

cagney mannequin commented Apr 8, 2019

gpshead commented Apr 9, 2019

cagney mannequin commented Apr 9, 2019

gpshead commented Apr 10, 2019

cagney mannequin commented Apr 10, 2019

cagney mannequin commented Apr 10, 2019

gpshead commented Apr 10, 2019

ambv commented May 6, 2019

gpshead commented May 7, 2019

gpshead commented May 7, 2019

gpshead commented May 7, 2019

logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

Comments

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

gpshead commented Apr 5, 2019

cagney mannequin commented Apr 5, 2019

cagney mannequin commented Apr 5, 2019

gpshead commented Apr 6, 2019

cagney mannequin commented Apr 8, 2019

gpshead commented Apr 9, 2019

cagney mannequin commented Apr 9, 2019

gpshead commented Apr 10, 2019

cagney mannequin commented Apr 10, 2019

cagney mannequin commented Apr 10, 2019

gpshead commented Apr 10, 2019

ambv commented May 6, 2019

gpshead commented May 7, 2019

gpshead commented May 7, 2019

gpshead commented May 7, 2019