Message 389034 - Python tracker


Author	anentropic
Recipients	anentropic
Date	2021-03-18.18:58:25
Message-id	<1616093906.25.0.990565288145.issue43546@roundup.psfhosted.org>
In-reply-to

Content

We have a Django 2.2.19 project on Python 3.9.2 on Debian (slim-buster) in Docker.

A bizarre problem started happening to us this week.

First I'll show the symptom. We started getting the following error:

...
  File "/root/.pyenv/versions/3.9.2/lib/python3.9/site-packages/django/db/migrations/autodetector.py", line 10, in <module>
    from django.db.migrations.optimizer import MigrationOptimizer
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load
  File "<frozen importlib._bootstrap>", line 158, in __enter__
  File "<frozen importlib._bootstrap>", line 110, in acquire
KeyError: 140426340123264

If I look at the source for _bootstrap.py, that error should be impossible:
https://github.com/python/cpython/blob/v3.9.2/Lib/importlib/_bootstrap.py#L110

At the top of the acquire method it does:

    tid = _thread.get_ident()
    _blocking_on[tid] = self

and then on line 110 where we get the KeyError:

    del _blocking_on[tid]

`tid` is a local variable and `_blocking_on` is a module-level dict, and none of the other lines in the method touch either of them.
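
To make the pattern concrete, here is a minimal sketch of how I read that method. It is not the real importlib code: `blocking_on` and `acquire_sketch` are hypothetical stand-ins, and the lock-waiting loop is reduced to a single `return`. Run on its own, the key is always present when the `del` executes, which is why the KeyError looks impossible to me.

```python
import threading

# Hypothetical stand-in for the `_blocking_on` mapping in importlib._bootstrap.
blocking_on = {}


def acquire_sketch():
    """Register the calling thread, do the work, then unregister it."""
    tid = threading.get_ident()
    blocking_on[tid] = "lock placeholder"
    try:
        # The real method loops here until the lock can be taken.
        return True
    finally:
        # Mirrors the statement that raises KeyError in the traceback.
        del blocking_on[tid]


# Each thread only ever touches its own key, so no KeyError is expected.
results = []
threads = [
    threading.Thread(target=lambda: results.append(acquire_sketch()))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch every call adds and removes exactly one entry keyed by its own thread id, so the dict ends up empty again.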

So how do we get a KeyError?

I can only think that something mutates the underlying value of `tid`, but it's supposed to be an int so that's very weird.

I started with the symptom because our context for this is complicated to explain. I did find a fix that prevents the error but I do not understand the link between cause and effect.

Our context:
- we have a large unit test suite for the project which we run in Jenkins
- we split the tests across several Jenkins nodes to run in parallel in isolated docker environments
- we use some bash like this to split the test cases:
  find project/ -iname "test*.py" -print0 | \
    xargs --null grep -E '(def test)|(def step_)' -l | \
    split -n "r/$NODE_ID/$NODES" | \
    xargs ci/bin/run-tests
- ci/bin/run-tests is just a wrapper which calls Django's manage.py test command
  so it receives a list of filenames like "project/metrics/tests/test_client.py" as args
- using "nose" test runner via django-nose FWIW
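
For anyone not familiar with `split -n "r/$NODE_ID/$NODES"`: it hands out the input lines round-robin and prints only the share for one node. A rough Python equivalent, assuming 1-based node ids, looks like the sketch below. The function name `round_robin_shard` and the file paths are made up for illustration.

```python
def round_robin_shard(paths, node_id, nodes):
    """Return the paths assigned to node `node_id` (1-based) out of `nodes`."""
    if not 1 <= node_id <= nodes:
        raise ValueError("node_id must be between 1 and nodes")
    # Line 0 goes to node 1, line 1 to node 2, and so on, wrapping around.
    return paths[node_id - 1::nodes]


# Made-up file list standing in for the output of the find/grep stage.
files = [f"project/app{i}/tests/test_example.py" for i in range(7)]
shards = [round_robin_shard(files, k, 3) for k in (1, 2, 3)]
```

With this scheme each node gets a fixed subset of files for a given file list, which matches what we see: the same node always receives the same filenames.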

We currently split tests across 3 nodes, and it was always node 2 which would fail.
I found that commenting out a test case in any of the files being passed to node 2 would prevent the error from occurring.
Note that in this case we are still passing *exactly the same filenames* as cli args to the test runner.

Splitting the tests across 4 nodes instead of 3 also seems to prevent the error.
So it seems like, in some way I don't understand, we just have too many test cases.
Perhaps nose is doing something wrong or inefficient when given lots of filenames.

But I'm reporting here because the error we get from importlib._bootstrap looks like it should be impossible.
