msg344772 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-05 22:05 |
We've observed instances of Decimal divisions being incorrect a few times per day in a web application serving millions of requests daily. I've been unable to reproduce the issue but have been investigating core dumps which suggest either some issue in the C Python implementation or its execution environment.
Python 2.7.15 on Alpine Linux Docker containers.
A slightly modified decimal.py is attached. It's been modified to check that the answer is approximately correct and dumps core if not. It also has a few local variables added to provide additional insight. I've annotated the file with values from the core dump and my reasoning about what should happen.
The crux of the problem is the loop at line 1389. There are 3 ways to determine how many times this loop executed:
1. the number of zeros removed from coeff
2. the number of times exp is incremented
3. the number of times division_counter is incremented
Oddly, #1 and #3 indicate the loop happened 27 times, while #2 indicates the loop happened 28 times.
One possible explanation (which makes about as much sense as any other) is that `exp += 1` sometimes adds 2 instead of 1. This means the loop happens 1 time fewer than it should, leaving `coeff` 10 times bigger than it should properly be. Or (very rarely) this happens twice and the result is 100 times bigger.
I find it very odd that something as basic as `+= 1` should not work, but at the moment it is the best explanation I have.
Unfortunately I can't share the core dump, as I've only been able to observe the issue in production so the core contains private information. But I'd welcome any ideas for further exploration. Or perhaps someone is aware of an existing bug that would explain this behavior.
|
msg344774 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-05 22:36 |
I should note that while we do observe this issue with Decimal, my troubleshooting so far indicates the logic in decimal.py is fine. It seems instead the instructions in decimal.py aren't being correctly executed.
I'm not sure if the problem is limited to only Decimal, or if we just notice it there because it's correlated with financial code, which when encountering the bug, introduces an error which is eventually caught by the accounting department. There could very well be errors happening at least frequently elsewhere which simply go unnoticed.
|
msg344777 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2019-06-05 23:02 |
Thank you for the detailed report. So, in an attempt to summarize,
this is the crucial passage:
>>> ideal_exp = -1
>>> exp = -29
>>> coeff = 150000000000000000000000000000
>>> division_counter = 0
>>> while exp < ideal_exp and coeff % 10 == 0:
... coeff //= 10
... exp += 1
... division_counter += 1
...
>>> coeff
15L
>>> exp
-1
>>> division_counter
28
But you sometimes get coeff==150 and division_counter==27 instead.
Is that correct?
The first thing to consider is whether Alpine Linux uses a patched
Python version or if you can reproduce this on Python 2.7.15 compiled
from source.
|
msg344787 - (view) |
Author: Tim Peters (tim.peters) * |
Date: 2019-06-06 04:58 |
Also basic: run hardware CPU and memory stress diagnostics, and/or try running the same thing on a different machine. Hardware isn't infallible, and can fail in nearly arbitrary ways. For example, perhaps a smidgen of silicon has gone flaky, so that one time in a billion bit 0 "leaks" into bit 1 in a register. That can effectively change the integer 1 to the integer 2.
Is that likely? No. But a loop doing mutually inconsistent things to the variables it's changing is impossible ;-)
|
msg344834 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-06 16:31 |
skrah: Yes, that's correct. Since I can only produce this bug in production it will take me some days to build and validate a source build. But absent any better ideas, I will try.
tim.peters: I've observed this bug across hundreds of EC2 hosts, in dozens of code paths, with all kinds of inputs. Moreover, the hosts aren't displaying any other symptoms of hardware failure such as random segfaults or mysteriously corrupted data.
I've also deeply investigated two cores now which show specifically that `exp` seems to get 2 added when it should have been 1. I have a hard time explaining how a hardware failure can cause precisely the same failure so reliably.
So I doubt hardware is to blame.
Although, it does seem the issue occurs in "clumps" on individual hosts. So we might go 10 hours without seeing the issue, then it may happen 5 times within 30 minutes on one host. We might observe 1 or 2 more such clumps on the same host until the next deploy of the application, at which point all the containers are replaced with fresh ones. So this suggests there is some ephemeral state within a host that creates a propensity for the issue.
I've also been unable to reproduce the problem in a development environment, even when that development environment is using the same kernel, instance class, and docker container as production. So I suspect the bug is precipitated by some particular concurrency or interaction that I haven't been able to replicate.
|
msg344839 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2019-06-06 16:49 |
There are lots of ideas still. :)
As I said, on a Linux other than RedHat or Debian or such I'd first try
to blame the shipped Python, then the libc (I had a glibc memmove issue
once with similar inexplicable results).
Alpine Linux apparently uses musl. Is that well supported and tested?
Since you mention concurrency: Before the arrival of contextvars in 3.7
the decimal context is thread-safe, but not coroutine-safe.
If the issue is really in that particular loop I can't see how
that could be the cause, but perhaps the core dump is not accurate.
|
msg344844 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-06 17:21 |
> Alpine Linux apparently uses musl. Is that well supported and tested?
Heh. Not really.
Switching to a Debian container is probably just about as much work as building Python from source. Would that be preferable?
|
msg344845 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2019-06-06 17:29 |
Yes, I'd try Debian first, then potential libc issues can also be eliminated in a single experiment.
|
msg344847 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-06 18:04 |
So this is um....interesting....
(gdb) p small_ints[0]
$2 = (PyIntObject *) 0x558ce65df528
(gdb) p *small_ints[0]
$3 = {ob_refcnt = 18, ob_type = 0x7fe019b694c0 <PyInt_Type>, ob_ival = -5}
(gdb) p *small_ints[1]
$4 = {ob_refcnt = 65, ob_type = 0x7fe019b694c0 <PyInt_Type>, ob_ival = -3}
(gdb) p *small_ints[2]
$5 = {ob_refcnt = 204, ob_type = 0x7fe019b694c0 <PyInt_Type>, ob_ival = -3}
(gdb) p *small_ints[3]
$6 = {ob_refcnt = 1872, ob_type = 0x7fe019b694c0 <PyInt_Type>, ob_ival = -2}
When is 1/1=10? When -5 + 1 = -3.
So now I suppose the question is how this comes to be.
|
msg344853 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-06-06 19:10 |
Interesting indeed.
What 3rd party extension are installed in your Python environment? Is it possible that one of them is messing with the CPython internals (perhaps in an overeager attempt to optimize)? Any uses of ctypes?
|
msg344854 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-06-06 19:14 |
This smells a bit as though someone's doing something like:
myobject = PyInt_FromLong(some_long);
and then assuming that they hold the only reference to `myobject`, so that it's safe to mutate in place.
|
msg344878 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-06 21:40 |
Probably too many C extensions to feasibly audit them all. Also we can't rule out something busted in Alpine. So I'm going to set a watchpoint on small_ints[1]->ob_ival with a little gdb script to dump core right when it changes; hopefully inspecting that backtrace will lead directly to the culprit.
|
msg344888 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2019-06-06 23:28 |
On Python 3, you can use tracemalloc.get_object_traceback() to find where an object was allocated ;-)
|
msg344987 - (view) |
Author: Phil Frost (Phil Frost) |
Date: 2019-06-07 18:10 |
I'm afraid that won't be of much use since the object is statically allocated. Besides, the trouble isn't finding what created the object, but what mutated an int object which should be immutable.
|
msg345007 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2019-06-07 20:43 |
Can you try to reproduce your workflow with Python compiled in debug mode (./configure --with-pydebug)? With Python 3.8, you can now use a debug Python without having to recompile your C extensions :-)
|
msg347567 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2019-07-09 17:02 |
@Phil: Did you ever get to the bottom of this?
|
msg347626 - (view) |
Author: Phil Frost (Phil Frost2) |
Date: 2019-07-10 15:58 |
Not yet. I had a hell of a time getting gdb to follow gunicorn forks so it would watch the workers but not also every other thing using popen or os.system(). And the lldb package on Alpine doesn't seem to work.
So we're currently in the process of testing a Debian container for production, hoping the issue just goes away. If not, I'll continue digging in with a debugger.
|
msg376123 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2020-08-30 19:08 |
If nothing has been diagnosed, I suggest closing this.
|
msg376124 - (view) |
Author: Tim Peters (tim.peters) * |
Date: 2020-08-30 19:27 |
Closing, since it remains a unique report and there hasn't been another word about it in over a year.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:59:16 | admin | set | github: 81349 |
2020-08-30 19:27:28 | tim.peters | set | status: pending -> closed resolution: works for me messages:
+ msg376124
stage: resolved |
2020-08-30 19:08:01 | skrah | set | status: open -> pending
messages:
+ msg376123 |
2020-08-30 19:07:08 | skrah | set | title: Decimal divisions sometimes 10x or 100x too large -> small_ints[] modified (third party C-extension?) |
2020-08-30 18:35:24 | facundobatista | set | nosy:
- facundobatista
|
2019-09-14 09:57:36 | steven.daprano | set | nosy:
+ steven.daprano
|
2019-07-10 15:58:02 | Phil Frost2 | set | nosy:
+ Phil Frost2 messages:
+ msg347626
|
2019-07-09 17:02:22 | mark.dickinson | set | messages:
+ msg347567 |
2019-06-07 20:43:46 | vstinner | set | messages:
+ msg345007 |
2019-06-07 18:10:53 | Phil Frost | set | messages:
+ msg344987 |
2019-06-06 23:28:48 | vstinner | set | messages:
+ msg344888 |
2019-06-06 21:40:17 | Phil Frost | set | messages:
+ msg344878 |
2019-06-06 19:14:45 | mark.dickinson | set | messages:
+ msg344854 |
2019-06-06 19:10:42 | mark.dickinson | set | messages:
+ msg344853 |
2019-06-06 18:04:50 | Phil Frost | set | messages:
+ msg344847 |
2019-06-06 18:02:20 | Jeffrey.Kintscher | set | nosy:
+ Jeffrey.Kintscher
|
2019-06-06 17:29:12 | skrah | set | messages:
+ msg344845 |
2019-06-06 17:21:53 | Phil Frost | set | messages:
+ msg344844 |
2019-06-06 16:49:32 | skrah | set | messages:
+ msg344839 |
2019-06-06 16:31:34 | Phil Frost | set | messages:
+ msg344834 |
2019-06-06 06:56:05 | vstinner | set | nosy:
+ vstinner
|
2019-06-06 04:58:09 | tim.peters | set | nosy:
+ tim.peters messages:
+ msg344787
|
2019-06-05 23:02:40 | skrah | set | messages:
+ msg344777 |
2019-06-05 22:36:25 | Phil Frost | set | messages:
+ msg344774 |
2019-06-05 22:13:21 | zach.ware | set | nosy:
+ rhettinger, facundobatista, mark.dickinson, skrah
|
2019-06-05 22:05:21 | Phil Frost | create | |