dtoa.c: oversize b in quorem, and a menagerie of other bugs #51881
Comments
In a debug build: Python 3.2a0 (py3k:76671M, Dec 22 2009, 19:41:08)
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "2183167012312112312312.23538020374420446192e-370"
[30473 refs]
>>> f = float(s)
oversize b in quorem |
Nice catch! I'll take a look. We should find out whether this is something that happens with Gay's original code, or whether it was introduced in the process of adapting that code for Python. |
I can reproduce this on OS X 10.6 (64-bit), both in py3k and trunk debug builds. In non-debug builds it appears to return the correct result (0.0), so the oversize b appears to have no ill effects. So this may just be an overeager assert; it may be a symptom of a deeper problem, though. |
I'm testing on a Fedora Core 6 i386 box and an Intel Mac 32-bit 10.5 box. I only see this on debug builds. I've tested trunk, py3k, release31-maint, and release26-maint (just for giggles). The error shows up in debug builds of trunk, py3k, and release31-maint on both machines, and does not show up in non-debug builds. |
The bug is present in the current version of dtoa.c from http://www.netlib.org/fp, so I'll report it upstream. As far as I can tell, though, it's benign, in the sense that if the check is disabled then nothing bad happens, and the correct result is eventually returned (albeit after some unnecessary computation). I suspect that the problem is in the if block around lines 1531--1543 of Python/dtoa.c: a subnormal rv isn't being handled correctly here---it should end up being set to 0.0, but is instead set to 2**-968. |
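For reference, the input really is far below half the smallest subnormal double, so the correctly rounded result is 0.0; this is easy to confirm with exact rational arithmetic (a quick sanity check, not part of the patch):

```python
from fractions import Fraction

s = "2183167012312112312312.23538020374420446192e-370"
# the exact value is about 2.2e-349, far below half the smallest
# subnormal double (about 2.5e-324), so it must round to exactly 0.0
assert float(Fraction(s)) == 0.0
# ...and nowhere near the 2**-968 produced by the suspect branch
assert 2.0 ** -968 != 0.0
```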
Here's a patch that seems to fix the problem; I'll wait a while to see if I get a response from David Gay before applying this. Also, if we've got to the stage of modifying the algorithmic part of the original dtoa.c, we should really make sure that we've got our own set of comprehensive tests. |
Randomised testing quickly turned up another troublesome string for str -> float conversion:

    s = "94393431193180696942841837085033647913224148539854e-358"

This one's actually giving incorrectly rounded results (the horror!) in a non-debug build of trunk, and the same 'oversize b in quorem' error in a debug build. With the patch, it doesn't give the 'oversize b' error, but it does still give incorrect results.

Python 2.7a1+ (trunk:77375, Jan 8 2010, 20:33:59)
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "94393431193180696942841837085033647913224148539854e-358"
>>> float(s) # result of dtoa.c
9.439343119318067e-309
>>> from __future__ import division
>>> int(s[:-5])/10**358 # result via (correctly rounded) division
9.43934311931807e-309

I also double-checked this value using a simple pure Python implementation of strtod, and using MPFR (via the Python bigfloat module), with the same result:

>>> from test_dtoa import strtod
>>> strtod(s) # result via a simple pure Python implementation of strtod
9.43934311931807e-309
>>> from bigfloat import *
>>> with double_precision: x = float(BigFloat(s))
>>> x # result from MPFR, via the bigfloat module
9.43934311931807e-309 |
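The pure Python test_dtoa module referred to above isn't attached here, but a minimal correctly rounded reference conversion is easy to sketch: CPython's integer true division is correctly rounded, so evaluating the string as an exact rational gives the right answer (a sketch in that spirit, not the actual test_dtoa code):

```python
from fractions import Fraction

def strtod_ref(s):
    # int / int true division is correctly rounded in CPython, so this
    # rounds the exact decimal value to the nearest double (ties to even)
    return float(Fraction(s))

s = "94393431193180696942841837085033647913224148539854e-358"
assert strtod_ref(s) == float("9.43934311931807e-309")
```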
Okay, I think I've found the cause of the second rounding bug above: at the end of the bigcomp function there's a correction block that looks like

    ...
    else if (dd < 0) {
        if (!dsign)	/* does not happen for round-near */
    retlow1:
            dval(rv) -= ulp(rv);
    }
    else if (dd > 0) {
        if (dsign) {
    rethi1:
            dval(rv) += ulp(rv);
        }
    }
    else ...

The problem is that the += and -= corrections don't take into account the possibility that bc->scale is nonzero, and in the case where the scaled rv is subnormal, they'll typically have no effect. I'll work on a fix... tomorrow. |
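The failure mode, that an ulp-sized correction computed at the wrong scale is simply absorbed by rounding, is easy to demonstrate in Python (an illustration of the floating-point effect, not of the C code itself):

```python
import math

# a nudge well below half an ulp is absorbed by rounding...
assert 1.0 + 1e-17 == 1.0          # ulp(1.0) is about 2.2e-16
# ...while a nudge of the right magnitude moves the value
assert 1.0 + math.ulp(1.0) > 1.0
# in the subnormal range the spacing is the minimum subnormal, so a
# correction sized for the scaled (normalised) value can vanish entirely
x = 1e-310
assert x + math.ulp(x) > x
```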
Second patch, adding a fix for the rounding bug to the first patch. |
Here's the (rather crude) testing program that turned up these errors. |
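The attachment itself isn't shown here, but the shape of such a tester is simple: generate random decimal strings and compare float() against an exact-rational reference (a sketch along those lines, not Mark's actual script; the names and seed are invented):

```python
import random
from fractions import Fraction

def check(s):
    # exact-rational reference: CPython's int/int division is
    # correctly rounded, so this catches any misrounded conversion
    assert float(s) == float(Fraction(s)), s

random.seed(7632)
for _ in range(1000):
    ndigits = random.randint(1, 50)
    digits = "".join(random.choice("0123456789") for _ in range(ndigits))
    exponent = random.randint(-326, -250)  # aim at the subnormal range
    check("%se%d" % (digits, exponent))
```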
One more incorrectly rounded result, this time for a normal number:

    AssertionError: Incorrectly rounded str->float conversion for 99999999999999994487665465554760717039532578546e-47: expected 0x1.0000000000000p+0, got 0x1.fffffffffffffp-1

|
Showing once again that a proof of FP code correctness is about as compelling as a proof of God's ontological status ;-) Still, have to express surprised admiration for 99999999999999994487665465554760717039532578546e-47! That one's not even close to being a "hard" case. |
Clearly we need a 1000-page Isabelle/HOL-style machine-checked formal proof, rather than a ten-page TeX proof. Any takers? All of the above bugs seem to have been introduced with the new 'bigcomp' code that arrived on March 16, 2009, just a couple of weeks before I downloaded the version that got adapted for Python; in retrospect, I probably should have used the NO_STRTOD_BIGCOMP #define to bypass the new code. |
Progress report: I've had a response, and fix, from David Gay for the first 2 bugs (Stefan's original bug and the incorrect subnormal result); I'm still arguing with him about a 3rd one (not reported here; there's some possibly incorrect code in bigcomp that probably never actually gets called). I reported the 4th bug (the incorrect rounding for values near 1) to him today. In the meantime, here's bug number 5, found by eyeballing the bigcomp code until it surrendered. :-)

>>> 1000000000000000000000000000000000000000e-16
1e+23
>>> 10000000000000000000000000000000000000000e-17
1.0000000000000001e+23 |
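Both spellings above denote exactly 10**23, so they must convert identically; a quick exact-arithmetic check of this family of zero-padded strings (a sanity check, not part of any patch) looks like:

```python
from fractions import Fraction

# every string below is exactly 10**23, just with more trailing zeros
# and a correspondingly smaller exponent; all must convert identically
for pad in range(39, 60):
    s = "1" + "0" * pad + "e-%d" % (pad - 23)
    assert Fraction(s) == 10 ** 23
    assert float(s) == float(Fraction(s))
```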
Fixed the crash that Stefan originally reported in r77450. That revision also removes the 'possibly incorrect code in bigcomp that probably never actually gets called'. |
Second bug fixed in r77451 (trunk), using a fix from David Gay, modified slightly for correctness. |
Merged fixes so far, and a couple of other cleanups, to py3k in r77452, and release31-maint in r77453. |
Just so I don't forget, there are a couple more places in dtoa.c that look suspicious and need to be checked; I haven't tried to generate failures for them yet. Since we're up to bug 5, I'll number these 6 and 7:

(6) At the end of bigcomp, when applying the round-to-even rule for halfway cases, the lsb of rv is checked. This looks wrong if bc->scale is nonzero.

(7) In the main strtod correction loop, after computing delta and i, there's a block:

    bc.nd = nd;
    i = -1;	/* Discarded digits make delta smaller. */

This logic seems invalid if all the discarded digits are zero. (This is the same logic error that's causing bug 5: the bigcomp comparison code also incorrectly assumes that digit nd-1 of s0 is nonzero.) |
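The all-discarded-digits-zero scenario of bug 7 corresponds to inputs like the ones below, where an 18-digit core is padded with zeros past any truncation limit (the limit of 18 is taken from the surrounding discussion; whether these particular strings misround depends on details not checked here, so this is just a probe):

```python
from fractions import Fraction

# pad a fixed 18-digit core with zeros beyond the truncation limit,
# adjusting the exponent so every string denotes the same value
core = "123456789123456789"
for pad in (0, 10, 30, 60):
    s = core + "0" * pad + "e-%d" % (300 + pad)
    assert Fraction(s) == Fraction(core + "e-300")
    assert float(s) == float(core + "e-300")
```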
Bug 6 is indeed a bug; an example incorrectly-rounded string is:

It's fixed in r77491. I'll add tests once the remaining (known) dtoa.c bugs are fixed. |
Bug 4 fixed in r77492. This just leaves bugs 5 and 7; I have a fix for these in the works. |
Tests committed in r77493. |
Fixes and tests so far merged to py3k in r77494, release31-maint in r77496. |
I was considering downgrading this to 'normal'. Then I found bug 8, and it's a biggie:

>>> 10.900000000000000012345678912345678912345
10.0

Now I'm thinking it should be upgraded to release blocker instead. The cause is in the _Py_strtod block that starts with 'if (nd > STRTOD_DIGLIM) {': it truncates the input to 18 digits and then deletes trailing zeros, but the code that deletes the zeros is buggy, and passes over the digit '9' just before the point. |
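The intended transformation is easy to state: keep the first 18 significant digits, then strip trailing zeros from that kept prefix, folding the count of dropped digits into the exponent. A correct sketch of the digit-string operation (illustrative Python, not the C code; the limit of 18 comes from the comment above):

```python
def truncate_significant(digits, diglim=18):
    """Keep the first diglim significant digits, then strip trailing
    zeros from the kept prefix; the caller folds the number of dropped
    digits into the exponent."""
    kept = digits[:diglim].rstrip("0")
    return kept, len(digits) - len(kept)

# digit string of 10.900000000000000012345678912345678912345,
# with the decimal point removed:
kept, shift = truncate_significant(
    "10900000000000000012345678912345678912345")
assert kept == "109"   # the '9' must survive; losing it is what gives 10.0
```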
Mark, I agree that last one should be a release blocker -- it's truly dreadful. BTW, did you guess in advance just how many bugs there could be in this kind of code? I did ;-) |
Upgrading to release blocker. It'll almost certainly be fixed before the weekend is out. (And I will, of course, report it upstream.) |
Here's a patch for the release blocker. Eric, would you be interested in double-checking the logic for this patch? Tim: no, I have to admit I didn't foresee quite this number of bugs. :) |
issue7632_bug8.patch uploaded to Rietveld: |
It looks correct to me, assuming this comment is correct:
I didn't verify the comment itself. |
I have a few minor comments posted on Rietveld, but nothing that would keep you from checking this in. |
Applied the bug 8 patch in r77519 (thanks Eric for reviewing!). For safety, I'll leave this as a release blocker until fixes have been merged to py3k and release31-maint. I've uploaded a fix for bugs 5 and 7 to Rietveld: http://codereview.appspot.com/186182 I still don't like the parsing code much: I'm tempted to pull out the calculation of y and z and do it after the parsing is complete. It's probably marginally less efficient that way, but it would help make the code clearer. |
I've applied a minimal fix for bugs 5 and 7 in r77530 (trunk). (I wasn't able to produce any strings that trigger bug 7, so it may not technically be a bug.) I'm continuing to review, comment on, and clean up the rest of the _Py_dg_strtod code. |
Fixes merged to py3k and release31-maint in r77535 and r77537. |
One of the buildbots just produced a MemoryError from test_strtod:

http://www.python.org/dev/buildbot/all/builders/i386%20Ubuntu%203.x/builds/411

It looks as though there's a memory leak somewhere in dtoa.c. It's a bit difficult to tell, though, since the memory allocation functions in that file deliberately hold on to small pieces of memory. |
Okay, so there's a memory leak for overflowing values: if an overflow is detected in the main correction loop of _Py_dg_strtod, then 'references' to bd0, bd, bb, bs and delta aren't released. There may be other leaks; I'm trying to come up with a good way to detect them reliably. |
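A crude way to hammer the overflow path (under Valgrind, or while watching the process's memory footprint) is just a loop; the exact string doesn't matter as long as the parsed value overflows a double:

```python
# hammer the strtod overflow path; on a leaking build, watching the
# process's memory (or running under Valgrind) shows growth with the
# loop count
s = "1" + "0" * 400   # parses to a value far above DBL_MAX
for _ in range(10000):
    assert float(s) == float("inf")
```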
This is what Valgrind complains about:

    ==4750== 3,456 (1,440 direct, 2,016 indirect) bytes in 30 blocks are definitely lost in loss record 3,302 of 3,430
    ==4750== 9,680 bytes in 242 blocks are still reachable in loss record 3,369 of 3,430
    ==4750== 270,720 bytes in 1,692 blocks are indirectly lost in loss record 3,423 of 3,430
    ==4750== 382,080 bytes in 2,388 blocks are indirectly lost in loss record 3,424 of 3,430
    ==4750== 414,560 bytes in 2,591 blocks are indirectly lost in loss record 3,425 of 3,430
    ==4750== 414,960 (414,768 direct, 192 indirect) bytes in 2,604 blocks are definitely lost in loss record 3,426 of 3,430
    ==4750== 890,720 (532,960 direct, 357,760 indirect) bytes in 3,331 blocks are definitely lost in loss record 3,428 of 3,430
    ==4750== 1,021,280 (566,080 direct, 455,200 indirect) bytes in 3,538 blocks are definitely lost in loss record 3,429 of 3,430
    ==4750== 1,465,280 (676,640 direct, 788,640 indirect) bytes in 4,229 blocks are definitely lost in loss record 3,430 of 3,430

|
Stefan, thanks for that! I'm not entirely sure how to make use of it, though. Is there a way to tell Valgrind that some leaks are expected?

The main problem with leak detection is that dtoa.c deliberately keeps hold of any malloc'ed chunks less than a certain size (which I think is something like 2048 bytes, but I'm not sure). These chunks are never freed in normal use; instead, they're added to a bunch of free lists for the next time that strtod or dtoa is called. The logic isn't too complicated: it's in the functions Balloc and Bfree in dtoa.c.

So the right thing to do is just to check that for each call to strtod, the total number of calls to Balloc matches the total number of calls to Bfree with non-NULL argument. And similarly for dtoa, except that in that case one of the Balloc'd blocks gets returned to the caller (it's the caller's responsibility to call free_dtoa to free it when it's no longer needed), so there should be a difference of 1.

And there's one further wrinkle: dtoa.c maintains a list of powers of 5 of the form 5**2**k, and this list is automatically extended with newly allocated Bigints when necessary. Those Bigints are never freed either, so calls to Balloc from that source should be ignored. Another way round this is just to ignore any leak from the first call to strtod and then do a repeat call with the same parameters; the second call will already have all the powers of 5 it needs. |
Upgrading to release blocker again: the memory leak should be fixed for 2.7 (and more immediately, for 3.1.2). |
Mark, thanks for the explanation! - You can generate suppressions for the Misc/valgrind-python.supp file, but you have to know exactly which errors can be ignored. Going through the Valgrind output again, it looks like most of it is about what you already mentioned (bd0, bd, bb, bs and delta not being released). Would it be much work to provide Valgrind-friendly versions of Balloc, Bfree and pow5mult? Balloc and Bfree are already mentioned in an XXX |
Stefan, I'm not particularly familiar with Valgrind: can you tell me what would need to be done? Is a non-caching version of pow5mult all that's required? Here's the patch that I'm using to detect leaks at the moment. (It includes a slow pow5mult version.) |
With the latest dtoa.c, your non-caching pow5mult and a quick hack for Balloc and Bfree, I get zero (dtoa.c-related) Valgrind errors. So the attached memory_debugger.diff is pretty much all that's needed for Valgrind. |
Thanks, Stefan. Applied in r77589 (trunk), r77590 (py3k), r77591 (release31-maint) with one small change: I moved the freelist and p5s declarations inside the #ifndef Py_USING_MEMORY_DEBUGGER conditionals. The leak itself was fixed in revisions r77578 through r77580; from Stefan's Valgrind report, and my own refcount testing, it looks as though that was the only leak point. I haven't finished reviewing/testing the _Py_dg_strtod code yet, but I'm going to close this issue; if anything new turns up I'll open another one. |