New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve quality of Python/dtoa.c #53255
Comments
I've opened this issue to track efforts to improve the quality of the Python/dtoa.c file, which provides Python's string-to-float and float-to-string conversions. Particular issues in mind (non-exhaustive):
|
Created new 'py3k-dtoa' branch for this work in r82024. |
A couple of preparatory commits: r82025: In _Py_dg_strtod, 'e' now represents the adjusted exponent rather than the base exponent; that is, the input value is of the form +- m * 10**e with 0.1 <= m < 1.0. It's easier to produce such an 'e' from the parsing stage if we care about detecting overflow and underflow. r82031: Update the s2b function: remove a premature optimization in order to make s2b more general and its correctness more easily verifiable; alter the way that the input string is parsed so that it doesn't depend on nd0 being in the range [0, nd]. |
And here's a patch to pull out the parsing stage of _Py_dg_strtod into a separate function. |
r82032: Commit some additional tests for test_strtod.py. test_extra_long_significand will currently fail; with the dtoa_parsing patch, it passes. |
I realize the ship's already sailed on this issue [1], but isn't it a problem that editing this code makes it more difficult to apply patches from Gay's original code? What do we do if Gay releases a new version of his code with bug fixes? [1] Sorry non-native American-English speakers. I mean that we already have this problem since we've already modified the code. |
Gay's changes tend to be very small; any bugfixes he releases can likely be applied by hand, if they're relevant. I did originally want to keep close to Gay's code, but frankly I'm not very happy with the quality of that code; and in communications with Gay it's become clear that there are issues that will not be fixed upstream, but that I consider unacceptable for Python's copy of dtoa.c. Some examples of problems with the original code: (1) Gay's code does no checking of return values from malloc. We had to add those checks, which was the first point at which our code started diverging from his. (2) There's a segment at the beginning of the bigcomp function that's unnecessary, and in fact would produce incorrect results if it were ever called; it's just about possible to show that it *can't* ever be called. I've asked David Gay about this, but he insists that it's necessary. (I've removed it in Python's version of the code.) (3) The original code silently produces wrong results for huge inputs (more than 20000 characters); I know this is an extreme use case, but again I find this unacceptable for Python. (I haven't asked Gay about this; I'd be very surprised if he wanted to do anything about it, though.) (4) The original code is horribly convoluted in places, making it very difficult to check for correctness. (For example, see the spaghetti mess in the parsing section of strtod.c). |
Here's the section of the 'bigcomp' code that I was referring to above:
if (!(dig = quorem(b,d))) {
b = multadd(b, 10, 0); /* very unlikely */
dig = quorem(b,d);
} You can see it in the original source at http://www.netlib.org/fp/dtoa.c This code is part of the algorithm for strtod. Here b and d are Bigints, and b / d is a fraction that gives an approximation to the value of the input to strtod; the aim is to produce the digits of b / d one-by-one to compare them with the strtod input, and (eventually) use the result of that comparison work out whether to round up or down. If the condition of the 'if' block above is ever satisfied, b is multiplied by 10 (that's the multadd(b, 10, 0) call), so the fraction b / d is multiplied by 10 (with no corresponding correction for the strtod input string), and the wrong comparison is made! There are many other similar pieces of code in _Py_dg_strtod that can never get called (I've run coverage programs over the code to help verify this); trying to establish the correctness of the current code isn't easy. |
Just to be clear: I'm okay with this divergence, as long as we've made it |
Here's an updated version of the parsing patch, with rewritten comments, but no significant code changes. |
Mark, It is great to see you doing this. I looked at this code on several occasions before and each time ran away scared! I sincerely hope I will understand how it works after your rewrite. Just a small suggestion at this point: can you give longer names to args and local variables? The current alphabet soup is a bit confusing: c, s, s0, s1, se, s00(!), e, nd, nd0, pnd, pnd0 ... |
r82079: Apply a version of the parsing patch to pull the parsing code Alexander, I agree about the names; I'll do a mass renaming later on. I'm trying not to mix the significant algorithm-changing commits with trivial renaming/reindenting/commenting commits, to make it easier to review each independent change. |
It would be easier for me to review if you did it in the other order: fix the variable names first. Although I'm still pretty busy and won't have much time to review, so my needs shouldn't be your first priority. |
r82080: Whitespace fixes. r82087: Simplify the ratio function. The previous ratio function (actually, b2d), aborted if the numerator was zero, and the current code ends up requiring special cases for zero as a result of this. That restriction is now removed, which will allow further simplifications (to come) in strtod. |
Patch that does a fairly significant rewrite of strtod; it's still (mostly) based on Gay's code, but there are some significant changes. Some background: strtod, after dealing with easy cases, works roughly as follows: (1) Using floating-point arithmetic, create a double *rv* holding an approximation to the input value; this approximation may be out from the true value by several ulps (perhaps as much as 10 ulps; certainly not more than 100 ulps). (2) If the input string is very long, truncate it (accepting that this introduces a small error), and work with the truncated value below. (3) Use integer arithmetic to compute (an approximation to) ulps difference between *rv* and true value. Possibly return immediately if (4) Use the ulps difference to correct *rv* to a new value. (5) If the ulps difference has fractional part close to 0.5, or if the correction takes us past a power of 2, or if it takes use near/to the max representable double, or to 0.0, go around the correction loop again. (6) If we still can't decide (because the ulps difference is very close to 0.5), call bigcomp to settle the issue once and for all. The new patch simplifies the above procedure considerably:
The patch isn't quite ready to apply; I want to expand some of the comments a little. |
Another issue to consider, brought to my attention by Rick Regan: Python's dtoa.c code likely misbehaves (i.e., returns incorrectly rounded results) if the FPU rounding mode is anything other than the round-half-to-even default. (This is also true of Gay's original dtoa.c, I suspect.) For example, the quick-return path in strtod does a single floating-point operation on exact arguments, so will end up behaving according to the FPU rounding mode. The long integer-arithmetic-based path will likely return round-half-to-even results, independently of the FPU rounding mode. It's debatable what Python should do if the FPU rounding mode is something other than round-half-to-even. It can either:
I'd prefer the latter behaviour, for various reasons:
It seems possible that Python might one day want to be able to control the rounding direction of decimal <-> binary conversions, but when that day comes I don't think playing with the FPU rounding mode is going to be the best control mechanism. I don't regard this as a terribly serious issue, though: most normal users are unlikely to end up in a situation where the FPU rounding mode has changed unless they've been deliberately and knowingly messing with FPU settings. |
|
Agreed. +1 |
+1, if the FPU mask is always restored (as I understand, this is the case |
Second version of the strtod rewrite; has some additional documentation and comment fixes. No other significant changes from the first version. This is still a work in progress. |
r82614: add functionality to change FPU rounding mode (via float.__setround__ and float.__getround__ functions), on platforms that support the standard C99 fesetround and fegetround functions: >>> float.__getround__()
'tonearest'
>>> 1e300 * 1e300
inf
>>> float.__setround__("downward")
>>> 1e300 * 1e300
1.7976931348623157e+308 This is just temporary, so that I can test that FPU rounding mode doesn't affect results of string-to-float and float-to-string conversions. I'm not planning to merge any of r82614 back to py3k. |
Version of the rewrite_strtod patch applied in r83813. |
Mark, I think I know the motivation for this code, although I still don't know how it could hit. The halfway value H is scaled by a power of ten to put it in the form "d1.d2d3d4d5...". The power of ten exponent is derived from the input decimal string S, instead of computing it from H using logarithms. Now what if H's exponent does not match S's? I'm thinking of cases like S = 10^n and H = 9.99999999... * 10^(n-1). Scaling H by 10^-n would make it 0.999999999... . That leading 0 needs to be removed, by multiplying by 10, do put it in the right form. First of all, I don't know if a case like this is possible. Second of all, the check would fail either way (1 against 0 vs 1 against 9). BTW, b / d represents only the significant digits of H, so it shouldn't matter that there's no corresponding adjustment to the input string. To summarize: I'm not saying this code is necessary; I'm just saying it makes you wonder. |
Dropping this due to lack of time; unless anyone else wants to pick it up, it should probably be closed as "won't fix". |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: