classification
Title: micro-optimization: increase our float parsing speed by Nx
Type: enhancement Stage: needs patch
Components: Interpreter Core Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, gregory.p.smith, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, tim.peters
Priority: low Keywords:

Created on 2020-07-15 22:21 by gregory.p.smith, last changed 2020-10-19 17:27 by pitrou.

Messages (6)
msg373730 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-07-15 22:21
See https://lemire.me/blog/2020/03/10/fast-float-parsing-in-practice/ for inspiration and a reference (possibly a library to use, but if not the techniques still apply).

Primarily usable when creating the float objects from the string data as is common in python code and particularly common in JSON serialized data.
msg373743 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-16 05:57
CPython is written on C, not C++, so we cannot use std::from_chars.

Note also that to parse a number in JSON we need first to scan the PyUnicode object character-by-character using PyUnicode_READ() which is slower than just reading a byte from a memory, then convert a substring to a nul-terminated bytes string, and finally use the conversion function  which creates a new PyFloat or PyLong objects. The overhead of scanning a string and memory management may be larger than the parsing time itself.
msg373760 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2020-07-16 18:50
Is there a description of the algorithms or ideas used anywhere? The blog post you link to has no details, and I don't see any useful descriptions in the GitHub README or the source; trawling through the source to figure out what the key ideas are is not ideal. I don't find the tests particularly convincing, either.

I think this is too immature to consider for CPython, right now. Maybe it would make sense to revisit some day in the future if/when the library is more mature.
msg373769 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-07-16 20:38
Gregory, care to take their code and time it against Python?

I'm not inclined to: reading the comments in the code, they're trying "fast paths" already described in papers by Clinger and - later - by Gay.  When those fast paths don't apply, they fall back on the system `strtod`. But Python's current implementation builds on Gay's fastest code (and never falls back to the system).

It's possible they "improved" on Gay by building relatively giant tables of static data.
msg373775 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-07-16 22:00
I don't think we'd want to fall back to strtod, but rather to the code we already use (Gay's). So this would increase our maintenance burden.

I'm also not convinced that we actually spend a lot of time in this code, but of course profiling would be needed to make sure.
msg378969 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-10-19 17:27
FTR, no formal description of the algorithm seems to have been published, but the source code is amply commented:
https://github.com/lemire/fast_double_parser/blob/master/include/fast_double_parser.h
History
Date User Action Args
2020-10-19 17:27:20pitrousetpriority: normal -> low
2020-10-19 17:27:11pitrousetnosy: + pitrou
messages: + msg378969
2020-10-19 17:20:42pitroulinkissue42081 superseder
2020-07-16 22:00:57eric.smithsetmessages: + msg373775
2020-07-16 20:38:02tim.peterssetmessages: + msg373769
2020-07-16 18:50:06mark.dickinsonsetmessages: + msg373760
2020-07-16 05:57:27serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg373743
2020-07-16 05:16:39rhettingersetnosy: + tim.peters, rhettinger
2020-07-15 23:09:53eric.smithsetnosy: + mark.dickinson, eric.smith
2020-07-15 22:21:55gregory.p.smithcreate