Faster utf-8 decoding #58859
Comments
The utf-8 decoder is already well optimized. I propose a patch that accelerates it even further for some frequent cases (+10-30%). In particular, decoding of 2-byte non-latin1 codes gets about +30%. This is not the final result of the optimization. It may be possible to optimize the decoding of ascii and mostly-ascii text (up to the speed of memcpy) and of text with occasional errors, and to reduce code duplication. But I'm not sure of success. Related issues: |
Here are the results of benchmarking (numbers in MB/s).
On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:
utf-8 'A'*10000 191 (+790%) 1170 (+45%) 1664 (+2%) 1700
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:
utf-8 'A'*10000 117 (+414%) 349 (+72%) 597 (+1%) 601
The results are mixed (a gain everywhere, but of different sizes). I |
64-bit Linux, Intel Core i5-2500K CPU @ 3.30GHz:
utf-8 'A'*10000 6668 (+7%) 7145 |
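The benchmark script behind these numbers is not included in the thread. As a rough sketch of how a throughput figure in MB/s could be measured for the inputs quoted here, something like the following would work. The helper name `decode_throughput`, the repeat counts, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration, not details taken from the original benchmark.

```python
import timeit


def decode_throughput(text, encoding="utf-8", repeat=5, number=1000):
    """Return the best observed decoding speed for `text`, in MB/s.

    Hypothetical helper: 1 MB is taken to be 10**6 bytes here.
    """
    data = text.encode(encoding)
    # Take the fastest of several runs to reduce noise from other processes.
    best = min(
        timeit.repeat(lambda: data.decode(encoding), repeat=repeat, number=number)
    )
    return len(data) * number / best / 1e6


if __name__ == "__main__":
    cases = [
        ("'A'*10000", "A" * 10000),                  # pure ASCII input
        ("'A'*9999+'\\u0100'", "A" * 9999 + "\u0100"),  # ASCII plus one 2-byte code
    ]
    for label, text in cases:
        print(f"utf-8 {label}: {decode_throughput(text):.0f} MB/s")
```

Absolute numbers from such a script depend heavily on the CPU, the compiler, and the Python build, which is consistent with the spread between the machines reported in this thread.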
Hum, the patch doesn't look very interesting if it only optimizes one
|
Thank you, Antoine. These are interesting results, that on 64 bits greatly
Here is a patch that relies on a risky trick with signed numbers. For me, |
I'm -1 on using signed char in the implementation. If this gives any advantage, it's because the compiler is not able to generate as efficient code for unsigned char as it does for signed char. So the performance results may again change if you switch compilers, or use the next compiler version. The code should do what is *logically* correct; IMO, UTF-8 is really a sequence of unsigned bytes, conceptually. So if you want to demonstrate any performance improvements, you need to do so with unsigned chars. |
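The patch under discussion is not reproduced in this thread, so the following is only an illustration of the argument, not the code that was reviewed. It assumes the "signed" trick amounts to testing whether a byte is negative when reinterpreted as a signed 8-bit value, and shows that the same test, and the full lead-byte classification, can be written on unsigned byte values. The helper names `as_signed` and `utf8_lead_length` are hypothetical.

```python
def as_signed(byte):
    """Reinterpret an unsigned byte value (0..255) as a signed 8-bit value."""
    return byte - 256 if byte >= 0x80 else byte


def utf8_lead_length(byte):
    """Return the UTF-8 sequence length started by `byte` (0..255).

    Returns 0 if the byte cannot start a sequence.
    """
    if byte < 0x80:
        return 1  # ASCII
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    return 0  # continuation byte (0x80..0xBF) or invalid lead byte


data = "A\u0100".encode("utf-8")  # the bytes 0x41, 0xC4, 0x80

# Bytes are unsigned here: every value is in range(256).
print([utf8_lead_length(b) for b in data])

# "Negative as a signed char" and "high bit set" select the same bytes.
print(all((as_signed(b) < 0) == bool(b & 0x80) for b in range(256)))
```

Written this way, the non-ASCII test is expressed on the unsigned values directly and does not depend on how a particular compiler treats signed characters.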
I completely agree with you, for these and for other not mentioned |
Here are two new patches. The first one takes into account the Martin
On the Intel Atom, the last patch wipes out the speedup in some cases:
utf-8 'A'*9999+'\u0100' 124 (+8%) 288 (-53%) 134
On the AMD Athlon there is no noticeable effect. |
Éric, there is already an issue (bpo-4868) with this title. |
There is nothing wrong with two issues having the same title. Of course, it would be best if the title reflected the *actual* defect or change, such as "specialize UTF-8 decoding by character width", or some such. In any case, the title change is desirable since the original title was ungrammatical. If you wanted to point out that this really is an augmented, escalated rise, then "Even faster utf-8 decoding", "amazingly faster UTF-8 decoding", or "unbelievably faster utf-8 decoding" could have worked :-) |
Thank you, Martin, this is what I had in mind. Lost in translation. ;) |
64-bit Linux, Intel Core i5-2500K CPU @ 3.30GHz:
utf-8 'A'*10000 6931 (+3%) 7115 (+0%) 7117 |
Well, it seems, 64-bit processors are smart enough to not feel the need I am now working on a more advanced optimization, which now shows a gain |
I'll be closing this issue at this point. Serhiy: I don't think the bug tracker should be used to evolve work in progress (except when responding to reviews received). Use a Mercurial clone for that instead. By posting a patch here, you are requesting that it be reviewed and considered - please understand that you consume a lot of people's time by such a posting. At this point, it appears that you don't intend to submit any of these patches for inclusion into Python. If you ever do want to contribute something in this area, please create a new issue. |
That's not very nice. If Serhiy wants feedback on his work, he |
I completely disagree (and I really tried to be nice). It is my utmost belief that the tracker must not be used for OTOH, discussing it on python-dev indeed seems more appropriate: However, it would really be best in this case if Serhiy takes a step He may come to the conclusion that further improvement isn't really |
I understand Martin's point, but I think 95% of issues in the bug tracker are "work in progress", mine included. Maybe the issue is that Serhiy hasn't made a concrete proposal to be tested & integrated. It seems to be more of an exploratory work. I am in the nosy list because I am interested in this work. |
Martin, sorry to have wasted your time. I understand that you are busy,
I'm at a loss. What causes such an impression? I am quickly reacting to the |
See bpo-14738 for advanced optimization. |