Issue15026
Created on 2012-06-07 13:56 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
encode-utf16.patch | serhiy.storchaka, 2012-06-07 13:56 | review | ||
encode-utf16-2.patch | serhiy.storchaka, 2012-06-15 19:35 | review |
Messages (11) | |||
---|---|---|---|
msg162473 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-07 13:56 | |
As a companion to issue14624, here is a patch that speeds up UTF-16 encoding several times over. In addition, it fixes an unsafe integer overflow check.
Here are the benchmark results. See the benchmark tools in the https://bitbucket.org/storchaka/cpython-stuff repository.
On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:
Py2.7 Py3.2 Py3.3 patched
457 (+575%) 458 (+573%) 1077 (+186%) 3083 encode utf-16le 'A'*10000
457 (+579%) 493 (+529%) 1084 (+186%) 3102 encode utf-16le '\x80'*10000
489 (+534%) 458 (+577%) 1081 (+187%) 3102 encode utf-16le '\x80'+'A'*9999
457 (+1261%) 493 (+1161%) 1116 (+457%) 6219 encode utf-16le '\u0100'*10000
489 (+1266%) 458 (+1358%) 1126 (+493%) 6678 encode utf-16le '\u0100'+'A'*9999
489 (+1263%) 458 (+1355%) 1129 (+490%) 6666 encode utf-16le '\u0100'+'\x80'*9999
457 (+1240%) 493 (+1142%) 1118 (+448%) 6125 encode utf-16le '\u8000'*10000
489 (+1271%) 458 (+1363%) 1127 (+495%) 6702 encode utf-16le '\u8000'+'A'*9999
489 (+1271%) 458 (+1364%) 1129 (+494%) 6705 encode utf-16le '\u8000'+'\x80'*9999
489 (+1135%) 458 (+1218%) 1136 (+432%) 6038 encode utf-16le '\u8000'+'\u0100'*9999
498 (+128%) 505 (+125%) 630 (+80%) 1137 encode utf-16le '\U00010000'*10000
489 (+35%) 458 (+44%) 360 (+83%) 659 encode utf-16le '\U00010000'+'A'*9999
489 (+35%) 458 (+44%) 359 (+84%) 660 encode utf-16le '\U00010000'+'\x80'*9999
489 (+36%) 458 (+45%) 361 (+84%) 663 encode utf-16le '\U00010000'+'\u0100'*9999
489 (+36%) 458 (+45%) 361 (+84%) 663 encode utf-16le '\U00010000'+'\u8000'*9999
447 (+507%) 493 (+450%) 1086 (+150%) 2712 encode utf-16be 'A'*10000
447 (+513%) 493 (+456%) 1080 (+154%) 2739 encode utf-16be '\x80'*10000
489 (+458%) 458 (+496%) 1079 (+153%) 2729 encode utf-16be '\x80'+'A'*9999
447 (+498%) 494 (+441%) 1118 (+139%) 2672 encode utf-16be '\u0100'*10000
489 (+464%) 458 (+502%) 1128 (+144%) 2756 encode utf-16be '\u0100'+'A'*9999
489 (+463%) 458 (+502%) 1131 (+144%) 2755 encode utf-16be '\u0100'+'\x80'*9999
447 (+500%) 493 (+444%) 1119 (+139%) 2680 encode utf-16be '\u8000'*10000
489 (+463%) 458 (+502%) 1126 (+145%) 2755 encode utf-16be '\u8000'+'A'*9999
489 (+464%) 458 (+502%) 1129 (+144%) 2757 encode utf-16be '\u8000'+'\x80'*9999
489 (+479%) 458 (+518%) 1137 (+149%) 2829 encode utf-16be '\u8000'+'\u0100'*9999
499 (+102%) 506 (+99%) 630 (+60%) 1009 encode utf-16be '\U00010000'*10000
489 (+6%) 458 (+13%) 360 (+44%) 519 encode utf-16be '\U00010000'+'A'*9999
489 (+6%) 458 (+13%) 359 (+44%) 518 encode utf-16be '\U00010000'+'\x80'*9999
489 (+6%) 458 (+13%) 361 (+44%) 519 encode utf-16be '\U00010000'+'\u0100'*9999
489 (+6%) 458 (+13%) 361 (+44%) 519 encode utf-16be '\U00010000'+'\u8000'*9999 |
|||
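A rough way to run comparable measurements with only the standard library is sketched below. This is not the benchmark tool from the linked repository: the `bench` helper, the `CASES` subset and the microseconds-per-call unit are assumptions made for illustration, so the output is not directly comparable with the figures above.

```python
import timeit

# Test strings modeled on a subset of the rows above (chosen for illustration).
CASES = {
    "'A'*10000": 'A' * 10000,
    "'\\x80'*10000": '\x80' * 10000,
    "'\\u0100'*10000": '\u0100' * 10000,
    "'\\U00010000'*10000": '\U00010000' * 10000,
}


def bench(text, encoding, number=1000, repeat=3):
    """Return the best observed time, in seconds, of one text.encode(encoding) call."""
    timer = timeit.Timer(lambda: text.encode(encoding))
    # Take the minimum over several runs to reduce the effect of fluctuations.
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for encoding in ("utf-16le", "utf-16be"):
        for label, text in CASES.items():
            seconds = bench(text, encoding)
            print(f"{seconds * 1e6:10.2f} us  encode {encoding} {label}")
```

Running the script under two interpreter builds (for example, before and after a patch) gives one column per build.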
msg162701 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2012-06-13 09:37 | |
Here are results under 64-bit Linux on a Core i5-2500K:
3.3 patched
3327 (+360%) 15304 encode utf-16le 'A'*10000
3314 (+335%) 14413 encode utf-16le '\x80'*10000
3315 (+578%) 22472 encode utf-16le '\x80'+'A'*9999
2390 (+668%) 18345 encode utf-16le '\u0100'*10000
2390 (+668%) 18364 encode utf-16le '\u0100'+'A'*9999
2324 (+684%) 18219 encode utf-16le '\u0100'+'\x80'*9999
2385 (+664%) 18227 encode utf-16le '\u8000'*10000
2390 (+669%) 18383 encode utf-16le '\u8000'+'A'*9999
2390 (+663%) 18232 encode utf-16le '\u8000'+'\x80'*9999
2385 (+601%) 16708 encode utf-16le '\u8000'+'\u0100'*9999
1601 (-4%) 1542 encode utf-16le '\U00010000'*10000
1209 (+20%) 1448 encode utf-16le '\U00010000'+'A'*9999
1210 (+20%) 1447 encode utf-16le '\U00010000'+'\x80'*9999
1209 (+20%) 1446 encode utf-16le '\U00010000'+'\u0100'*9999
1209 (+20%) 1446 encode utf-16le '\U00010000'+'\u8000'*9999
3237 (+562%) 21422 encode utf-16be 'A'*10000
3294 (+500%) 19779 encode utf-16be '\x80'*10000
3290 (+357%) 15036 encode utf-16be '\x80'+'A'*9999
2382 (+209%) 7354 encode utf-16be '\u0100'*10000
2381 (+208%) 7342 encode utf-16be '\u0100'+'A'*9999
2377 (+209%) 7347 encode utf-16be '\u0100'+'\x80'*9999
2382 (+207%) 7317 encode utf-16be '\u8000'*10000
2381 (+208%) 7343 encode utf-16be '\u8000'+'A'*9999
2376 (+209%) 7343 encode utf-16be '\u8000'+'\x80'*9999
2377 (+206%) 7281 encode utf-16be '\u8000'+'\u0100'*9999
1598 (-42%) 930 encode utf-16be '\U00010000'*10000
1208 (+19%) 1436 encode utf-16be '\U00010000'+'A'*9999
1208 (+19%) 1436 encode utf-16be '\U00010000'+'\x80'*9999
1205 (+19%) 1434 encode utf-16be '\U00010000'+'\u0100'*9999
1205 (+19%) 1433 encode utf-16be '\U00010000'+'\u8000'*9999 |
|||
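The thread does not spell out how the parenthesized percentages are computed. They appear consistent with the relative gain of the patched figure over the figure they follow; the sketch below checks that reading against a few rows (`gain_percent` is a hypothetical helper, and the interpretation is an inference from the numbers, not something the authors state).

```python
def gain_percent(old, patched):
    """Relative change of the patched figure versus the old one, as a whole percent."""
    return round((patched / old - 1) * 100)


# Rows taken from the tables in this issue: (old figure, patched figure).
print(gain_percent(3327, 15304))  # 3.3 vs patched, utf-16le 'A'*10000
print(gain_percent(1598, 930))    # 3.3 vs patched, utf-16be '\U00010000'*10000
print(gain_percent(457, 3083))    # Py2.7 vs patched, utf-16le 'A'*10000
```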
msg162822 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-14 20:29 | |
Thank you, Antoine.
> 3327 (+360%) 15304 encode utf-16le 'A'*10000
> 3314 (+335%) 14413 encode utf-16le '\x80'*10000
> 3290 (+357%) 15036 encode utf-16be '\x80'+'A'*9999
This must be a fluctuation (-30-40%). The same code is used for all UCS1 strings.
> 1598 (-42%) 930 encode utf-16be '\U00010000'*10000
This is most likely a fluctuation too. The code for non-BMP characters differs from the code for other characters in a UCS4 string, but a 1.5x difference is unlikely. Is this result reproducible?
On 32-bit Linux, Intel Atom N570 @ 1.66GHz:
Py2.7 Py3.2 Py3.3 patched
273 (+229%) 274 (+227%) 333 (+169%) 897 encode utf-16le 'A'*10000
274 (+226%) 275 (+225%) 334 (+168%) 894 encode utf-16le '\x80'*10000
274 (+231%) 275 (+230%) 334 (+172%) 908 encode utf-16le '\x80'+'A'*9999
273 (+752%) 275 (+746%) 276 (+743%) 2326 encode utf-16le '\u0100'*10000
274 (+695%) 275 (+692%) 276 (+689%) 2177 encode utf-16le '\u0100'+'A'*9999
274 (+739%) 275 (+736%) 276 (+733%) 2300 encode utf-16le '\u0100'+'\x80'*9999
274 (+739%) 275 (+736%) 276 (+733%) 2298 encode utf-16le '\u8000'*10000
274 (+697%) 274 (+697%) 276 (+691%) 2184 encode utf-16le '\u8000'+'A'*9999
274 (+741%) 274 (+741%) 277 (+731%) 2303 encode utf-16le '\u8000'+'\x80'*9999
274 (+770%) 275 (+767%) 276 (+764%) 2384 encode utf-16le '\u8000'+'\u0100'*9999
279 (+51%) 279 (+51%) 217 (+94%) 422 encode utf-16le '\U00010000'*10000
274 (+6%) 274 (+6%) 162 (+79%) 290 encode utf-16le '\U00010000'+'A'*9999
274 (+6%) 274 (+6%) 162 (+79%) 290 encode utf-16le '\U00010000'+'\x80'*9999
273 (+5%) 275 (+5%) 162 (+78%) 288 encode utf-16le '\U00010000'+'\u0100'*9999
274 (+5%) 275 (+5%) 162 (+78%) 288 encode utf-16le '\U00010000'+'\u8000'*9999
274 (+152%) 275 (+151%) 334 (+107%) 690 encode utf-16be 'A'*10000
274 (+154%) 275 (+153%) 334 (+109%) 697 encode utf-16be '\x80'*10000
274 (+152%) 275 (+151%) 333 (+108%) 691 encode utf-16be '\x80'+'A'*9999
274 (+146%) 275 (+145%) 276 (+145%) 675 encode utf-16be '\u0100'*10000
274 (+146%) 275 (+145%) 276 (+145%) 675 encode utf-16be '\u0100'+'A'*9999
274 (+145%) 275 (+144%) 276 (+143%) 671 encode utf-16be '\u0100'+'\x80'*9999
274 (+145%) 275 (+144%) 276 (+143%) 672 encode utf-16be '\u8000'*10000
275 (+147%) 275 (+147%) 276 (+146%) 680 encode utf-16be '\u8000'+'A'*9999
274 (+146%) 275 (+145%) 276 (+144%) 674 encode utf-16be '\u8000'+'\x80'*9999
275 (+143%) 275 (+143%) 276 (+142%) 667 encode utf-16be '\u8000'+'\u0100'*9999
279 (+26%) 279 (+26%) 217 (+62%) 351 encode utf-16be '\U00010000'*10000
274 (-2%) 275 (-3%) 162 (+65%) 268 encode utf-16be '\U00010000'+'A'*9999
274 (-2%) 275 (-3%) 162 (+65%) 268 encode utf-16be '\U00010000'+'\x80'*9999
274 (-4%) 275 (-4%) 162 (+63%) 264 encode utf-16be '\U00010000'+'\u0100'*9999
274 (-3%) 275 (-4%) 162 (+64%) 265 encode utf-16be '\U00010000'+'\u8000'*9999 |
|||
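To illustrate why non-BMP characters need a separate code path, here is a minimal pure-Python sketch of UTF-16LE encoding. It is not the patched C code: `encode_utf16le` is a hypothetical helper, and unlike the built-in codec it does not reject lone surrogates. Code points below U+10000 become one 16-bit unit, while code points from U+10000 upward are split into a surrogate pair.

```python
def encode_utf16le(text):
    """Encode text as UTF-16LE without a BOM (illustrative, no error handling)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: a single 16-bit code unit.
            out += cp.to_bytes(2, "little")
        else:
            # Non-BMP code point: split the 20-bit offset into two surrogates.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            out += high.to_bytes(2, "little")
            out += low.to_bytes(2, "little")
    return bytes(out)


sample = 'A\x80\u0100\u8000\U00010000'
# The sketch agrees with the built-in codec on this sample.
print(encode_utf16le(sample) == sample.encode('utf-16le'))
```

A string made only of BMP characters takes the first branch for every character, which may be why the rows with `'\U00010000'` behave differently from the others in the tables.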
msg162924 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2012-06-15 17:34 | |
Serhiy, the tests crash here in debug mode:
$ ./python -m test -v test_unicode
== CPython 3.3.0a4+ (default:b17c8005e08a+, Jun 15 2012, 19:28:56) [GCC 4.5.2]
== Linux-2.6.38.8-desktop-10.mga-x86_64-with-mandrake-1-Official little-endian
== /home/antoine/cpython/default/build/test_python_2567
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1)
[1/1] test_unicode
test_formatter_field_name_split (test.test_unicode.StringModuleTest) ... ok
test_formatter_parser (test.test_unicode.StringModuleTest) ... ok
test___contains__ (test.test_unicode.UnicodeTest) ... ok
test_additional_rsplit (test.test_unicode.UnicodeTest) ... ok
test_additional_split (test.test_unicode.UnicodeTest) ... ok
test_ascii (test.test_unicode.UnicodeTest) ... ok
test_aswidechar (test.test_unicode.UnicodeTest) ... ok
test_aswidecharstring (test.test_unicode.UnicodeTest) ... ok
test_bug1001011 (test.test_unicode.UnicodeTest) ... ok
test_bytes_comparison (test.test_unicode.UnicodeTest) ... ok
test_capitalize (test.test_unicode.UnicodeTest) ... ok
test_casefold (test.test_unicode.UnicodeTest) ... ok
test_center (test.test_unicode.UnicodeTest) ... ok
test_codecs (test.test_unicode.UnicodeTest) ... python: Objects/unicodeobject.c:5401: _PyUnicode_EncodeUTF16: Assertion `(Py_uintptr_t)(((((((((PyObject*)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)) ? (void) (0) : __assert_fail ("((((((PyObject*)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)", "Objects/unicodeobject.c", 5401, __PRETTY_FUNCTION__)), (((PyBytesObject *)(v))->ob_sval)) & 1 == 0' failed.
Fatal Python error: Aborted

Current thread 0x00007faa4980e700:
  File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1443 in test_codecs
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 385 in _executeTestPart
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 440 in run
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 492 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/runner.py", line 168 in run
  File "/home/antoine/cpython/default/Lib/test/support.py", line 1383 in _run_suite
  File "/home/antoine/cpython/default/Lib/test/support.py", line 1417 in run_unittest
  File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1954 in test_main
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 1237 in runtest_inner
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 918 in runtest
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 710 in main
  File "/home/antoine/cpython/default/Lib/test/__main__.py", line 13 in <module>
  File "/home/antoine/cpython/default/Lib/runpy.py", line 75 in _run_code
  File "/home/antoine/cpython/default/Lib/runpy.py", line 162 in _run_module_as_main
Abandon |
|||
msg162929 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-15 19:35 | |
> Serhiy, the tests crash here in debug mode:
My fault. It is an operator precedence issue in the assert expression. GCC warns about it:
Objects/unicodeobject.c: In function ‘_PyUnicode_EncodeUTF16’:
Objects/unicodeobject.c:5401: warning: suggest parentheses around comparison in operand of ‘&’
Here is a fixed patch. |
|||
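The failed assertion in the log ends with `& 1 == 0`. In C, `==` binds more tightly than `&`, so an expression of the form `address & 1 == 0` is grouped as `address & (1 == 0)`, which is always zero. The sketch below emulates both groupings with explicit parentheses, because Python's own precedence differs from C's here (in Python, `&` binds more tightly than `==`). The function names are hypothetical, and the "intended" form is an assumption based on the gcc warning; the text of the fixed patch is not shown in this thread.

```python
def check_as_c_parses_it(address):
    """Emulate C's grouping: address & (1 == 0). In C, (1 == 0) evaluates to 0."""
    return (address & int(1 == 0)) != 0


def check_as_intended(address):
    """Presumed intent: the address is even, that is, 2-byte aligned."""
    return (address & 1) == 0


for address in (0x1000, 0x1001):
    print(hex(address), check_as_c_parses_it(address), check_as_intended(address))
```

With C's grouping the check is false for every address, which matches an assertion that fails as soon as a debug build reaches it.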
msg162930 - (view) | Author: Roundup Robot (python-dev) | Date: 2012-06-15 20:18 | |
New changeset acca141fda80 by Antoine Pitrou in branch 'default': Issue #15026: utf-16 encoding is now significantly faster (up to 10x). http://hg.python.org/cpython/rev/acca141fda80 |
|||
msg162931 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2012-06-15 20:19 | |
Thank you for the quick turnaround! The patch is now pushed in 3.3. |
|||
msg162933 - (view) | Author: STINNER Victor (vstinner) * | Date: 2012-06-15 20:21 | |
It would be nice to mention the improvement in the What's New in Python 3.3 doc (Optimizations section). |
|||
msg162934 - (view) | Author: Roundup Robot (python-dev) | Date: 2012-06-15 20:25 | |
New changeset 35667fc5f785 by Antoine Pitrou in branch 'default': Mention the UTF-16 encoding speedup in the whatsnew (issue #15026). http://hg.python.org/cpython/rev/35667fc5f785 |
|||
msg162960 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2012-06-16 08:43 | |
Thank you for pushing. :-) Are you interested in a faster UTF-32 codec? |
|||
msg162961 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2012-06-16 09:03 | |
> Thank you for pushing. :-) Are you interested in a faster UTF-32 codec? Not much :) I know you posted issues on that, but I think UTF-32 is quite low priority. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:31 | admin | set | github: 59231 |
2012-06-16 09:03:30 | pitrou | set | messages: + msg162961 |
2012-06-16 08:43:11 | serhiy.storchaka | set | messages: + msg162960 |
2012-06-15 20:25:25 | python-dev | set | messages: + msg162934 |
2012-06-15 20:21:43 | vstinner | set | messages: + msg162933 |
2012-06-15 20:19:14 | pitrou | set | status: open -> closed resolution: fixed messages: + msg162931 stage: resolved |
2012-06-15 20:18:32 | python-dev | set | nosy: + python-dev messages: + msg162930 |
2012-06-15 19:35:12 | serhiy.storchaka | set | files: + encode-utf16-2.patch messages: + msg162929 |
2012-06-15 17:34:47 | pitrou | set | messages: + msg162924 |
2012-06-14 20:29:52 | serhiy.storchaka | set | messages: + msg162822 |
2012-06-13 09:37:49 | pitrou | set | messages: + msg162701 |
2012-06-07 13:56:13 | serhiy.storchaka | create |