Issue 39150: See if PyToken_OneChar would be faster as a lookup table

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/83331

classification

Title:	See if PyToken_OneChar would be faster as a lookup table
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	pablogsal, petdance, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2019-12-29 01:45 by petdance, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (9)
msg358975 - (view)	Author: Andy Lester (petdance) *	Date: 2019-12-29 01:45
PyToken_OneChar in Parser/token.c is autogenerated. I suspect it may be faster and smaller if it were a lookup into a static table of ops rather than a switch statement. Check to see if it is.
msg358980 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2019-12-29 05:19
> PyToken_OneChar in Parser/token.c is autogenerated. I suspect it may be faster and smaller if it were a lookup into a static table of ops rather than a switch statement. Check to see if it is. I suspect that it will be marginally faster (because it won't need to do bound checks to go to the default case) but under PGO it will make almost no difference. You can try to see if there is any significant improvement using the pyperformance suite (https://pyperformance.readthedocs.io/) with and without your change (with PGO/LTO activated).
msg358982 - (view)	Author: Andy Lester (petdance) *	Date: 2019-12-29 06:19
Thank you. I appreciate the pointer.
msg359650 - (view)	Author: Andy Lester (petdance) *	Date: 2020-01-09 02:59
I tried out some experimenting with the lookup table vs. the switch statement. The relevant diff (not including the patches to the code generator) is: --- Parser/token.c +++ Parser/token.c @@ -77,31 +77,36 @@ int PyToken_OneChar(int c1) { - switch (c1) { - case '%': return PERCENT; - case '&': return AMPER; - case '(': return LPAR; - case ')': return RPAR; - case '': return STAR; - case '+': return PLUS; - case ',': return COMMA; - case '-': return MINUS; - case '.': return DOT; - case '/': return SLASH; - case ':': return COLON; - case ';': return SEMI; - case '<': return LESS; - case '=': return EQUAL; - case '>': return GREATER; - case '@': return AT; - case '[': return LSQB; - case ']': return RSQB; - case '^': return CIRCUMFLEX; - case '{': return LBRACE; - case '\|': return VBAR; - case '}': return RBRACE; - case '~': return TILDE; - } + static char op_lookup[] = { + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, PERCENT, AMPER, OP, + LPAR, RPAR, STAR, PLUS, COMMA, + MINUS, DOT, SLASH, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, COLON, SEMI, + LESS, EQUAL, GREATER, OP, AT, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, LSQB, OP, RSQB, CIRCUMFLEX, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, OP, OP, + OP, OP, OP, LBRACE, VBAR, + RBRACE, TILDE + }; + if (c1>=37 && c1<=126) + return op_lookup[c1]; return OP; } To test the speed change, I couldn't use pyperformance, because the only thing I wanted to time was the In my testing, I didn't use pyperformance because the only part of the code I wanted to test was the actual compilation of the code. My solution for this was to find the 100 largest .py files in the cpython repo and compile them like so: python -m py_compile $(List-of-big-*.py-files) The speedup was significant: My table-driven lookup ran the compile tests about 10% than the existing switch approach. That was without --enable-optimizations in my configure. However, as pablogsal suspected, with PGO enabled, the two approaches ran the code in pretty much the same speed. I do think that there may be merit in using a table-driven approach that generates less code and doesn't rely on PGO speeding things up. If anyone's interested, all my work is on branch Issue39150 in my fork petdance/cpython.
msg359663 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-01-09 10:31
I will take a look and try to reproduce the benchmarks result you mentioned, but if under PGO there is no substantial performance increase is much less appealing as the lookup table code is less clear and maintainable than the simple switch. Also you have a branch that is a good candidate to trigger the same branch mispredictions as the default of the switch (c1>=37 && c1<=126).
msg359664 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-01-09 10:36
The difference around 10% looks very strange. Tokenizing is just one of parts of the compiler, and unlikely it is a narrow way. There are input/output, many memory allocations, encoding and decoding, many iterations and recursions. Are you sure that you run benchmarks in the same environment, multiple times, and got stable results? Could you provide a script that reproduces this?
msg359684 - (view)	Author: Andy Lester (petdance) *	Date: 2020-01-09 15:09
Yes, I ran it multiple times on my 2013 Macbook Pro and got ~10% speedup. I also ran it on my Linux VM (that I only use for development) and got a speedup but less so. The code I used to run the tests is at: https://github.com/petdance/cpython/blob/Issue39150/timetests LOG=timelog.txt NRUNS=100 NFILES=100 if [ -x ./python.exe ] ; then PYTHON=./python.exe else PYTHON=./python fi # Build a file list, but throw out some that have syntax errors. # Use the 100 biggest files. FILES=$( find . -name '*.py' -type f -ls \ \| grep -v Tools/test2to3/ \ \| grep -v Lib/lib2to3/tests/ \ \| grep -v Lib/test/badsyntax \ \| grep -v Lib/test/bad_coding \ \| sort -r -n -k7 \ \| awk '{print $11}' \ \| head -n $NFILES ) echo "Compiling $NFILES files for $NRUNS iterations" rm -f $LOG for (( i=1; i<=$NRUNS; i++ )) do echo "Run $i" { time $PYTHON -m py_compile $FILES; } >> $LOG 2>&1 done perl -ne'if(/real\s+0m([\d.]+)s/){$t+=$1;$n++} END { printf "%d runs, average %0.4f\n", $n, $t/$n}' $LOG
msg359685 - (view)	Author: Andy Lester (petdance) *	Date: 2020-01-09 15:16
Re: branch prediction. The branch if (c1>=37 && c1<=126) could just as easily be if (c1>=0 && c1<=126) with no other changes to the code. It could be just if (c1<=126) if c1 wasn't signed. As far as branch prediction, we could make it something like if (c1>=0 && c1<=255) and expand the table lookup, depending on what the likely inputs are. I could play around with that some if anyone was interested. I'm not trying to push on this. Just sharing some thoughts and research.
msg361637 - (view)	Author: Andy Lester (petdance) *	Date: 2020-02-08 21:52
I'm closing this as it seems there's not much interest in this.

History
Date	User	Action	Args
2022-04-11 14:59:24	admin	set	github: 83331
2020-02-08 21:52:56	petdance	set	status: open -> closed messages: + msg361637 stage: resolved
2020-01-09 15:16:43	petdance	set	messages: + msg359685
2020-01-09 15:09:29	petdance	set	messages: + msg359684
2020-01-09 10:36:03	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg359664
2020-01-09 10:31:22	pablogsal	set	messages: + msg359663
2020-01-09 02:59:51	petdance	set	messages: + msg359650
2019-12-29 06:19:06	petdance	set	messages: + msg358982
2019-12-29 05:19:11	pablogsal	set	nosy: + pablogsal messages: + msg358980
2019-12-29 01:45:21	petdance	create