Issue 26436: Add the regex-dna benchmark

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70623

classification

Title:	Add the regex-dna benchmark
Type:	enhancement	Stage:	patch review
Components:	Benchmarks, Regular Expressions	Versions:

process

Status:	closed	Resolution:	third party
Dependencies:		Superseder:
Assigned To:		Nosy List:	brett.cannon, ezio.melotti, mrabarnett, pitrou, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2016-02-25 12:17 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
bm_regex_dna.patch	serhiy.storchaka, 2016-02-25 12:17

Messages (12)
msg260854 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-02-25 12:17
Proposed patch adds the regex-dna benchmark from The Computer Language Benchmarks Game (http://benchmarksgame.alioth.debian.org/). This is artificial but well known benchmark. The patch is based on regex-dna Python 3 #5 program and fasta Python 3 #3 program (for generating input).
msg260872 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2016-02-25 17:42
I assume the code looks idiomatic to you? And out of curiosity, what does the performance look like between something 3.5 and default?
msg260873 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2016-02-25 17:43
Oh, and how long does an execution take?
msg260932 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-02-27 08:40
> I assume the code looks idiomatic to you? Sorry, I have not understood your question. Could you please reformulate? The performance of all Python versions is rough the same. 2.7 is about 8% slower than 3.2 and 3.3, 3.4-default are about 3-4% slower than 3.2 and 3.3. I have taken input data size such that the regex-dna benchmark runs rough the same time as the slowest regex benchmark regex-compile (0.7 sec per iteration on my computer, about a minute with default options). This is 1/50 of the size used in The Computer Language Benchmarks Game. Since the benchmark generates input data, its size can easily be changed. Needed only update control sums.
msg260938 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-02-27 10:24
I believe Brett is asking whether the code looks like the sort of Python code that one of us might write, as opposed to 'language X in Python'. In my quick perusal, As far as I looked, I would say yes, except for using floats and while instead of int and for because the former are supposedly faster. (See the loop in the middle of random_fasta.) So do we want a benchmark micro-optimized for CSF's system or written 'normally' (with for, int, and range). I did not notice any PEP 8 style violations.
msg260942 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2016-02-27 16:54
Terry's right about what I meant; is the code of such quality that you would let it into the stdlib? As for execution time, I would vote for increasing the input size to take 1 second as it's just going to get faster and faster just from CPU improvements alone.
msg260943 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-02-27 16:57
I would add another kind of question: is it stressing something useful that isn't already stressed by the two other regex benchmarks we already have? Given that it seems built around a highly-specialized use case (DNA matching?) and we don't even know if regular expressions are actually the tool of choice in the field (unless someone here is a specialist), I'm rather skeptical. (in general, everything coming the "Computer Language Benchmarks Game" should be taken with a grain of salt IMHO: it's mostly people wanting to play writing toy programs)
msg260950 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-02-27 23:10
DNA matching can be done with difflib. Serious high-volume work should use compiled specialized matchers and aligners. This particular benchmark, explained a bit at https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna, manipulates and searches standard FASTA format representations of sequences with the regex available in each language. (The site has another Python implementation at https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=python3&id=1. It uses unicode strings rather than bytes, and multiprocessing.Pool to run re.findall in parallel.) FASTA uses lowercase a,c,g,t for known bases and at least 11 uppercase letters for subsets of bases representing partially known bases. The third task is to expand upper case letters to subsets of lowercase letters. Since the rules requires use of re and one substitution at a time, the 2 Python programs run re.sub over the current sequence 11 times. More idiomatic for Python, and probably faster, would be to use seq.replace(old,new) instead. Perhaps even more idiomatic and probably faster still, would be to use str.translate, as in this reduced example. >>> table = {ord('B') : '(c\|g\|t)', ord('D') : '(a\|g\|t)'} >>> 'aBcDg'.translate(table) 'a(c\|g\|t)c(a\|g\|t)g'
msg261046 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-03-01 10:02
I used the code from fasta and regex-dna tests almost without changes. I.e. one part create the data in standard FASTA format (with 60-character lines and headers), and other part parses this format. The code can be simple if generate and consume raw data. As for the quality of the code, tested code is pretty simple and enough pythonic. Yes, using replace() is more idiomatic and faster, but we are testing regular expressions. bytes.translate() doesn't work with dict, and str.translate() is slower than replace() or re.sub(). The code for generating test data is not the kind of the code that should be used in tutorials. It is highly optimized code that uses different optimization tricks that could be hard to understand without comments. But nothing unpythonic. It can be simplified if avoid formatting the data in standard FASTA format. > I would add another kind of question: is it stressing something useful that isn't already stressed by the two other regex benchmarks we already have? Yes, it is. The regex_v8 benchmark is 2x faster with regex than with re. But the regex_dna benchmark is 1.6x slower with regex than with re. Thus these tests are stressing different aspects of regular expressions. It may be worth also to test regular expressions with unicode strings. I expect some difference with latest Python and earlier 3.x and 2.7. The question is how to do this? Add a special option to switch between bytes and unicode (as --force_bytes in regex_effbot), or just run tests for bytes and unicode sequentially and add results?
msg276219 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-13 09:06
Serhiy: Can you please open a pull request on the new performance module? https://github.com/python/performance
msg278889 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-10-18 15:03
I created https://github.com/python/performance/issues/15
msg278908 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-10-18 16:46
The performance project is now hosted on GitHub. I created a pull request from Serhiy's patch: https://github.com/python/performance/pull/17 So I now close this issue. Let's continue the discussion there.

History
Date	User	Action	Args
2022-04-11 14:58:28	admin	set	github: 70623
2016-10-18 16:46:41	vstinner	set	status: open -> closed resolution: third party messages: + msg278908
2016-10-18 15:03:11	vstinner	set	messages: + msg278889
2016-10-16 22:17:36	serhiy.storchaka	set	nosy: + ezio.melotti, mrabarnett components: + Regular Expressions
2016-09-13 09:06:08	vstinner	set	nosy: + vstinner messages: + msg276219
2016-03-01 10:02:30	serhiy.storchaka	set	messages: + msg261046
2016-02-27 23:10:06	terry.reedy	set	messages: + msg260950
2016-02-27 16:57:22	pitrou	set	messages: + msg260943
2016-02-27 16:54:01	brett.cannon	set	messages: + msg260942
2016-02-27 10:24:07	terry.reedy	set	nosy: + terry.reedy messages: + msg260938
2016-02-27 08:40:27	serhiy.storchaka	set	messages: + msg260932
2016-02-25 17:43:17	brett.cannon	set	messages: + msg260873
2016-02-25 17:42:51	brett.cannon	set	messages: + msg260872
2016-02-25 12:17:37	serhiy.storchaka	create