classification
Title: Add the regex-dna benchmark
Type: enhancement Stage: patch review
Components: Benchmarks, Regular Expressions Versions:
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: brett.cannon, ezio.melotti, mrabarnett, pitrou, serhiy.storchaka, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2016-02-25 12:17 by serhiy.storchaka, last changed 2016-10-18 16:46 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
bm_regex_dna.patch serhiy.storchaka, 2016-02-25 12:17
Messages (12)
msg260854 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-25 12:17
Proposed patch adds the regex-dna benchmark from The Computer Language Benchmarks Game (http://benchmarksgame.alioth.debian.org/). This is artificial but well known benchmark.

The patch is based on regex-dna Python 3 #5 program and fasta Python 3 #3 program (for generating input).
msg260872 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-02-25 17:42
I assume the code looks idiomatic to you?

And out of curiosity, what does the performance look like between something 3.5 and default?
msg260873 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-02-25 17:43
Oh, and how long does an execution take?
msg260932 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-27 08:40
> I assume the code looks idiomatic to you?

Sorry, I have not understood your question. Could you please reformulate?

The performance of all Python versions is rough the same. 2.7 is about 8% slower than 3.2 and 3.3, 3.4-default are about 3-4% slower than 3.2 and 3.3.

I have taken input data size such that the regex-dna benchmark runs rough the same time as the slowest regex benchmark regex-compile (0.7 sec per iteration on my computer, about a minute with default options). This is 1/50 of the size used in The Computer Language Benchmarks Game.

Since the benchmark generates input data, its size can easily be changed. Needed only update control sums.
msg260938 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-02-27 10:24
I believe Brett is asking whether the code looks like the sort of Python code that one of us might write, as opposed to 'language X in Python'.  In my quick perusal, As far as I looked, I would say yes, except for using floats and while instead of int and for because the former are supposedly faster.  (See the loop in the middle of random_fasta.)  So do we want a benchmark micro-optimized for CSF's system or written 'normally' (with for, int, and range). I did not notice any PEP 8 style violations.
msg260942 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-02-27 16:54
Terry's right about what I meant; is the code of such quality that you would let it into the stdlib?

As for execution time, I would vote for increasing the input size to take 1 second as it's just going to get faster and faster just  from CPU improvements alone.
msg260943 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-27 16:57
I would add another kind of question: is it stressing something useful that isn't already stressed by the two other regex benchmarks we already have?

Given that it seems built around a highly-specialized use case (DNA matching?) and we don't even know if regular expressions are actually the tool of choice in the field (unless someone here is a specialist), I'm rather skeptical.

(in general, everything coming the "Computer Language Benchmarks Game" should be taken with a grain of salt IMHO: it's mostly people wanting to play writing toy programs)
msg260950 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-02-27 23:10
DNA matching can be done with difflib.  Serious high-volume work should use compiled specialized matchers and aligners.

This particular benchmark, explained a bit at https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna, manipulates and searches standard FASTA format representations of sequences with the regex available in each language.  (The site has another Python implementation at https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=python3&id=1. It uses unicode strings rather than bytes, and multiprocessing.Pool to run re.findall in parallel.)

FASTA uses lowercase a,c,g,t for known bases and at least 11 uppercase letters for subsets of bases representing partially known bases.  The third task is to expand upper case letters to subsets of lowercase letters.  Since the rules requires use of re and one substitution at a time, the 2 Python programs run re.sub over the current sequence 11 times.  More idiomatic for Python, and probably faster, would be to use seq.replace(old,new) instead.  Perhaps even more idiomatic and probably faster still, would be to use str.translate, as in this reduced example.

>>> table = {ord('B') : '(c|g|t)', ord('D') : '(a|g|t)'}
>>> 'aBcDg'.translate(table)
'a(c|g|t)c(a|g|t)g'
msg261046 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-01 10:02
I used the code from fasta and regex-dna tests almost without changes. I.e. one part create the data in standard FASTA format (with 60-character lines and headers), and other part parses this format. The code can be simple if generate and consume raw data.

As for the quality of the code, tested code is pretty simple and enough pythonic. Yes, using replace() is more idiomatic and faster, but we are testing regular expressions. bytes.translate() doesn't work with dict, and str.translate() is slower than replace() or re.sub().

The code for generating test data is not the kind of the code that should be used in tutorials. It is highly optimized code that uses different optimization tricks that could be hard to understand without comments. But nothing unpythonic. It can be simplified if avoid formatting the data in standard FASTA format.

> I would add another kind of question: is it stressing something useful that isn't already stressed by the two other regex benchmarks we already have?

Yes, it is. The regex_v8 benchmark is 2x faster with regex than with re. But the regex_dna benchmark is 1.6x slower with regex than with re. Thus these tests are stressing different aspects of regular expressions.

It may be worth also to test regular expressions with unicode strings. I expect some difference with latest Python and earlier 3.x and 2.7. The question is how to do this? Add a special option to switch between bytes and unicode (as --force_bytes in regex_effbot), or just run tests for bytes and unicode sequentially and add results?
msg276219 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-13 09:06
Serhiy: Can you please open a pull request on the new performance module? https://github.com/python/performance
msg278889 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 15:03
I created https://github.com/python/performance/issues/15
msg278908 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 16:46
The performance project is now hosted on GitHub. I created a pull request from Serhiy's patch:
https://github.com/python/performance/pull/17

So I now close this issue. Let's continue the discussion there.
History
Date User Action Args
2016-10-18 16:46:41vstinnersetstatus: open -> closed
resolution: third party
messages: + msg278908
2016-10-18 15:03:11vstinnersetmessages: + msg278889
2016-10-16 22:17:36serhiy.storchakasetnosy: + ezio.melotti, mrabarnett
components: + Regular Expressions
2016-09-13 09:06:08vstinnersetnosy: + vstinner
messages: + msg276219
2016-03-01 10:02:30serhiy.storchakasetmessages: + msg261046
2016-02-27 23:10:06terry.reedysetmessages: + msg260950
2016-02-27 16:57:22pitrousetmessages: + msg260943
2016-02-27 16:54:01brett.cannonsetmessages: + msg260942
2016-02-27 10:24:07terry.reedysetnosy: + terry.reedy
messages: + msg260938
2016-02-27 08:40:27serhiy.storchakasetmessages: + msg260932
2016-02-25 17:43:17brett.cannonsetmessages: + msg260873
2016-02-25 17:42:51brett.cannonsetmessages: + msg260872
2016-02-25 12:17:37serhiy.storchakacreate