
Author terry.reedy
Recipients brett.cannon, pitrou, serhiy.storchaka, terry.reedy
Date 2016-02-27.23:10:05
Message-id <1456614606.17.0.631973936078.issue26436@psf.upfronthosting.co.za>
In-reply-to
Content
DNA matching can be done with difflib.  Serious high-volume work should use compiled specialized matchers and aligners.
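
As a rough illustration of the difflib remark, here is a minimal sketch that finds the longest run shared by two short sequences. The helper name longest_shared_run and the sample sequences are made up for this example, and autojunk=False is an assumption chosen so that frequent bases are not discarded as "popular" elements on longer inputs.

```python
from difflib import SequenceMatcher


def longest_shared_run(a: str, b: str) -> str:
    """Return the longest contiguous block common to a and b."""
    # autojunk=False keeps difflib from ignoring very frequent characters,
    # which matters for a 4-letter alphabet on sequences of 200+ items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Hypothetical sample sequences, not taken from the benchmark input.
seq1 = "acgtacgtgg"
seq2 = "ttacgtacca"

shared = longest_shared_run(seq1, seq2)
similarity = SequenceMatcher(None, seq1, seq2, autojunk=False).ratio()
print(shared, similarity)
```

This is pure Python and scales poorly, which is why compiled aligners remain the better choice for real workloads.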

This particular benchmark, explained a bit at https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna, manipulates and searches standard FASTA-format representations of sequences with the regex engine available in each language.  (The site has another Python implementation at https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=python3&id=1. It uses Unicode strings rather than bytes, and multiprocessing.Pool to run re.findall in parallel.)
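
To make the kind of work concrete, here is a small sketch of the "clean up, then search" part on bytes. The in-memory FASTA sample is invented for illustration, and the single pattern shown is only an example of a sequence paired with its reverse complement; it is not meant as a faithful copy of the benchmark program.

```python
import re

# Hypothetical FASTA input: header lines start with '>', sequence lines follow.
fasta = (
    b">ONE example header\n"
    b"agggtaaacc\n"
    b"tttaccctgg\n"
    b">TWO another header\n"
    b"cagggtaaat\n"
)

# Remove header lines and all newlines so matches can span line breaks.
seq = re.sub(rb">.*\n|\n", b"", fasta)

# Count occurrences of a pattern or its reverse complement.
pattern = rb"agggtaaa|tttaccct"
count = len(re.findall(pattern, seq))

print(len(fasta), len(seq), count)
```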

FASTA uses lowercase a, c, g, t for known bases and at least 11 uppercase letters for subsets of bases representing partially known bases.  The third task is to expand uppercase letters to subsets of lowercase letters.  Since the rules require the use of re and one substitution at a time, the 2 Python programs run re.sub over the current sequence 11 times.  More idiomatic for Python, and probably faster, would be to use seq.replace(old, new) instead.  Perhaps even more idiomatic, and probably faster still, would be to use str.translate, as in this reduced example.

>>> table = {ord('B') : '(c|g|t)', ord('D') : '(a|g|t)'}
>>> 'aBcDg'.translate(table)
'a(c|g|t)c(a|g|t)g'
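
For comparison, the three approaches mentioned above can be put side by side. This is only a sketch using the same reduced two-entry mapping; the function names are invented here, and no timing claims are made since relative speed would have to be measured on real input.

```python
import re

# Reduced mapping from the example above; the benchmark uses 11 such codes.
subs = {"B": "(c|g|t)", "D": "(a|g|t)"}


def expand_with_re(seq: str) -> str:
    # One re.sub pass per code, as the benchmark rules require.
    for old, new in subs.items():
        seq = re.sub(old, new, seq)
    return seq


def expand_with_replace(seq: str) -> str:
    # One str.replace pass per code. Safe to chain here because the
    # replacements contain no uppercase letters that a later pass could hit.
    for old, new in subs.items():
        seq = seq.replace(old, new)
    return seq


def expand_with_translate(seq: str) -> str:
    # A single pass over the string; str.maketrans accepts a dict whose
    # keys are single characters and whose values may be longer strings.
    return seq.translate(str.maketrans(subs))


sample = "aBcDgBB"
results = [f(sample) for f in (expand_with_re, expand_with_replace, expand_with_translate)]
print(results)
```

All three give the same result on this input; translate differs in that it scans the sequence once rather than once per code.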
History
Date                 User         Action  Args
2016-02-27 23:10:06  terry.reedy  set     recipients: + terry.reedy, brett.cannon, pitrou, serhiy.storchaka
2016-02-27 23:10:06  terry.reedy  set     messageid: <1456614606.17.0.631973936078.issue26436@psf.upfronthosting.co.za>
2016-02-27 23:10:06  terry.reedy  link    issue26436 messages
2016-02-27 23:10:05  terry.reedy  create