Message 260950 - Python tracker

Author	terry.reedy
Recipients	brett.cannon, pitrou, serhiy.storchaka, terry.reedy
Date	2016-02-27.23:10:05
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1456614606.17.0.631973936078.issue26436@psf.upfronthosting.co.za>
In-reply-to

Content
DNA matching can be done with difflib.  Serious high-volume work should use compiled specialized matchers and aligners.

This particular benchmark, explained a bit at https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna, manipulates and searches standard FASTA format representations of sequences with the regex available in each language.  (The site has another Python implementation at https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=python3&id=1. It uses unicode strings rather than bytes, and multiprocessing.Pool to run re.findall in parallel.)
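The "manipulates and searches" steps can be sketched roughly as follows. This is only an illustration, not the benchmark code: the tiny `fasta` string is made up, and it assumes that description lines and newlines are stripped before counting, which is not spelled out above.

```python
import re

# Hypothetical miniature FASTA input; real inputs are far larger.
fasta = ">ONE example description\nagggtaaa\ncBtttaccct\n"

# Assumed cleanup step: drop description lines (starting with '>')
# and all remaining newlines.
seq = re.sub(r">.*\n|\n", "", fasta)

# Search step: count the matches of one pattern with re.findall.
count = len(re.findall(r"agggtaaa|tttaccct", seq))

print(seq, count)
```

A str is used here for brevity; the same calls work on bytes if the patterns and input are bytes as well.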

FASTA uses lowercase a, c, g, t for known bases and at least 11 uppercase letters for subsets of bases, representing partially known bases.  The third task is to expand the uppercase letters into subsets of lowercase letters.  Since the rules require the use of re and one substitution at a time, the 2 Python programs run re.sub over the current sequence 11 times.  More idiomatic for Python, and probably faster, would be to use seq.replace(old, new) instead.  Perhaps even more idiomatic, and probably faster still, would be to use str.translate, as in this reduced example.

>>> table = {ord('B') : '(c|g|t)', ord('D') : '(a|g|t)'}
>>> 'aBcDg'.translate(table)
'a(c|g|t)c(a|g|t)g'
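
For comparison, the three approaches mentioned above can be put side by side. This is a minimal sketch, not the benchmark programs: the `IUB` mapping is the same reduced two-entry table, and the function names are made up for illustration. Nothing is timed here, so "probably faster" remains a guess until measured.

```python
import re

# Reduced, two-entry mapping (the full table has at least 11 codes).
IUB = {'B': '(c|g|t)', 'D': '(a|g|t)'}

def expand_with_re(seq):
    # One re.sub pass per code, as the benchmark rules require.
    for code, alternatives in IUB.items():
        seq = re.sub(code, alternatives, seq)
    return seq

def expand_with_replace(seq):
    # Same loop, but with plain string replacement instead of re.
    for code, alternatives in IUB.items():
        seq = seq.replace(code, alternatives)
    return seq

def expand_with_translate(seq):
    # Single pass over seq; str.translate wants code points as keys.
    table = {ord(code): alternatives for code, alternatives in IUB.items()}
    return seq.translate(table)

for expand in (expand_with_re, expand_with_replace, expand_with_translate):
    print(expand.__name__, expand('aBcDg'))
```

The looped versions are only safe because the replacement text is all lowercase and so never reintroduces an uppercase code; str.translate avoids that concern by making a single pass.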

History
Date	User	Action	Args
2016-02-27 23:10:06	terry.reedy	set	recipients: + terry.reedy, brett.cannon, pitrou, serhiy.storchaka
2016-02-27 23:10:06	terry.reedy	set	messageid: <1456614606.17.0.631973936078.issue26436@psf.upfronthosting.co.za>
2016-02-27 23:10:06	terry.reedy	link	issue26436 messages
2016-02-27 23:10:05	terry.reedy	create