Author jacques
Recipients akitada, amaury.forgeotdarc, belopolsky, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jacques, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, stiv, timehorse, vbr, zdwiel
Date 2010-12-30.00:06:55
Message-id <1293667619.18.0.388569666942.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
More an observation than a bug:

I understand that we're trading memory for performance, but I've noticed that the peak memory usage is rather high, e.g.:

$ cat test.py
import os
import regex as re

def resident():
    for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
        if line.startswith("VmRSS:"):
            return line.split(":")[-1].strip()

cache = {}

print resident()
for i in xrange(0,1000):
    cache[i] = re.compile(str(i)+"(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")

print resident()


Execution output on my machine (Linux x86_64, Python 2.6.5):
4328 kB
32052 kB

With the standard library re module:
3688 kB
5428 kB

So it looks like around 16x the memory per pattern compared with the standard re module (roughly 27.7 kB vs. 1.7 kB per pattern).
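
For reference, the ratio can be reproduced from the four VmRSS figures above. This is only a back-of-the-envelope sketch: the helper name per_pattern_kb is made up for illustration, and it assumes all of the resident-memory growth is attributable to the 1000 compiled patterns.

```python
# VmRSS figures (kB) copied from the two runs above: before and after
# compiling the patterns. Assumption: all growth is due to the patterns.
N_PATTERNS = 1000


def per_pattern_kb(before_kb, after_kb, n=N_PATTERNS):
    """Average resident-memory growth per compiled pattern, in kB."""
    return (after_kb - before_kb) / float(n)


regex_kb = per_pattern_kb(4328, 32052)  # new regex module
re_kb = per_pattern_kb(3688, 5428)      # standard library re module
ratio = regex_kb / re_kb

print("regex: %.2f kB/pattern, re: %.2f kB/pattern, ratio: %.1fx"
      % (regex_kb, re_kb, ratio))
```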

Now, the example is pretty silly; the difference is even larger for more complex regexes.  I also understand that once the patterns are GC-ed, Python can reuse the memory (pymalloc doesn't return it to the OS, unfortunately).  However, I have some applications that use large numbers (many thousands) of regexes and need to keep them cached (compiled) indefinitely, especially because compilation is expensive.  This causes some pain (long story).
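
To check the per-pattern cost against one's own patterns, the measurement can be wrapped in a small helper. This is a rough sketch, not part of the original test script: the names resident_kb and growth_per_pattern are hypothetical, it relies on the Linux-only /proc/<pid>/status file, and it runs against the standard re module by default (pass the regex module instead if it is installed).

```python
import os
import re


def resident_kb():
    """Return this process's VmRSS in kB, or None if /proc is unavailable."""
    try:
        with open('/proc/%d/status' % os.getpid()) as status:
            for line in status:
                if line.startswith('VmRSS:'):
                    # The line looks like "VmRSS:     4328 kB".
                    return int(line.split()[1])
    except IOError:
        pass  # not Linux, or /proc is not mounted
    return None


def growth_per_pattern(module, patterns):
    """Compile every pattern with `module`, keep them all alive, and
    return the average RSS growth per pattern in kB (None if unknown)."""
    before = resident_kb()
    cache = [module.compile(p) for p in patterns]
    after = resident_kb()
    if before is None or after is None or not cache:
        return None
    return (after - before) / float(len(cache))


if __name__ == '__main__':
    alternation = ("(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|"
                   "sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|"
                   "klasdfljasdf)")
    patterns = [str(i) + alternation for i in range(1000)]
    print(growth_per_pattern(re, patterns))
```

RSS is a coarse measure (allocator and page granularity both add noise), so the result is only meaningful for a large number of patterns.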

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a significant difference, e.g.:

RE_MIN_FAST_LENGTH = 10:
4324 kB
25976 kB

That is about 22% less growth for the 1000 patterns (21652 kB vs. 27724 kB).  In my use cases, a larger RE_MIN_FAST_LENGTH doesn't make a huge performance difference, so that might be the way I'll go.