This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author jpeel
Recipients jpeel, rhettinger
Date 2010-12-09.22:58:56
SpamBayes Score 4.3531845e-13
Marked as misclassified No
Message-id <1291935540.06.0.127378209926.issue10667@psf.upfronthosting.co.za>
In-reply-to
Content
I put the Counter's update and __missing__ methods into C code. The rest of the existing methods remain the same. The new Counter is at least 2x-10x (or more) faster for updating depending on the input. I also added a new method, update_fromsubs, which is documented below the timings for updating a Counter.

Here are a few examples to demonstrate the speed of the new update function:

Counting list(range(100))*100 - 
    Old Counter: 18.6 msec
    New Counter: 1.43
13x speed-up

Counting range(10000) - 
    Old Counter: 21.5 msec
    New Counter: 3.84 msec
5.6x speed-up

Counter generator  i % 100 for i in range(10000) - 
    Old Counter: 36.7 msec
    New Counter: 17.5 msec
2.1x speed-up

and the __missing__ method being in C makes getting missing keys about twice as fast.

Also, I added a new method, update_fromsubs, which counts subarrays/substrings of a sequence. This method only works with immutable sequences like tuples, bytes, and unicode objects. The method is of the form

update_fromsubs(seq, frame[, step[, lo[, hi]]])

where frame is the size of the subarrays, step is how far to move forward after each subarray, and lo and hi are the respective starting and ending indices to count subarrays within the sequence.

For instance,

    c = Counter()
    c.update_fromsubs('abcd', 2)

yields

    Counter({'cd': 1, 'ab': 1, 'bc': 1})

and

    d = Counter()
    d.update_fromsubs('abcd', 2, 2)

yields

    Counter({'ab': 1, 'cd: 1})

These sorts of operations could be done with generators in a manner like

    Counter(s[i:i+2] for i in range(0, len(s)-1))

but it can be about twice as fast by using update_fromsubs. Here's an example

Counting subarrays of length 2 of "ab"*10000:
    update_fromsubs:            30.8 ms
    New Counter w/ generator:   54.3 ms
    Old Counter w/ generator:   98.8 ms

This method would probably be most useful in processing strings, but there may be other uses. If it is accepted, please feel to change the method name and parameters. I especially wasn't sure what to call this method (update_fromsubsequences seemed rather long).
History
Date User Action Args
2010-12-09 22:59:00jpeelsetrecipients: + jpeel, rhettinger
2010-12-09 22:59:00jpeelsetmessageid: <1291935540.06.0.127378209926.issue10667@psf.upfronthosting.co.za>
2010-12-09 22:58:58jpeellinkissue10667 messages
2010-12-09 22:58:57jpeelcreate