classification
Title: Module containing C implementations of common text algorithms
Type: enhancement
Stage: test needed
Components: Extension Modules, Unicode
Versions: Python 3.2, Python 2.7
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, amaury.forgeotdarc, christian.heimes, georg.brandl, mchaput, mrabarnett, vstinner
Priority: low
Keywords:

Created on 2008-02-07 04:45 by mchaput, last changed 2010-09-21 03:49 by mrabarnett. This issue is now closed.

Messages (8)
msg62134 - Author: Matt Chaput (mchaput) Date: 2008-02-07 04:45
Add a module to the standard library containing fast (C) implementations
of common text/language-related algorithms, beginning specifically with Porter
(and perhaps other) stemming and Levenshtein (and perhaps other) edit
distance. Both of these algorithms are useful in multiple domains, well
known and understood, and have sample implementations all over the Web,
but are compute-intensive and prohibitively expensive when implemented
in pure Python.
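
For readers unfamiliar with the edit-distance algorithm named above, here is a minimal pure-Python sketch of the standard two-row dynamic-programming formulation. It is not taken from this issue or from any proposed patch; the function name `levenshtein` is purely illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between a and b.

    Illustrative sketch only; not the API of any module discussed here.
    """
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The nested loop performs len(a) * len(b) cell updates, which is the cost the request describes as expensive when run in the interpreter.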
msg62138 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-02-07 09:42
I don't think that this should be part of the core standard library.
Did you look at the TextIndexNG project?
http://opensource.zopyx.com/projects/TextIndexNG3/
msg62161 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-07 17:49
I agree with Amaury. Python uses the slogan "batteries included" and not
"fusion reactor included". We cannot and never will include every library
that may be useful for some users. Python core's development cycles are
too slow for fast-moving software. Andreas' TXNG3 contains fine
implementations for stemming and Levenshtein.
msg62187 - Author: Matt Chaput (mchaput) Date: 2008-02-08 00:27
The Porter stemming and Levenshtein edit-distance algorithms are not
"fast-moving" nor are they fusion reactors... they've been around
forever, and are simple to implement, but are still useful in various
common scenarios. I'd say this is similar to Python including an
implementation of digest functions such as SHA: it's useful enough, and
compute-intensive enough, to warrant a C implementation. Shipping C
extensions is not an option for everyone; it's especially a pain with
Windows.
msg62199 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-02-08 15:04
Even PHP includes Levenshtein... ;)
msg106281 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-22 01:50
Before having an optimized version of common text algorithms, why not start with a Python version? Writing and maintaining C code is harder, and I'm not sure that performance is critical for such algorithms.

This issue has no patch: if nobody provides one, I will close it because I agree with Amaury and Christian (this issue can be solved by a third-party module, and such a module can be written in C).
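
As an illustration of what a pure-Python starting point for the stemming side might look like, the sketch below implements only step 1a (plural handling) of the original Porter algorithm. It is not a full stemmer and is not code from this issue; the function name `porter_step_1a` is hypothetical.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of the Porter stemming algorithm to a lowercase word.

    Illustrative sketch only: the remaining Porter steps are omitted.
    Rules are tried longest suffix first, and only the first match applies.
    """
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word


if __name__ == "__main__":
    for w in ("caresses", "ponies", "caress", "cats"):
        print(w, "->", porter_step_1a(w))
```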
msg116944 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-20 14:27
I'll close this as suggested in msg106281 in a couple of weeks unless someone objects.
msg117028 - Author: Matthew Barnett (mrabarnett) * Date: 2010-09-21 03:49
I've started on a module called 'texttools'. So far it has Levenshtein and Porter (both coded in C).

If there's interest I'll put it on PyPI.

Suggestions for other additions?
History
Date User Action Args
2010-09-21 03:49:28 mrabarnett set nosy: + mrabarnett
messages: + msg117028
2010-09-20 22:08:15 benjamin.peterson set status: pending -> closed
resolution: rejected
2010-09-20 14:27:13 BreamoreBoy set status: open -> pending
nosy: + BreamoreBoy
messages: + msg116944
2010-05-22 01:50:18 vstinner set nosy: + vstinner
messages: + msg106281
2009-05-16 19:37:45 ajaksu2 set priority: normal -> low
nosy: georg.brandl, amaury.forgeotdarc, mchaput, christian.heimes
versions: + Python 2.7, Python 3.2, - Python 2.6
components: + Extension Modules, Unicode, - Library (Lib)
stage: test needed
2008-02-08 15:04:21 georg.brandl set nosy: + georg.brandl
messages: + msg62199
2008-02-08 00:27:54 mchaput set messages: + msg62187
2008-02-07 17:49:44 christian.heimes set priority: normal
nosy: + christian.heimes
messages: + msg62161
versions: + Python 2.6
2008-02-07 09:43:01 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg62138
2008-02-07 04:45:40 mchaput create