Module containing C implementations of common text algorithms #46311

Closed
mchaput mannequin opened this issue Feb 7, 2008 · 8 comments
Labels
extension-modules C modules in the Modules dir topic-unicode type-feature A feature request or enhancement

Comments

@mchaput
Mannequin

mchaput mannequin commented Feb 7, 2008

BPO 2027
Nosy @birkenfeld, @amauryfa, @vstinner, @tiran

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2010-09-20.22:08:15.937>
created_at = <Date 2008-02-07.04:45:40.518>
labels = ['extension-modules', 'type-feature', 'expert-unicode']
title = 'Module containing C implementations of common text algorithms'
updated_at = <Date 2010-09-21.03:49:28.731>
user = 'https://bugs.python.org/mchaput'

bugs.python.org fields:

activity = <Date 2010-09-21.03:49:28.731>
actor = 'mrabarnett'
assignee = 'none'
closed = True
closed_date = <Date 2010-09-20.22:08:15.937>
closer = 'benjamin.peterson'
components = ['Extension Modules', 'Unicode']
creation = <Date 2008-02-07.04:45:40.518>
creator = 'mchaput'
dependencies = []
files = []
hgrepos = []
issue_num = 2027
keywords = []
message_count = 8.0
messages = ['62134', '62138', '62161', '62187', '62199', '106281', '116944', '117028']
nosy_count = 7.0
nosy_names = ['georg.brandl', 'amaury.forgeotdarc', 'mchaput', 'vstinner', 'christian.heimes', 'mrabarnett', 'BreamoreBoy']
pr_nums = []
priority = 'low'
resolution = 'rejected'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue2027'
versions = ['Python 2.7', 'Python 3.2']

@mchaput
Mannequin Author

mchaput mannequin commented Feb 7, 2008

Add a module to the standard library containing fast (C) implementations
of common text/language-related algorithms, starting with Porter
(and perhaps other) stemming and Levenshtein (and perhaps other) edit
distance. Both algorithms are useful in multiple domains, well
known and understood, and have sample implementations all over the Web,
but are compute-intensive and prohibitively expensive when implemented
in pure Python.

@mchaput mchaput mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Feb 7, 2008
@amauryfa
Member

amauryfa commented Feb 7, 2008

I don't think that this should be part of the core standard library.
Did you look at the TextIndexNG project?
http://opensource.zopyx.com/projects/TextIndexNG3/

@tiran
Member

tiran commented Feb 7, 2008

I agree with Amaury. Python uses the slogan "batteries included" and not
"fusion reactor included". We cannot, and never will, include every library
that may be useful for some users. Python core's development cycles are
too slow for fast-moving software. Andreas' TXNG3 contains fine
implementations for stemming and Levenshtein.

@mchaput
Mannequin Author

mchaput mannequin commented Feb 8, 2008

The Porter stemming and Levenshtein edit-distance algorithms are not
"fast-moving" nor are they fusion reactors... they've been around
forever, and are simple to implement, but are still useful in various
common scenarios. I'd say this is similar to Python including an
implementation of digest functions such as SHA: it's useful enough, and
compute-intensive enough, to warrant a C implementation. Shipping C
extensions is not an option for everyone; it's especially a pain with
Windows.

@birkenfeld
Member

Even PHP includes Levenshtein... ;)

@devdanzin devdanzin mannequin added extension-modules C modules in the Modules dir topic-unicode and removed stdlib Python modules in the Lib dir labels May 16, 2009
@vstinner
Member

Before having an optimized version of common text algorithms, why not start with a Python one? Writing and maintaining C code is harder, and I'm not sure that performance is critical for such algorithms.

This issue has no patch: if nobody provides one, I will close it because I agree with Amaury and Christian (this issue can be solved by a third-party module, and such a module can be written in C).
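
For reference, a pure-Python starting point of the kind suggested here could look like the sketch below. It is not code from this issue or from any module mentioned in the thread; the function name `levenshtein` and the two-row dynamic-programming layout are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between strings a and b."""
    # Iterate over the longer string in the outer loop so that each
    # row of the table is only as long as the shorter string.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The sketch does work proportional to `len(a) * len(b)` and keeps only two rows in memory. Whether that is fast enough is exactly the open question in this thread, so it is best treated as a baseline to measure against a C version rather than as a verdict either way.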

@BreamoreBoy
Mannequin

BreamoreBoy mannequin commented Sep 20, 2010

I'll close this as suggested in msg106281 in a couple of weeks unless someone objects.

@mrabarnett
Mannequin

mrabarnett mannequin commented Sep 21, 2010

I've started on a module called 'texttools'. So far it has Levenshtein and Porter (both coded in C).

If there's interest I'll put it on PyPI.

Suggestions for other additions?

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022