classification
Title: make built-in tokenizer available via Python C API
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: 25643 Superseder:
Assigned To: Nosy List: Andrew.C, Jim Fasarakis-Hilliard, amaury.forgeotdarc, berker.peksag, djmitche, effbot, kirkshorts, meador.inge, serhiy.storchaka, superluser
Priority: normal Keywords: patch

Created on 2008-07-14 11:32 by effbot, last changed 2017-03-14 14:28 by Jim Fasarakis-Hilliard.

Files
File name Uploaded Description Edit
issue3353.diff kirkshorts, 2008-07-23 22:53 Patch to move the include file etc
82706ea73ada.diff Andrew.C, 2014-06-22 18:36 review
issue3353.patch djmitche, 2015-04-14 16:07 issue3353.patch review
issue3353-2.patch djmitche, 2015-04-14 18:06 issue3353-2.patch review
Repositories containing patches
https://bitbucket.org/bloomberg/cpython/#issue3353
Messages (19)
msg69650 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2008-07-14 11:32
CPython provides a Python-level API to the parser, but not to the
tokenizer itself.  Somewhat annoyingly, it does provide a nice C API,
but that's not properly exposed for external modules.  

To fix this, the tokenizer.h file should be moved from the Parser
directory to the Include directory, and the (semi-public) functions that
already available must be flagged with PyAPI_FUNC, as shown below.

The PyAPI_FUNC fix should be non-intrusive enough to go into 2.6 and
3.0; moving stuff around is perhaps better left for a later release
(which could also include a Python binding).

Index: tokenizer.h
===================================================================
--- tokenizer.h (revision 514)
+++ tokenizer.h (working copy)
@@ -54,10 +54,10 @@
        const char* str;
 };

-extern struct tok_state *PyTokenizer_FromString(const char *);
-extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
-extern void PyTokenizer_Free(struct tok_state *);
-extern int PyTokenizer_Get(struct tok_state *, char **, char **);
+PyAPI_FUNC(struct tok_state *) PyTokenizer_FromString(const char *);
+PyAPI_FUNC(struct tok_state *) PyTokenizer_FromFile(FILE *, char *,
char *);
+PyAPI_FUNC(void) PyTokenizer_Free(struct tok_state *);
+PyAPI_FUNC(int) PyTokenizer_Get(struct tok_state *, char **, char **);

 #ifdef __cplusplus
 }
msg70101 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-07-21 10:00
IMO the "struct tok_state" should not be part of the API, it contains
too many implementation details. Or maybe as an opaque structure.
msg70102 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2008-07-21 10:03
There are a few things in the struct that needs to be public, but that's
nothing that cannot be handled by documentation.  No need to complicate
the API just in case.
msg70181 - (view) Author: Andy (kirkshorts) Date: 2008-07-23 22:53
Sorry for the terribly dumb question about this.

Are you meaning that, at this stage, all that is required is:

 1. the application of the PyAPI_FUNC macro
 2. move the file to the Include directory
 3. update Makefile.pre.in to point to the new location

Just I have read this now 10 times or so and keep thinking more must be
involved :-) [certainly given my embarrassing start to the Python dev
community re:asynchronous thread exceptions :-| ]

I have attached a patch that does this. Though at this time it is
lacking any documentation that will state what parts of "struct
tok_state" are private and public. I will need to trawl the code some
more to do that.

I have executed:

 - ./configure
 - make
 - make test

And all proceed well.
msg70227 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2008-07-24 21:25
That's should be all that's needed to expose the existing API, as is. 
If you want to verify the build, you can grab the pytoken.c and setup.py
files from this directory, and try building the module.

http://svn.effbot.org/public/stuff/sandbox/pytoken/

Make sure you remove the local copy of "tokenizer.h" that's present in
that directory before you build.  If that module builds, all's well.
msg70305 - (view) Author: Andy (kirkshorts) Date: 2008-07-26 20:59
Did that and it builds fine.

So my test procedure was:

 - checkout clean source
 - apply patch as per guidelines
 - remove the file Psrser/tokenizer.h (*)
 - ./configure
 - make
 - ./python setup.py install

Build platform: Ubuntu , gcc 4.2.3

All works fine.

thanks for the extra test files.

* - one question though. I removed the file using 'svn remove' but the
diff makes it an empty file not removed why is that? (and is it correct?)
msg143717 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-09-08 01:47
It would be nice if this same C API was used to implement the 'tokenize' module.  Issues like issue2180 will potentially require bug fixes in two places :-/
msg221293 - (view) Author: Andrew C (Andrew.C) * Date: 2014-06-22 18:35
The previously posted patch has become outdated due to signature changes staring with revision 89f4293 on Nov 12, 2009.  Attached is an updated patch.

Can it also be confirmed what are the outstanding items for this patch to be applied?  Based on the previous logs it's not clear if it's waiting for documentation on the struct tok_state or if there is another change requested.  Thanks.
msg240882 - (view) Author: Dustin J. Mitchell (djmitche) * Date: 2015-04-14 13:35
From my read of this bug, there are two distinct tasks mentioned:

 1. make PyTokenizer_* part of the Python-level API
 2. re-implement 'tokenize' in terms of that Python-level API

#1 is largely complete in Andrew's latest patch, but that will likely need:
 * rebasing
 * hiding struct fields
 * documentation

#2 is, I think, a separate project.  There may be good reasons *not* to do this which I'm not aware of, and barring such reasons the rewrite will be difficult and could potentially change behavior like issue2180.  So I would suggest filing a new issue for #2 when #1 is complete.  And I'll work on #1.
msg240927 - (view) Author: Dustin J. Mitchell (djmitche) * Date: 2015-04-14 16:07
Here's an updated patch for #1:

Existing Patch:
 - move tokenizer.h from Parser/ to Include/
 - Add PyAPI_Func to export tokenizer functions

New:
 - Removed unused, undefined PyTokenizer_RestoreEncoding
 - Include PyTokenizer_State with limited ABI compatibility (but still undocumented)
 - namespace the struct name (PyTokenizer_State)
 - Documentation

I'd like particular attention to the documentation for the tokenizer -- I'm not entirely confident that I have documented the functions correctly!  In particular, I'm not sure how PyTokenizer_FromString handles encodings.

There's a further iteration possible here, but it's beyond my understanding of the tokenizer and of possible uses of the API. That would be to expose some of the tokenizer state fields and document them, either as part of the limited ABI or even the stable API.  In particular, there are about a half-dozen struct fields used by the parser, and those would be good candidates for addition to the public API.

If that's desirable, I'd prefer to merge a revision of my patch first, and keep the issue open for subsequent improvement.
msg240967 - (view) Author: Dustin J. Mitchell (djmitche) * Date: 2015-04-14 18:06
New:
 - rename token symbols in token.h with a PYTOK_ prefix
 - include an example of using the PyTokenizer functions
 - address minor review comments
msg245939 - (view) Author: Dustin J. Mitchell (djmitche) * Date: 2015-06-29 14:41
This seems to have stalled out after the PyCon sprints.  Any chance the final patch can be reviewed?
msg289535 - (view) Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) * Date: 2017-03-13 10:03
Could you submit a PR for this? 

I haven't seen any objections to this change, a PR will expose this to more people and a clear decision on whether this change is warranted can be finally made (I hope).
msg289537 - (view) Author: Dustin J. Mitchell (djmitche) * Date: 2017-03-13 12:45
If the patch still applies cleanly, I have no issues with you or anyone opening a PR.  I picked this up several years ago at the PyCon sprints, and don't remember a thing about it, nor have I touched any other bit of the CPython source since then.  So any merge conflicts would be very difficult for me to resolve.
msg289584 - (view) Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) * Date: 2017-03-14 13:46
Okay, I'll take a look at it over the next days and try and submit a PR after fixing any issues that might be present.
msg289585 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-14 13:53
Please hold this until finishing issue25643.
msg289587 - (view) Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) * Date: 2017-03-14 13:59
Thanks for linking the dependency, Serhiy :-) 

Is there anybody currently working on the other issue? Also, shouldn't both issues now get retagged to Python 3.7?
msg289590 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-14 14:24
I am working on the other issue (the recent patch is still not published). Sorry, but two issues modify the same code and are conflicting. Since I believe that this issue makes less semantic changes, I think it would be easier to rebase it after finishing issue25643 than do it in contrary order.
msg289591 - (view) Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) * Date: 2017-03-14 14:28
That makes sense to me, I'll wait around until the dependency is resolved.
History
Date User Action Args
2017-03-14 14:28:59Jim Fasarakis-Hilliardsetmessages: + msg289591
2017-03-14 14:24:49serhiy.storchakasetmessages: + msg289590
versions: + Python 3.7, - Python 3.6
2017-03-14 13:59:44Jim Fasarakis-Hilliardsetmessages: + msg289587
2017-03-14 13:53:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg289585
2017-03-14 13:52:27serhiy.storchakasetdependencies: + Python tokenizer rewriting
2017-03-14 13:46:39Jim Fasarakis-Hilliardsetmessages: + msg289584
2017-03-13 12:45:42djmitchesetmessages: + msg289537
2017-03-13 10:03:04Jim Fasarakis-Hilliardsetnosy: + Jim Fasarakis-Hilliard
messages: + msg289535
2015-11-14 12:36:26berker.peksagsetnosy: + berker.peksag

versions: + Python 3.6, - Python 3.5
2015-11-06 03:12:09superlusersetnosy: + superluser
2015-06-29 14:41:47djmitchesetmessages: + msg245939
2015-04-15 02:36:41ned.deilysetstage: test needed -> patch review
2015-04-14 18:06:33djmitchesetfiles: + issue3353-2.patch

messages: + msg240967
2015-04-14 16:08:00djmitchesetfiles: + issue3353.patch

messages: + msg240927
2015-04-14 13:35:08djmitchesetnosy: + djmitche
messages: + msg240882
2014-06-22 18:36:54Andrew.Csetfiles: + 82706ea73ada.diff
2014-06-22 18:35:28Andrew.Csethgrepos: + hgrepo260

messages: + msg221293
nosy: + Andrew.C
2014-06-20 16:09:30zach.waresetversions: + Python 3.5, - Python 3.2
2011-09-08 01:47:33meador.ingesetnosy: + meador.inge
messages: + msg143717
2010-08-09 18:36:46terry.reedysetversions: - Python 2.7
2009-05-16 20:35:06ajaksu2setpriority: normal
stage: test needed
versions: + Python 3.2, - Python 2.6, Python 3.0
2008-07-26 20:59:58kirkshortssetmessages: + msg70305
2008-07-24 21:25:36effbotsetmessages: + msg70227
2008-07-23 22:53:05kirkshortssetfiles: + issue3353.diff
nosy: + kirkshorts
messages: + msg70181
keywords: + patch
2008-07-21 10:03:44effbotsetmessages: + msg70102
2008-07-21 10:00:39amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg70101
2008-07-14 11:32:15effbotcreate