
make built-in tokenizer available via Python C API #47603

Closed
effbot mannequin opened this issue Jul 14, 2008 · 36 comments
Labels
3.12 (bugs and security fixes) · interpreter-core (Objects, Python, Grammar, and Parser dirs) · type-feature (A feature request or enhancement)

Comments

@effbot
Mannequin

effbot mannequin commented Jul 14, 2008

BPO 3353
Nosy @amauryfa, @meadori, @berkerpeksag, @serhiy-storchaka, @asottile, @DimitrisJim, @pablogsal
Dependencies
  • bpo-25643: Python tokenizer rewriting

Files
  • issue3353.diff: patch to move the include file etc.
  • 82706ea73ada.diff
  • issue3353.patch
  • issue3353-2.patch

  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-07-14.11:32:15.414>
    labels = ['interpreter-core', 'type-feature', '3.7']
    title = 'make built-in tokenizer available via Python C API'
    updated_at = <Date 2021-01-27.21:14:20.006>
    user = 'https://bugs.python.org/effbot'

    bugs.python.org fields:

    activity = <Date 2021-01-27.21:14:20.006>
    actor = 'pablogsal'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2008-07-14.11:32:15.414>
    creator = 'effbot'
    dependencies = ['25643']
    files = ['10961', '35730', '38992', '38999']
    hgrepos = ['260']
    issue_num = 3353
    keywords = ['patch']
    message_count = 34.0
    messages = ['69650', '70101', '70102', '70181', '70227', '70305', '143717', '221293', '240882', '240927', '240967', '245939', '289535', '289537', '289584', '289585', '289587', '289590', '289591', '385736', '385756', '385788', '385790', '385791', '385792', '385793', '385794', '385795', '385796', '385797', '385798', '385799', '385808', '385811']
    nosy_count = 12.0
    nosy_names = ['effbot', 'amaury.forgeotdarc', 'djmitche', 'kirkshorts', 'meador.inge', 'berker.peksag', 'serhiy.storchaka', 'superluser', 'Andrew.C', 'Anthony Sottile', 'Jim Fasarakis-Hilliard', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue3353'
    versions = ['Python 3.7']

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 14, 2008

    CPython provides a Python-level API to the parser, but not to the
    tokenizer itself. Somewhat annoyingly, it does provide a nice C API,
    but that's not properly exposed for external modules.

    To fix this, the tokenizer.h file should be moved from the Parser
    directory to the Include directory, and the (semi-public) functions that
    are already available must be flagged with PyAPI_FUNC, as shown below.

    The PyAPI_FUNC fix should be non-intrusive enough to go into 2.6 and
    3.0; moving stuff around is perhaps better left for a later release
    (which could also include a Python binding).

    Index: tokenizer.h
    ===================================================================

    --- tokenizer.h (revision 514)
    +++ tokenizer.h (working copy)
    @@ -54,10 +54,10 @@
            const char* str;
     };

    -extern struct tok_state *PyTokenizer_FromString(const char *);
    -extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
    -extern void PyTokenizer_Free(struct tok_state *);
    -extern int PyTokenizer_Get(struct tok_state *, char **, char **);
    +PyAPI_FUNC(struct tok_state *) PyTokenizer_FromString(const char *);
    +PyAPI_FUNC(struct tok_state *) PyTokenizer_FromFile(FILE *, char *, char *);
    +PyAPI_FUNC(void) PyTokenizer_Free(struct tok_state *);
    +PyAPI_FUNC(int) PyTokenizer_Get(struct tok_state *, char **, char **);

     #ifdef __cplusplus
     }
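
    For context, a minimal sketch of how an extension module could drive these functions once they are exported (error handling elided; this assumes the header is installed under Include/ as proposed, and that ENDMARKER and ERRORTOKEN come from token.h):

        #include <stdio.h>
        #include "tokenizer.h"
        #include "token.h"

        /* Walk a source string and print each token's type and text. */
        static void
        dump_tokens(const char *source)
        {
            struct tok_state *tok = PyTokenizer_FromString(source);
            if (tok == NULL)
                return;
            for (;;) {
                char *start, *end;
                int type = PyTokenizer_Get(tok, &start, &end);
                if (type == ENDMARKER || type == ERRORTOKEN)
                    break;
                printf("%d: %.*s\n", type, (int)(end - start), start);
            }
            PyTokenizer_Free(tok);
        }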

    @effbot effbot mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Jul 14, 2008
    @amauryfa
    Member

    IMO the "struct tok_state" should not be part of the API, it contains
    too many implementation details. Or maybe as an opaque structure.
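
    For illustration, the opaque-structure idea would mean the public header only forward-declares the type, keeping the fields private to the implementation; a hypothetical sketch (the names here are not from any patch on this issue):

        /* Public header: callers get only a forward declaration. */
        typedef struct tok_state PyTokenizer_State;

        /* Private source (e.g. Parser/tokenizer.c): the full definition,
           with all its implementation details, stays internal. */
        struct tok_state {
            char *buf;      /* input buffer */
            int lineno;     /* current line number */
            /* ... many more implementation details ... */
        };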

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 21, 2008

    There are a few things in the struct that need to be public, but that's
    nothing that cannot be handled by documentation. No need to complicate
    the API just in case.

    @kirkshorts
    Mannequin

    kirkshorts mannequin commented Jul 23, 2008

    Sorry for the terribly dumb question about this.

    Do you mean that, at this stage, all that is required is:

    1. the application of the PyAPI_FUNC macro
    2. move the file to the Include directory
    3. update Makefile.pre.in to point to the new location

    It's just that I have read this ten times or so now and keep thinking
    more must be involved :-) [certainly given my embarrassing start to the
    Python dev community re: asynchronous thread exceptions :-| ]

    I have attached a patch that does this, though at this time it lacks
    any documentation stating which parts of "struct tok_state" are private
    and which are public. I will need to trawl the code some more to do that.

    I have executed:

    • ./configure
    • make
    • make test

    And all proceed well.

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 24, 2008

    That should be all that's needed to expose the existing API as-is.
    If you want to verify the build, you can grab the pytoken.c and setup.py
    files from this directory and try building the module:

    http://svn.effbot.org/public/stuff/sandbox/pytoken/

    Make sure you remove the local copy of "tokenizer.h" that's present in
    that directory before you build. If that module builds, all's well.

    @kirkshorts
    Mannequin

    kirkshorts mannequin commented Jul 26, 2008

    Did that and it builds fine.

    So my test procedure was:

    • checkout clean source
    • apply patch as per guidelines
    • remove the file Parser/tokenizer.h (*)
    • ./configure
    • make
    • ./python setup.py install

    Build platform: Ubuntu, gcc 4.2.3

    All works fine.

    Thanks for the extra test files.

      • One question, though: I removed the file using 'svn remove', but the
        diff shows it as an empty file rather than removed. Why is that?
        (And is it correct?)

    @meadori
    Member

    meadori commented Sep 8, 2011

    It would be nice if this same C API were used to implement the 'tokenize' module. Issues like bpo-2180 will potentially require bug fixes in two places :-/

    @AndrewC
    Mannequin

    AndrewC mannequin commented Jun 22, 2014

    The previously posted patch has become outdated due to signature changes starting with revision 89f4293 on Nov 12, 2009. Attached is an updated patch.

    Can it also be confirmed what the outstanding items are for this patch to be applied? Based on the previous logs, it's not clear whether it's waiting for documentation on struct tok_state or whether another change was requested. Thanks.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    From my read of this bug, there are two distinct tasks mentioned:

    1. make PyTokenizer_* part of the Python-level API
    2. re-implement 'tokenize' in terms of that Python-level API

    #1 is largely complete in Andrew's latest patch, but that will likely need:

    • rebasing
    • hiding struct fields
    • documentation

    #2 is, I think, a separate project. There may be good reasons *not* to do this that I'm not aware of; even barring such reasons, the rewrite will be difficult and could potentially change behavior, as in bpo-2180. So I would suggest filing a new issue for #2 when #1 is complete. And I'll work on #1.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    Here's an updated patch for #1:

    Existing Patch:

    • move tokenizer.h from Parser/ to Include/
    • Add PyAPI_FUNC to export tokenizer functions

    New:

    • Remove the unused, undefined PyTokenizer_RestoreEncoding
    • Include PyTokenizer_State with limited-ABI compatibility (but still undocumented)
    • Namespace the struct name (PyTokenizer_State)
    • Add documentation

    I'd like particular attention paid to the documentation for the tokenizer -- I'm not entirely confident that I have documented the functions correctly! In particular, I'm not sure how PyTokenizer_FromString handles encodings.

    There's a further iteration possible here, but it's beyond my understanding of the tokenizer and of possible uses of the API. That would be to expose some of the tokenizer state fields and document them, either as part of the limited ABI or even the stable API. In particular, there are about a half-dozen struct fields used by the parser, and those would be good candidates for addition to the public API.

    If that's desirable, I'd prefer to merge a revision of my patch first, and keep the issue open for subsequent improvement.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    New:

    • rename token symbols in token.h with a PYTOK_ prefix (illustrated below)
    • include an example of using the PyTokenizer functions
    • address minor review comments
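
    For illustration only (the exact names in the patch may differ), the rename turns the bare token macros in Include/token.h into prefixed ones, so they no longer collide with user code:

        /* Before (Include/token.h): */
        #define ENDMARKER  0
        #define NAME       1
        #define NUMBER     2

        /* After the proposed PYTOK_ rename: */
        #define PYTOK_ENDMARKER  0
        #define PYTOK_NAME       1
        #define PYTOK_NUMBER     2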

    @djmitche
    Mannequin

    djmitche mannequin commented Jun 29, 2015

    This seems to have stalled out after the PyCon sprints. Any chance the final patch can be reviewed?

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 13, 2017

    Could you submit a PR for this?

    I haven't seen any objections to this change; a PR will expose it to more people, and a clear decision on whether the change is warranted can finally be made (I hope).

    @djmitche
    Mannequin

    djmitche mannequin commented Mar 13, 2017

    If the patch still applies cleanly, I have no issues with you or anyone opening a PR. I picked this up several years ago at the PyCon sprints, and don't remember a thing about it, nor have I touched any other bit of the CPython source since then. So any merge conflicts would be very difficult for me to resolve.

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    Okay, I'll take a look at it over the next few days and try to submit a PR after fixing any issues that might be present.

    @serhiy-storchaka
    Member

    Please hold this until bpo-25643 is finished.

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    Thanks for linking the dependency, Serhiy :-)

    Is there anybody currently working on the other issue? Also, shouldn't both issues now get retagged to Python 3.7?

    @serhiy-storchaka
    Member

    I am working on the other issue (the most recent patch is not yet published). Sorry, but the two issues modify the same code and conflict. Since I believe this issue makes fewer semantic changes, I think it would be easier to rebase it after finishing bpo-25643 than to do it in the opposite order.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 14, 2017
    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    That makes sense to me; I'll wait around until the dependency is resolved.

    @asottile
    Mannequin

    asottile mannequin commented Jan 26, 2021

    Serhiy Storchaka, is this still blocked? It's been a few years on both this and the linked issue, and I'm reaching for this one :)

    @pablogsal
    Member

    I am -1 on exposing the C API of the tokenizer. For the new parser, several modifications of the C tokenizer had to be made, and some of them modify existing behaviour slightly. I don't want to corner ourselves in a place where we cannot make improvements because they would be backwards-incompatible changes to an exposed API.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    I'm interested in it because the tokenize module is painfully slow

    @pablogsal
    Member

    I'm interested in it because the tokenize module is painfully slow

    I assumed, but I don't feel comfortable exposing the built-in one.

    @pablogsal
    Member

    I assumed, but I don't feel comfortable exposing the built-in one.

    As an example of the situation I want to avoid: every time we change anything in the AST because of internal details, we get many complaints and pressure from tool authors, because they need to add branches or because it makes life more difficult for them, and I absolutely want to avoid more of that.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    You already have that right now, because the tokenize module is exposed (except that every change to the tokenization has to be implemented once in C and once in Python).

    It's much more frustrating when the two differ, as well.

    I don't think all the internals of the C tokenization need to be exposed; my main goals, and the reasons for them, would be:

    • eliminate the (potential) drift and complexity between the two
    • get a fast tokenizer

    Unlike the AST, the tokenization changes much less frequently (the last major addition I can remember is the @ operator).

    We can hide almost all of the details of the tokenization behind an opaque struct and getter functions.
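
    As a sketch of that idea (every getter name below is hypothetical, not from any patch on this issue), the public header would expose only the opaque type plus accessors for the handful of fields consumers actually need, building on the opaque layout sketched earlier in the thread:

        typedef struct tok_state PyTokenizer_State;   /* opaque to callers */

        /* Lifecycle, mirroring the existing semi-public functions: */
        PyTokenizer_State *PyTokenizer_FromString(const char *str);
        int PyTokenizer_Get(PyTokenizer_State *tok, char **start, char **end);
        void PyTokenizer_Free(PyTokenizer_State *tok);

        /* Hypothetical getters instead of exposed struct fields: */
        int PyTokenizer_GetLineNumber(const PyTokenizer_State *tok);
        const char *PyTokenizer_GetEncoding(const PyTokenizer_State *tok);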

    @pablogsal
    Member

    For reimplementing Lib/tokenize.py we don't need to publicly expose anything in the C API. We can have a private _tokenize module which uses whatever you need, and then use that _tokenize module in the tokenize.py file to reimplement the exact Python API that the module exposes.

    Publicly exposing the headers or APIs opens new boxes of potential problems: ABI stability, changes in the signatures, changes in the structs. Our experience so far with other parts is that it is almost always painful to add optimizations to internal functions that are partially exposed, so I am still not convinced about offering public C APIs for the built-in tokenizer.
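
    For illustration, a private extension module along those lines could start from the usual CPython extension boilerplate; this is a minimal hypothetical skeleton, not the _tokenize module CPython later grew:

        #include <Python.h>

        /* Would wrap the internal C tokenizer; stubbed out here. */
        static PyObject *
        tokenize_next_token(PyObject *self, PyObject *args)
        {
            Py_RETURN_NONE;  /* placeholder: return the next token */
        }

        static PyMethodDef tokenize_methods[] = {
            {"next_token", tokenize_next_token, METH_VARARGS,
             "Private helper for Lib/tokenize.py."},
            {NULL, NULL, 0, NULL}
        };

        static struct PyModuleDef tokenizemodule = {
            PyModuleDef_HEAD_INIT, "_tokenize", NULL, -1, tokenize_methods
        };

        PyMODINIT_FUNC
        PyInit__tokenize(void)
        {
            return PyModule_Create(&tokenizemodule);
        }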

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    Private API sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage), but if it isn't, then even better.

    @pablogsal
    Member

    Private API sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage), but if it isn't, then even better.

    We can make it a builtin the same way we do for the _ast module, or we can have a new module under Modules/ (exposing the symbols in the dynamic table) **but** make them private (and not documented), which explicitly goes against what this issue proposes.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    Either works for me. Would you be able to point me to the starting bits as to how _ast becomes a builtin?

    @pablogsal
    Member

    Either works for me. Would you be able to point me to the starting bits as to how _ast becomes a builtin?

    https://github.com/python/cpython/blob/master/Python/Python-ast.c#L10075-L10079

    and

    struct _inittab _PyImport_Inittab[] = {
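
    For reference, built-in modules are wired in through that import init table (the real table is generated into Modules/config.c); a hedged sketch of what a hypothetical _tokenize entry could look like next to _ast:

        extern PyObject *PyInit__ast(void);
        extern PyObject *PyInit__tokenize(void);   /* hypothetical new entry */

        struct _inittab _PyImport_Inittab[] = {
            /* ... existing entries ... */
            {"_ast", PyInit__ast},
            {"_tokenize", PyInit__tokenize},       /* hypothetical */
            {0, 0}
        };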

    But before that I have some questions. For example: how do you plan to implement the readline() interface that tokenize.py uses in the C module without modifying tokenize.c?

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    I haven't looked into or thought about that yet; it might not be possible.

    It might also make sense to build new tokenize.py APIs avoiding the readline() API -- I always found it painful to work with.

    @pablogsal
    Member

    It might also make sense to build new tokenize.py APIs avoiding the readline() API -- I always found it painful to work with.

    Then we would need to maintain the old Python APIs plus the new ones using the module? What you are proposing seems like more than just speeding up tokenize.py by reusing the existing C code.

    @pablogsal
    Member

    I have built a draft of the changes required to make what you describe work, in case you want to finish them:

    https://github.com/pablogsal/cpython/tree/tokenizer_mod

    @pablogsal
    Member

    Problems that you are going to find:

    • The C tokenizer throws syntax errors while the tokenize module does not. For example:

    ❯ python -c "1_"
    File "<string>", line 1
    1_
    ^
    SyntaxError: invalid decimal literal

    ❯ python -m tokenize <<< "1_"
    1,0-1,1: NUMBER '1'
    1,1-1,2: NAME '_'
    1,2-1,3: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    • The encoding cannot be immediately specified; you need to thread it through in many places.

    • The readline() function can now return anything, or be anything, and that needs to be handled (better) in the C tokenizer so it does not crash (see the sketch after this list).

    • str vs. bytes handling in the C tokenizer.

    • The C tokenizer does not get the full line in some cases, or it is tricky to get the full line.
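
    On the readline() point, a hedged sketch of the kind of defensive check the C side would need before trusting the callable's return value (a hypothetical helper, not code from the draft branch above):

        /* Call a Python readline() object and validate the result. */
        static PyObject *
        call_readline_checked(PyObject *readline)
        {
            PyObject *line = PyObject_CallNoArgs(readline);
            if (line == NULL)
                return NULL;                      /* propagate the Python error */
            if (!PyBytes_Check(line) && !PyUnicode_Check(line)) {
                PyErr_Format(PyExc_TypeError,
                             "readline() returned %.200s, expected str or bytes",
                             Py_TYPE(line)->tp_name);
                Py_DECREF(line);
                return NULL;
            }
            return line;
        }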

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.7 (EOL) end of life labels Sep 12, 2022
    @lysnikolaou
    Contributor

    Since 3.12, the Python tokenize module uses the C tokenizer internally. Is this enough to close this issue?

    @pablogsal
    Member

    Yup
