New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make built-in tokenizer available via Python C API #47603
Comments
CPython provides a Python-level API to the parser, but not to the To fix this, the tokenizer.h file should be moved from the Parser The PyAPI_FUNC fix should be non-intrusive enough to go into 2.6 and Index: tokenizer.h --- tokenizer.h (revision 514)
+++ tokenizer.h (working copy)
@@ -54,10 +54,10 @@
const char* str;
}; -extern struct tok_state *PyTokenizer_FromString(const char *);
-extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
-extern void PyTokenizer_Free(struct tok_state *);
-extern int PyTokenizer_Get(struct tok_state *, char **, char **);
+PyAPI_FUNC(struct tok_state *) PyTokenizer_FromString(const char *);
+PyAPI_FUNC(struct tok_state *) PyTokenizer_FromFile(FILE *, char *,
char *);
+PyAPI_FUNC(void) PyTokenizer_Free(struct tok_state *);
+PyAPI_FUNC(int) PyTokenizer_Get(struct tok_state *, char **, char **); #ifdef __cplusplus
} |
IMO the "struct tok_state" should not be part of the API, it contains |
There are a few things in the struct that needs to be public, but that's |
Sorry for the terribly dumb question about this. Are you meaning that, at this stage, all that is required is:
Just I have read this now 10 times or so and keep thinking more must be I have attached a patch that does this. Though at this time it is I have executed:
And all proceed well. |
That's should be all that's needed to expose the existing API, as is. http://svn.effbot.org/public/stuff/sandbox/pytoken/ Make sure you remove the local copy of "tokenizer.h" that's present in |
Did that and it builds fine. So my test procedure was:
Build platform: Ubuntu , gcc 4.2.3 All works fine. thanks for the extra test files.
|
It would be nice if this same C API was used to implement the 'tokenize' module. Issues like bpo-2180 will potentially require bug fixes in two places :-/ |
The previously posted patch has become outdated due to signature changes staring with revision 89f4293 on Nov 12, 2009. Attached is an updated patch. Can it also be confirmed what are the outstanding items for this patch to be applied? Based on the previous logs it's not clear if it's waiting for documentation on the struct tok_state or if there is another change requested. Thanks. |
From my read of this bug, there are two distinct tasks mentioned:
#1 is largely complete in Andrew's latest patch, but that will likely need:
#2 is, I think, a separate project. There may be good reasons *not* to do this which I'm not aware of, and barring such reasons the rewrite will be difficult and could potentially change behavior like bpo-2180. So I would suggest filing a new issue for #2 when #1 is complete. And I'll work on #1. |
Here's an updated patch for #1: Existing Patch:
New:
I'd like particular attention to the documentation for the tokenizer -- I'm not entirely confident that I have documented the functions correctly! In particular, I'm not sure how PyTokenizer_FromString handles encodings. There's a further iteration possible here, but it's beyond my understanding of the tokenizer and of possible uses of the API. That would be to expose some of the tokenizer state fields and document them, either as part of the limited ABI or even the stable API. In particular, there are about a half-dozen struct fields used by the parser, and those would be good candidates for addition to the public API. If that's desirable, I'd prefer to merge a revision of my patch first, and keep the issue open for subsequent improvement. |
New:
|
This seems to have stalled out after the PyCon sprints. Any chance the final patch can be reviewed? |
Could you submit a PR for this? I haven't seen any objections to this change, a PR will expose this to more people and a clear decision on whether this change is warranted can be finally made (I hope). |
If the patch still applies cleanly, I have no issues with you or anyone opening a PR. I picked this up several years ago at the PyCon sprints, and don't remember a thing about it, nor have I touched any other bit of the CPython source since then. So any merge conflicts would be very difficult for me to resolve. |
Okay, I'll take a look at it over the next days and try and submit a PR after fixing any issues that might be present. |
Please hold this until finishing bpo-25643. |
Thanks for linking the dependency, Serhiy :-) Is there anybody currently working on the other issue? Also, shouldn't both issues now get retagged to Python 3.7? |
I am working on the other issue (the recent patch is still not published). Sorry, but two issues modify the same code and are conflicting. Since I believe that this issue makes less semantic changes, I think it would be easier to rebase it after finishing bpo-25643 than do it in contrary order. |
That makes sense to me, I'll wait around until the dependency is resolved. |
Serhiy Storchaka is this still blocked? it's been a few years on either this or the linked issue and I'm reaching for this one :) |
I am -1 exposing the C-API of the tokenizer. For the new parser several modifications of the C tokenizer had to be done and some of them modify existing behaviour slightly. I don't want to corner ourselves in a place where we cannot make improvements because is a backwards incompatible change because the API is exposed. |
I'm interested in it because the |
I assumed, but I don't feel confortable exposing the built-in one. |
As an example of the situation, I want to avoid: every time we change anything in the AST because of internal details we have many complains and pressure from tool authors because they need to add branches or because it makes life more difficult for them it and I absolutely want to avoid more of that. |
you already have that right now because the it's much more frustrating when the two differ as well I don't think all the internals of the C tokenization need to be exposed, my main goals would be:
and the reasons would be:
Unlike the AST, the tokenization changes much less frequently (last major addition I can remember is the We can hide almost all of the details of the tokenization behind an opaque struct and getter functions |
For reimplementing Lib/tokenize.py we don't need to publicly expose anything in the C-API. We can have a private _tokenize module with uses whatever you need and then you use that _tokenize module in the tokenize.py file to reimplement the exact Python API that the module exposes. Publicly exposing the headers or APIs opens new boxes of potential problems: ABI stability, changes in the signatures, changes in the structs. Our experience so far with other parts is that almost always is painful to add optimization to internal functions that are partially exposed, so I am still not convinced offering public C-APIs for the builtin tokenizer. |
private api sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage) but if it isn't then even better |
We can make it builtin the same way we do for the _ast module, or we can have a new module under Modules (exposing the symbols in the dynamic table) **but** making them private (and not documented), which explicitly goes against what this issue proposes. |
Either works for me, would you be able to point me to the starting bits as to how |
https://github.com/python/cpython/blob/master/Python/Python-ast.c#L10075-L10079 and Line 84 in 6329893
But before that I have some questions. For example: How do you plan to implement the readline() interface that tokenize.py uses in the c-module without modifying tokenize.c? |
I haven't looked into or thought about that yet, it might not be possible It might also make sense to build new tokenize.py apis avoiding the |
Then we would need to maintain the old Python APIs + the new ones using the module? What you are proposing seems more than just speeding up tokenize.py re-using the existing c code |
I have built a draft of how the changes required to make what you describe, in case you want to finish them: |
Problems that you are going to find:
❯ python -c "1_" ❯ python -m tokenize <<< "1_"
|
Since 3.12 the Python |
Yup |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: