
make built-in tokenizer available via Python C API #47603

Closed
effbot mannequin opened this issue Jul 14, 2008 · 36 comments
Labels
3.12 (bugs and security fixes) · interpreter-core (Objects, Python, Grammar, and Parser dirs) · type-feature (A feature request or enhancement)

Comments

@effbot
Mannequin

effbot mannequin commented Jul 14, 2008

BPO 3353
Nosy @amauryfa, @meadori, @berkerpeksag, @serhiy-storchaka, @asottile, @DimitrisJim, @pablogsal
Dependencies
  • bpo-25643: Python tokenizer rewriting

Files
  • issue3353.diff: patch to move the include file etc.
  • 82706ea73ada.diff
  • issue3353.patch
  • issue3353-2.patch

  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-07-14.11:32:15.414>
    labels = ['interpreter-core', 'type-feature', '3.7']
    title = 'make built-in tokenizer available via Python C API'
    updated_at = <Date 2021-01-27.21:14:20.006>
    user = 'https://bugs.python.org/effbot'

    bugs.python.org fields:

    activity = <Date 2021-01-27.21:14:20.006>
    actor = 'pablogsal'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2008-07-14.11:32:15.414>
    creator = 'effbot'
    dependencies = ['25643']
    files = ['10961', '35730', '38992', '38999']
    hgrepos = ['260']
    issue_num = 3353
    keywords = ['patch']
    message_count = 34.0
    messages = ['69650', '70101', '70102', '70181', '70227', '70305', '143717', '221293', '240882', '240927', '240967', '245939', '289535', '289537', '289584', '289585', '289587', '289590', '289591', '385736', '385756', '385788', '385790', '385791', '385792', '385793', '385794', '385795', '385796', '385797', '385798', '385799', '385808', '385811']
    nosy_count = 12.0
    nosy_names = ['effbot', 'amaury.forgeotdarc', 'djmitche', 'kirkshorts', 'meador.inge', 'berker.peksag', 'serhiy.storchaka', 'superluser', 'Andrew.C', 'Anthony Sottile', 'Jim Fasarakis-Hilliard', 'pablogsal']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue3353'
    versions = ['Python 3.7']

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 14, 2008

    CPython provides a Python-level API to the parser, but not to the
    tokenizer itself. Somewhat annoyingly, it does provide a nice C API,
    but that's not properly exposed for external modules.

    To fix this, the tokenizer.h file should be moved from the Parser
    directory to the Include directory, and the (semi-public) functions that
    are already available must be flagged with PyAPI_FUNC, as shown below.

    The PyAPI_FUNC fix should be non-intrusive enough to go into 2.6 and
    3.0; moving stuff around is perhaps better left for a later release
    (which could also include a Python binding).

    Index: tokenizer.h
    ===================================================================

    --- tokenizer.h (revision 514)
    +++ tokenizer.h (working copy)
    @@ -54,10 +54,10 @@
            const char* str;
     };

    -extern struct tok_state *PyTokenizer_FromString(const char *);
    -extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
    -extern void PyTokenizer_Free(struct tok_state *);
    -extern int PyTokenizer_Get(struct tok_state *, char **, char **);
    +PyAPI_FUNC(struct tok_state *) PyTokenizer_FromString(const char *);
    +PyAPI_FUNC(struct tok_state *) PyTokenizer_FromFile(FILE *, char *, char *);
    +PyAPI_FUNC(void) PyTokenizer_Free(struct tok_state *);
    +PyAPI_FUNC(int) PyTokenizer_Get(struct tok_state *, char **, char **);

     #ifdef __cplusplus
     }
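
    For context, a minimal sketch of how an extension module could drive these functions once they are exported (error handling elided; this assumes the header is installed under Include/ as proposed, and that ENDMARKER and ERRORTOKEN come from token.h):

        #include <stdio.h>
        #include "tokenizer.h"
        #include "token.h"

        /* Walk a source string and print each token's type and text. */
        static void
        dump_tokens(const char *source)
        {
            struct tok_state *tok = PyTokenizer_FromString(source);
            if (tok == NULL)
                return;
            for (;;) {
                char *start, *end;
                int type = PyTokenizer_Get(tok, &start, &end);
                if (type == ENDMARKER || type == ERRORTOKEN)
                    break;
                printf("%d: %.*s\n", type, (int)(end - start), start);
            }
            PyTokenizer_Free(tok);
        }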

    @effbot effbot mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Jul 14, 2008
    @amauryfa
    Member

    IMO the "struct tok_state" should not be part of the API, it contains
    too many implementation details. Or maybe as an opaque structure.
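
    For illustration, the opaque-structure idea would mean the public header only forward-declares the type, keeping the fields private to the implementation; a hypothetical sketch (the names here are not from any patch on this issue):

        /* Public header: callers get only a forward declaration. */
        typedef struct tok_state PyTokenizer_State;

        /* Private source (e.g. Parser/tokenizer.c): the full definition,
           with all its implementation details, stays internal. */
        struct tok_state {
            char *buf;      /* input buffer */
            int lineno;     /* current line number */
            /* ... many more implementation details ... */
        };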

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 21, 2008

    There are a few things in the struct that need to be public, but that's
    nothing that cannot be handled by documentation. No need to complicate
    the API just in case.

    @kirkshorts
    Mannequin

    kirkshorts mannequin commented Jul 23, 2008

    Sorry for the terribly dumb question about this.

    Do you mean that, at this stage, all that is required is:

    1. the application of the PyAPI_FUNC macro
    2. move the file to the Include directory
    3. update Makefile.pre.in to point to the new location

    It's just that I have read this ten times or so now and keep thinking
    more must be involved :-) [certainly given my embarrassing start to the
    Python dev community re: asynchronous thread exceptions :-| ]

    I have attached a patch that does this, though at this time it lacks
    any documentation stating which parts of "struct tok_state" are private
    and which are public. I will need to trawl the code some more to do that.

    I have executed:

    • ./configure
    • make
    • make test

    And all proceed well.

    @effbot
    Mannequin Author

    effbot mannequin commented Jul 24, 2008

    That should be all that's needed to expose the existing API as-is.
    If you want to verify the build, you can grab the pytoken.c and setup.py
    files from this directory and try building the module:

    http://svn.effbot.org/public/stuff/sandbox/pytoken/

    Make sure you remove the local copy of "tokenizer.h" that's present in
    that directory before you build. If that module builds, all's well.

    @kirkshorts
    Mannequin

    kirkshorts mannequin commented Jul 26, 2008

    Did that and it builds fine.

    So my test procedure was:

    • checkout clean source
    • apply patch as per guidelines
    • remove the file Parser/tokenizer.h (*)
    • ./configure
    • make
    • ./python setup.py install

    Build platform: Ubuntu, gcc 4.2.3

    All works fine.

    Thanks for the extra test files.

      • One question, though: I removed the file using 'svn remove', but the
        diff shows it as an empty file rather than removed. Why is that?
        (And is it correct?)

    @meadori
    Member

    meadori commented Sep 8, 2011

    It would be nice if this same C API were used to implement the 'tokenize' module. Issues like bpo-2180 will potentially require bug fixes in two places :-/

    @AndrewC
    Mannequin

    AndrewC mannequin commented Jun 22, 2014

    The previously posted patch has become outdated due to signature changes starting with revision 89f4293 on Nov 12, 2009. Attached is an updated patch.

    Can it also be confirmed what the outstanding items are for this patch to be applied? Based on the previous logs, it's not clear whether it's waiting for documentation on struct tok_state or whether another change was requested. Thanks.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    From my read of this bug, there are two distinct tasks mentioned:

    1. make PyTokenizer_* part of the Python-level API
    2. re-implement 'tokenize' in terms of that Python-level API

    #1 is largely complete in Andrew's latest patch, but that will likely need:

    • rebasing
    • hiding struct fields
    • documentation

    #2 is, I think, a separate project. There may be good reasons *not* to do this that I'm not aware of; even barring such reasons, the rewrite will be difficult and could potentially change behavior, as in bpo-2180. So I would suggest filing a new issue for #2 when #1 is complete. And I'll work on #1.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    Here's an updated patch for #1:

    Existing Patch:

    • move tokenizer.h from Parser/ to Include/
    • Add PyAPI_FUNC to export tokenizer functions

    New:

    • Remove the unused, undefined PyTokenizer_RestoreEncoding
    • Include PyTokenizer_State with limited-ABI compatibility (but still undocumented)
    • Namespace the struct name (PyTokenizer_State)
    • Add documentation

    I'd like particular attention paid to the documentation for the tokenizer -- I'm not entirely confident that I have documented the functions correctly! In particular, I'm not sure how PyTokenizer_FromString handles encodings.

    There's a further iteration possible here, but it's beyond my understanding of the tokenizer and of possible uses of the API. That would be to expose some of the tokenizer state fields and document them, either as part of the limited ABI or even the stable API. In particular, there are about a half-dozen struct fields used by the parser, and those would be good candidates for addition to the public API.

    If that's desirable, I'd prefer to merge a revision of my patch first, and keep the issue open for subsequent improvement.

    @djmitche
    Mannequin

    djmitche mannequin commented Apr 14, 2015

    New:

    • rename token symbols in token.h with a PYTOK_ prefix (illustrated below)
    • include an example of using the PyTokenizer functions
    • address minor review comments
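
    For illustration only (the exact names in the patch may differ), the rename turns the bare token macros in Include/token.h into prefixed ones, so they no longer collide with user code:

        /* Before (Include/token.h): */
        #define ENDMARKER  0
        #define NAME       1
        #define NUMBER     2

        /* After the proposed PYTOK_ rename: */
        #define PYTOK_ENDMARKER  0
        #define PYTOK_NAME       1
        #define PYTOK_NUMBER     2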

    @djmitche
    Mannequin

    djmitche mannequin commented Jun 29, 2015

    This seems to have stalled out after the PyCon sprints. Any chance the final patch can be reviewed?

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 13, 2017

    Could you submit a PR for this?

    I haven't seen any objections to this change; a PR will expose it to more people, and a clear decision on whether the change is warranted can finally be made (I hope).

    @djmitche
    Mannequin

    djmitche mannequin commented Mar 13, 2017

    If the patch still applies cleanly, I have no issues with you or anyone opening a PR. I picked this up several years ago at the PyCon sprints, and don't remember a thing about it, nor have I touched any other bit of the CPython source since then. So any merge conflicts would be very difficult for me to resolve.

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    Okay, I'll take a look at it over the next few days and try to submit a PR after fixing any issues that might be present.

    @serhiy-storchaka
    Member

    Please hold this until bpo-25643 is finished.

    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    Thanks for linking the dependency, Serhiy :-)

    Is there anybody currently working on the other issue? Also, shouldn't both issues now get retagged to Python 3.7?

    @serhiy-storchaka
    Member

    I am working on the other issue (the most recent patch is not yet published). Sorry, but the two issues modify the same code and conflict. Since I believe this issue makes fewer semantic changes, I think it would be easier to rebase it after finishing bpo-25643 than to do it in the opposite order.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 14, 2017
    @DimitrisJim
    Mannequin

    DimitrisJim mannequin commented Mar 14, 2017

    That makes sense to me; I'll wait around until the dependency is resolved.

    @asottile
    Mannequin

    asottile mannequin commented Jan 26, 2021

    Serhiy Storchaka, is this still blocked? It's been a few years on both this and the linked issue, and I'm reaching for this one :)

    @pablogsal
    Member

    I am -1 on exposing the C API of the tokenizer. For the new parser, several modifications of the C tokenizer had to be made, and some of them modify existing behaviour slightly. I don't want to corner ourselves in a place where we cannot make improvements because they would be backwards-incompatible changes to an exposed API.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    I'm interested in it because the tokenize module is painfully slow

    @pablogsal
    Member

    I'm interested in it because the tokenize module is painfully slow

    I assumed, but I don't feel comfortable exposing the built-in one.

    @pablogsal
    Member

    I assumed, but I don't feel comfortable exposing the built-in one.

    As an example of the situation I want to avoid: every time we change anything in the AST because of internal details, we get many complaints and pressure from tool authors, because they need to add branches or because it makes life more difficult for them, and I absolutely want to avoid more of that.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    You already have that right now, because the tokenize module is exposed (except that every change to the tokenization has to be implemented once in C and once in Python).

    It's much more frustrating when the two differ, as well.

    I don't think all the internals of the C tokenization need to be exposed; my main goals, and the reasons for them, would be:

    • eliminate the (potential) drift and complexity between the two
    • get a fast tokenizer

    Unlike the AST, the tokenization changes much less frequently (the last major addition I can remember is the @ operator).

    We can hide almost all of the details of the tokenization behind an opaque struct and getter functions.
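
    As a sketch of that idea (every getter name below is hypothetical, not from any patch on this issue), the public header would expose only the opaque type plus accessors for the handful of fields consumers actually need, building on the opaque layout sketched earlier in the thread:

        typedef struct tok_state PyTokenizer_State;   /* opaque to callers */

        /* Lifecycle, mirroring the existing semi-public functions: */
        PyTokenizer_State *PyTokenizer_FromString(const char *str);
        int PyTokenizer_Get(PyTokenizer_State *tok, char **start, char **end);
        void PyTokenizer_Free(PyTokenizer_State *tok);

        /* Hypothetical getters instead of exposed struct fields: */
        int PyTokenizer_GetLineNumber(const PyTokenizer_State *tok);
        const char *PyTokenizer_GetEncoding(const PyTokenizer_State *tok);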

    @pablogsal
    Member

    For reimplementing Lib/tokenize.py we don't need to publicly expose anything in the C API. We can have a private _tokenize module which uses whatever you need, and then use that _tokenize module in the tokenize.py file to reimplement the exact Python API that the module exposes.

    Publicly exposing the headers or APIs opens new boxes of potential problems: ABI stability, changes in the signatures, changes in the structs. Our experience so far with other parts is that it is almost always painful to add optimizations to internal functions that are partially exposed, so I am still not convinced about offering public C APIs for the built-in tokenizer.
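
    For illustration, a private extension module along those lines could start from the usual CPython extension boilerplate; this is a minimal hypothetical skeleton, not the _tokenize module CPython later grew:

        #include <Python.h>

        /* Would wrap the internal C tokenizer; stubbed out here. */
        static PyObject *
        tokenize_next_token(PyObject *self, PyObject *args)
        {
            Py_RETURN_NONE;  /* placeholder: return the next token */
        }

        static PyMethodDef tokenize_methods[] = {
            {"next_token", tokenize_next_token, METH_VARARGS,
             "Private helper for Lib/tokenize.py."},
            {NULL, NULL, 0, NULL}
        };

        static struct PyModuleDef tokenizemodule = {
            PyModuleDef_HEAD_INIT, "_tokenize", NULL, -1, tokenize_methods
        };

        PyMODINIT_FUNC
        PyInit__tokenize(void)
        {
            return PyModule_Create(&tokenizemodule);
        }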

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    Private API sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage), but if it isn't, then even better.

    @pablogsal
    Member

    Private API sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage), but if it isn't, then even better.

    We can make it a builtin the same way we do for the _ast module, or we can have a new module under Modules/ (exposing the symbols in the dynamic table) **but** make them private (and not documented), which explicitly goes against what this issue proposes.

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    Either works for me. Would you be able to point me to the starting bits as to how _ast becomes a builtin?

    @pablogsal
    Member

    Either works for me. Would you be able to point me to the starting bits as to how _ast becomes a builtin?

    https://github.com/python/cpython/blob/master/Python/Python-ast.c#L10075-L10079

    and

    struct _inittab _PyImport_Inittab[] = {
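
    For reference, built-in modules are wired in through that import init table (the real table is generated into Modules/config.c); a hedged sketch of what a hypothetical _tokenize entry could look like next to _ast:

        extern PyObject *PyInit__ast(void);
        extern PyObject *PyInit__tokenize(void);   /* hypothetical new entry */

        struct _inittab _PyImport_Inittab[] = {
            /* ... existing entries ... */
            {"_ast", PyInit__ast},
            {"_tokenize", PyInit__tokenize},       /* hypothetical */
            {0, 0}
        };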

    But before that I have some questions. For example: how do you plan to implement the readline() interface that tokenize.py uses in the C module without modifying tokenize.c?

    @asottile
    Mannequin

    asottile mannequin commented Jan 27, 2021

    I haven't looked into or thought about that yet; it might not be possible.

    It might also make sense to build new tokenize.py APIs avoiding the readline() API -- I always found it painful to work with.

    @pablogsal
    Member

    It might also make sense to build new tokenize.py APIs avoiding the readline() API -- I always found it painful to work with.

    Then we would need to maintain the old Python APIs plus the new ones using the module? What you are proposing seems like more than just speeding up tokenize.py by reusing the existing C code.

    @pablogsal
    Member

    I have built a draft of the changes required to make what you describe work, in case you want to finish them:

    https://github.com/pablogsal/cpython/tree/tokenizer_mod

    @pablogsal
    Member

    Problems that you are going to find:

    • The C tokenizer throws syntax errors while the tokenize module does not. For example:

    ❯ python -c "1_"
    File "<string>", line 1
    1_
    ^
    SyntaxError: invalid decimal literal

    ❯ python -m tokenize <<< "1_"
    1,0-1,1: NUMBER '1'
    1,1-1,2: NAME '_'
    1,2-1,3: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    • The encoding cannot be immediately specified; you need to thread it through in many places.

    • The readline() function can now return anything, or be anything, and that needs to be handled (better) in the C tokenizer so it does not crash (see the sketch after this list).

    • str vs. bytes handling in the C tokenizer.

    • The C tokenizer does not get the full line in some cases, or it is tricky to get the full line.
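
    On the readline() point, a hedged sketch of the kind of defensive check the C side would need before trusting the callable's return value (a hypothetical helper, not code from the draft branch above):

        /* Call a Python readline() object and validate the result. */
        static PyObject *
        call_readline_checked(PyObject *readline)
        {
            PyObject *line = PyObject_CallNoArgs(readline);
            if (line == NULL)
                return NULL;                      /* propagate the Python error */
            if (!PyBytes_Check(line) && !PyUnicode_Check(line)) {
                PyErr_Format(PyExc_TypeError,
                             "readline() returned %.200s, expected str or bytes",
                             Py_TYPE(line)->tp_name);
                Py_DECREF(line);
                return NULL;
            }
            return line;
        }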

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.12 bugs and security fixes and removed 3.7 (EOL) end of life labels Sep 12, 2022
    @lysnikolaou
    Contributor

    Since 3.12, the Python tokenize module uses the C tokenizer internally. Is this enough to close this issue?

    @pablogsal
    Member

    Yup
