Issue 3353: make built-in tokenizer available via Python C API

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47603

classification

Title:	make built-in tokenizer available via Python C API
Type:	enhancement	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:	25643	Superseder:
Assigned To:		Nosy List:	Andrew.C, Anthony Sottile, Jim Fasarakis-Hilliard, amaury.forgeotdarc, berker.peksag, djmitche, effbot, kirkshorts, meador.inge, pablogsal, serhiy.storchaka, superluser
Priority:	normal	Keywords:	patch

Created on 2008-07-14 11:32 by effbot, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
issue3353.diff	kirkshorts, 2008-07-23 22:53	Patch to move the include file etc
82706ea73ada.diff	Andrew.C, 2014-06-22 18:36		review
issue3353.patch	djmitche, 2015-04-14 16:07	issue3353.patch	review
issue3353-2.patch	djmitche, 2015-04-14 18:06	issue3353-2.patch	review

Repositories containing patches
https://bitbucket.org/bloomberg/cpython/#issue3353

Messages (34)
msg69650 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2008-07-14 11:32
CPython provides a Python-level API to the parser, but not to the tokenizer itself. Somewhat annoyingly, it does provide a nice C API, but that's not properly exposed for external modules. To fix this, the tokenizer.h file should be moved from the Parser directory to the Include directory, and the (semi-public) functions that already available must be flagged with PyAPI_FUNC, as shown below. The PyAPI_FUNC fix should be non-intrusive enough to go into 2.6 and 3.0; moving stuff around is perhaps better left for a later release (which could also include a Python binding). Index: tokenizer.h =================================================================== --- tokenizer.h (revision 514) +++ tokenizer.h (working copy) @@ -54,10 +54,10 @@ const char* str; }; -extern struct tok_state PyTokenizer_FromString(const char ); -extern struct tok_state PyTokenizer_FromFile(FILE , char , char ); -extern void PyTokenizer_Free(struct tok_state ); -extern int PyTokenizer_Get(struct tok_state , char , char ); +PyAPI_FUNC(struct tok_state ) PyTokenizer_FromString(const char ); +PyAPI_FUNC(struct tok_state ) PyTokenizer_FromFile(FILE , char , char ); +PyAPI_FUNC(void) PyTokenizer_Free(struct tok_state ); +PyAPI_FUNC(int) PyTokenizer_Get(struct tok_state , char , char ); #ifdef __cplusplus }
msg70101 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-07-21 10:00
IMO the "struct tok_state" should not be part of the API, it contains too many implementation details. Or maybe as an opaque structure.
msg70102 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2008-07-21 10:03
There are a few things in the struct that needs to be public, but that's nothing that cannot be handled by documentation. No need to complicate the API just in case.
msg70181 - (view)	Author: Andy (kirkshorts)	Date: 2008-07-23 22:53
Sorry for the terribly dumb question about this. Are you meaning that, at this stage, all that is required is: 1. the application of the PyAPI_FUNC macro 2. move the file to the Include directory 3. update Makefile.pre.in to point to the new location Just I have read this now 10 times or so and keep thinking more must be involved :-) [certainly given my embarrassing start to the Python dev community re:asynchronous thread exceptions :-\| ] I have attached a patch that does this. Though at this time it is lacking any documentation that will state what parts of "struct tok_state" are private and public. I will need to trawl the code some more to do that. I have executed: - ./configure - make - make test And all proceed well.
msg70227 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2008-07-24 21:25
That's should be all that's needed to expose the existing API, as is. If you want to verify the build, you can grab the pytoken.c and setup.py files from this directory, and try building the module. http://svn.effbot.org/public/stuff/sandbox/pytoken/ Make sure you remove the local copy of "tokenizer.h" that's present in that directory before you build. If that module builds, all's well.
msg70305 - (view)	Author: Andy (kirkshorts)	Date: 2008-07-26 20:59
Did that and it builds fine. So my test procedure was: - checkout clean source - apply patch as per guidelines - remove the file Psrser/tokenizer.h () - ./configure - make - ./python setup.py install Build platform: Ubuntu , gcc 4.2.3 All works fine. thanks for the extra test files. - one question though. I removed the file using 'svn remove' but the diff makes it an empty file not removed why is that? (and is it correct?)
msg143717 - (view)	Author: Meador Inge (meador.inge) *	Date: 2011-09-08 01:47
It would be nice if this same C API was used to implement the 'tokenize' module. Issues like issue2180 will potentially require bug fixes in two places :-/
msg221293 - (view)	Author: Andrew C (Andrew.C) *	Date: 2014-06-22 18:35
The previously posted patch has become outdated due to signature changes staring with revision 89f4293 on Nov 12, 2009. Attached is an updated patch. Can it also be confirmed what are the outstanding items for this patch to be applied? Based on the previous logs it's not clear if it's waiting for documentation on the struct tok_state or if there is another change requested. Thanks.
msg240882 - (view)	Author: Dustin J. Mitchell (djmitche) *	Date: 2015-04-14 13:35
From my read of this bug, there are two distinct tasks mentioned: 1. make PyTokenizer_* part of the Python-level API 2. re-implement 'tokenize' in terms of that Python-level API #1 is largely complete in Andrew's latest patch, but that will likely need: * rebasing * hiding struct fields * documentation #2 is, I think, a separate project. There may be good reasons not to do this which I'm not aware of, and barring such reasons the rewrite will be difficult and could potentially change behavior like issue2180. So I would suggest filing a new issue for #2 when #1 is complete. And I'll work on #1.
msg240927 - (view)	Author: Dustin J. Mitchell (djmitche) *	Date: 2015-04-14 16:07
Here's an updated patch for #1: Existing Patch: - move tokenizer.h from Parser/ to Include/ - Add PyAPI_Func to export tokenizer functions New: - Removed unused, undefined PyTokenizer_RestoreEncoding - Include PyTokenizer_State with limited ABI compatibility (but still undocumented) - namespace the struct name (PyTokenizer_State) - Documentation I'd like particular attention to the documentation for the tokenizer -- I'm not entirely confident that I have documented the functions correctly! In particular, I'm not sure how PyTokenizer_FromString handles encodings. There's a further iteration possible here, but it's beyond my understanding of the tokenizer and of possible uses of the API. That would be to expose some of the tokenizer state fields and document them, either as part of the limited ABI or even the stable API. In particular, there are about a half-dozen struct fields used by the parser, and those would be good candidates for addition to the public API. If that's desirable, I'd prefer to merge a revision of my patch first, and keep the issue open for subsequent improvement.
msg240967 - (view)	Author: Dustin J. Mitchell (djmitche) *	Date: 2015-04-14 18:06
New: - rename token symbols in token.h with a PYTOK_ prefix - include an example of using the PyTokenizer functions - address minor review comments
msg245939 - (view)	Author: Dustin J. Mitchell (djmitche) *	Date: 2015-06-29 14:41
This seems to have stalled out after the PyCon sprints. Any chance the final patch can be reviewed?
msg289535 - (view)	Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) *	Date: 2017-03-13 10:03
Could you submit a PR for this? I haven't seen any objections to this change, a PR will expose this to more people and a clear decision on whether this change is warranted can be finally made (I hope).
msg289537 - (view)	Author: Dustin J. Mitchell (djmitche) *	Date: 2017-03-13 12:45
If the patch still applies cleanly, I have no issues with you or anyone opening a PR. I picked this up several years ago at the PyCon sprints, and don't remember a thing about it, nor have I touched any other bit of the CPython source since then. So any merge conflicts would be very difficult for me to resolve.
msg289584 - (view)	Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) *	Date: 2017-03-14 13:46
Okay, I'll take a look at it over the next days and try and submit a PR after fixing any issues that might be present.
msg289585 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-03-14 13:53
Please hold this until finishing issue25643.
msg289587 - (view)	Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) *	Date: 2017-03-14 13:59
Thanks for linking the dependency, Serhiy :-) Is there anybody currently working on the other issue? Also, shouldn't both issues now get retagged to Python 3.7?
msg289590 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-03-14 14:24
I am working on the other issue (the recent patch is still not published). Sorry, but two issues modify the same code and are conflicting. Since I believe that this issue makes less semantic changes, I think it would be easier to rebase it after finishing issue25643 than do it in contrary order.
msg289591 - (view)	Author: Jim Fasarakis-Hilliard (Jim Fasarakis-Hilliard) *	Date: 2017-03-14 14:28
That makes sense to me, I'll wait around until the dependency is resolved.
msg385736 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-26 21:33
Serhiy Storchaka is this still blocked? it's been a few years on either this or the linked issue and I'm reaching for this one :)
msg385756 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 10:49
I am -1 exposing the C-API of the tokenizer. For the new parser several modifications of the C tokenizer had to be done and some of them modify existing behaviour slightly. I don't want to corner ourselves in a place where we cannot make improvements because is a backwards incompatible change because the API is exposed.
msg385788 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-27 16:47
I'm interested in it because the `tokenize` module is painfully slow
msg385790 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 17:05
> I'm interested in it because the `tokenize` module is painfully slow I assumed, but I don't feel confortable exposing the built-in one.
msg385791 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 17:10
> I assumed, but I don't feel confortable exposing the built-in one. As an example of the situation, I want to avoid: every time we change anything in the AST because of internal details we have many complains and pressure from tool authors because they need to add branches or because it makes life more difficult for them it and I absolutely want to avoid more of that.
msg385792 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-27 17:18
you already have that right now because the `tokenize` module is exposed. (except that every change to the tokenization requires it to be implemented once in C and once in python) it's much more frustrating when the two differ as well I don't think all the internals of the C tokenization need to be exposed, my main goals would be: - expose enough information to reimplement Lib/tokenize.py - replace Lib/tokenize.py with the C tokenizer and the reasons would be: - eliminate the (potential) drift and complexity between the two - get a fast tokenizer Unlike the AST, the tokenization changes much less frequently (last major addition I can remember is the `@` operator We can hide almost all of the details of the tokenization behind an opaque struct and getter functions
msg385793 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 17:36
For reimplementing Lib/tokenize.py we don't need to publicly expose anything in the C-API. We can have a private _tokenize module with uses whatever you need and then you use that _tokenize module in the tokenize.py file to reimplement the exact Python API that the module exposes. Publicly exposing the headers or APIs opens new boxes of potential problems: ABI stability, changes in the signatures, changes in the structs. Our experience so far with other parts is that almost always is painful to add optimization to internal functions that are partially exposed, so I am still not convinced offering public C-APIs for the builtin tokenizer.
msg385794 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-27 17:43
private api sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage) but if it isn't then even better
msg385795 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 17:47
> private api sounds fine too -- I thought it was necessary to implement the module (as it needs external linkage) but if it isn't then even better We can make it builtin the same way we do for the _ast module, or we can have a new module under Modules (exposing the symbols in the dynamic table) but making them private (and not documented), which explicitly goes against what this issue proposes.
msg385796 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-27 18:08
Either works for me, would you be able to point me to the starting bits as to how `_ast` becomes builtin?
msg385797 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 18:13
> Either works for me, would you be able to point me to the starting bits as to how `_ast` becomes builtin? https://github.com/python/cpython/blob/master/Python/Python-ast.c#L10075-L10079 and https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/PC/config.c#L84 But before that I have some questions. For example: How do you plan to implement the readline() interface that tokenize.py uses in the c-module without modifying tokenize.c?
msg385798 - (view)	Author: Anthony Sottile (Anthony Sottile) *	Date: 2021-01-27 18:19
I haven't looked into or thought about that yet, it might not be possible It might also make sense to build new tokenize.py apis avoiding the `readline()` api -- I always found it painful to work with
msg385799 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 18:22
> It might also make sense to build new tokenize.py apis avoiding the `readline()` api -- I always found it painful to work with Then we would need to maintain the old Python APIs + the new ones using the module? What you are proposing seems more than just speeding up tokenize.py re-using the existing c code
msg385808 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 20:58
I have built a draft of how the changes required to make what you describe, in case you want to finish them: https://github.com/pablogsal/cpython/tree/tokenizer_mod
msg385811 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-01-27 21:14
Problems that you are going to find: * The c tokenizer throws syntax errors while the tokenizer module does not. For example: ❯ python -c "1_" File "<string>", line 1 1_ ^ SyntaxError: invalid decimal literal ❯ python -m tokenize <<< "1_" 1,0-1,1: NUMBER '1' 1,1-1,2: NAME '_' 1,2-1,3: NEWLINE '\n' 2,0-2,0: ENDMARKER '' * The encoding cannot be immediately specified. You need to thread it in many places. * The readline() function can now return whatever or be whatever, that needs to be handled (better) in the c tokenizer to not crash. * str/bytes in the c tokenizer. * The c tokenizer does not get the full line in some cases or is tricky to get the full line.

History
Date	User	Action	Args
2022-04-11 14:56:36	admin	set	github: 47603
2021-01-27 21:14:20	pablogsal	set	messages: + msg385811
2021-01-27 20:58:53	pablogsal	set	messages: + msg385808
2021-01-27 18:22:11	pablogsal	set	messages: + msg385799
2021-01-27 18:19:20	Anthony Sottile	set	messages: + msg385798
2021-01-27 18:13:38	pablogsal	set	messages: + msg385797
2021-01-27 18:08:34	Anthony Sottile	set	messages: + msg385796
2021-01-27 17:47:38	pablogsal	set	messages: + msg385795
2021-01-27 17:43:25	Anthony Sottile	set	messages: + msg385794
2021-01-27 17:36:58	pablogsal	set	messages: + msg385793
2021-01-27 17:18:04	Anthony Sottile	set	messages: + msg385792
2021-01-27 17:10:10	pablogsal	set	messages: + msg385791
2021-01-27 17:05:53	pablogsal	set	messages: + msg385790
2021-01-27 16:47:49	Anthony Sottile	set	messages: + msg385788
2021-01-27 10:49:44	pablogsal	set	nosy: + pablogsal messages: + msg385756
2021-01-26 21:33:09	Anthony Sottile	set	nosy: + Anthony Sottile messages: + msg385736
2017-03-14 14:28:59	Jim Fasarakis-Hilliard	set	messages: + msg289591
2017-03-14 14:24:49	serhiy.storchaka	set	messages: + msg289590 versions: + Python 3.7, - Python 3.6
2017-03-14 13:59:44	Jim Fasarakis-Hilliard	set	messages: + msg289587
2017-03-14 13:53:15	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg289585
2017-03-14 13:52:27	serhiy.storchaka	set	dependencies: + Python tokenizer rewriting
2017-03-14 13:46:39	Jim Fasarakis-Hilliard	set	messages: + msg289584
2017-03-13 12:45:42	djmitche	set	messages: + msg289537
2017-03-13 10:03:04	Jim Fasarakis-Hilliard	set	nosy: + Jim Fasarakis-Hilliard messages: + msg289535
2015-11-14 12:36:26	berker.peksag	set	nosy: + berker.peksag versions: + Python 3.6, - Python 3.5
2015-11-06 03:12:09	superluser	set	nosy: + superluser
2015-06-29 14:41:47	djmitche	set	messages: + msg245939
2015-04-15 02:36:41	ned.deily	set	stage: test needed -> patch review
2015-04-14 18:06:33	djmitche	set	files: + issue3353-2.patch messages: + msg240967
2015-04-14 16:08:00	djmitche	set	files: + issue3353.patch messages: + msg240927
2015-04-14 13:35:08	djmitche	set	nosy: + djmitche messages: + msg240882
2014-06-22 18:36:54	Andrew.C	set	files: + 82706ea73ada.diff
2014-06-22 18:35:28	Andrew.C	set	hgrepos: + hgrepo260 messages: + msg221293 nosy: + Andrew.C
2014-06-20 16:09:30	zach.ware	set	versions: + Python 3.5, - Python 3.2
2011-09-08 01:47:33	meador.inge	set	nosy: + meador.inge messages: + msg143717
2010-08-09 18:36:46	terry.reedy	set	versions: - Python 2.7
2009-05-16 20:35:06	ajaksu2	set	priority: normal stage: test needed versions: + Python 3.2, - Python 2.6, Python 3.0
2008-07-26 20:59:58	kirkshorts	set	messages: + msg70305
2008-07-24 21:25:36	effbot	set	messages: + msg70227
2008-07-23 22:53:05	kirkshorts	set	files: + issue3353.diff nosy: + kirkshorts messages: + msg70181 keywords: + patch
2008-07-21 10:03:44	effbot	set	messages: + msg70102
2008-07-21 10:00:39	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg70101
2008-07-14 11:32:15	effbot	create