classification
Title: tokenize module should have a unicode API
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Devin Jeanpierre, barry, eric.araujo, eric.snow, mark.dickinson, martin.panter, mbussonn, meador.inge, michael.foord, petri.lehtinen, serhiy.storchaka, takluyver, terry.reedy, trent, vstinner, willingc
Priority: normal Keywords: patch

Created on 2011-07-04 03:58 by Devin Jeanpierre, last changed 2018-06-05 17:51 by takluyver. This issue is now closed.

Files
File name Uploaded Description Edit
tokenize_str.diff serhiy.storchaka, 2012-10-15 20:48 review
tokenize_str_2.diff serhiy.storchaka, 2015-10-05 06:04 review
Pull Requests
URL Status Linked Edit
PR 6957 merged takluyver, 2018-05-17 20:40
Messages (19)
msg139733 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 03:58
tokenize only deals with bytes. Users might want to deal with unicode source (for example, if Python source is embedded in a document with an already-known encoding).

The naive approach might be something like:

  def my_readline():
      return my_oldreadline().encode('utf-8')

But this doesn't work for Python source that declares its encoding, which might be something other than utf-8. The only safe ways are either to manually add a coding line yourself (there are lots of ways; I picked a dumb one):

  def my_readline_safe(was_read=[]):
      if not was_read:
          was_read.append(True)
          return b'# coding: utf-8\n'
      return my_oldreadline().encode('utf-8')
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)

Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

    tokenstream = tokenize.utokenize(my_oldreadline)
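(For reference, a minimal sketch of what this looks like with the API that was eventually blessed by this issue: generate_tokens() accepts a readline callable that yields str, so no encoding juggling is needed. The source string here is illustrative.)

```python
import io
import tokenize

source = "x = 1\n"
# generate_tokens() takes a readline callable returning str,
# so no manual encoding or coding-cookie handling is required.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tok.string for tok in tokens if tok.type == tokenize.NAME])  # ['x']
```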
msg140050 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-07-09 05:34
Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like object. The more recent iterator protocol would let the source be anything. A modern tokenize function should accept an iterable of strings. This would include, but not be limited to, a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does several things bundled together.
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer, be unbundled and exposed to users. I agree with this request. Any user who starts with actual Py3 code would benefit.

(Compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ...The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token,". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2.
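(The four bundled steps can be reproduced by hand with public pieces of the module — tokenize.detect_encoding() and generate_tokens() are real APIs; the byte string here is an illustrative stand-in for a file opened in binary mode.)

```python
import io
import tokenize

raw = b"# coding: utf-8\ns = 'caf\xc3\xa9'\n"
buf = io.BytesIO(raw)

# Step 2: detect the encoding from the coding cookie (or BOM).
encoding, first_lines = tokenize.detect_encoding(buf.readline)

# Steps 1+3: read the remaining bytes and decode everything to str.
text = b"".join(first_lines + [buf.read()]).decode(encoding)

# Step 4: actually tokenize the decoded text.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
print(encoding)  # utf-8
```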
msg140055 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-09 09:03
The compiler has a PyCF_SOURCE_IS_UTF8 flag: see compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE.

Patching tokenize to support Unicode is simple: use the PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8.

Rewriting the parser to work directly on Unicode is much more complex, and I don't think that we need that.
msg173001 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-15 20:48
The patch to allow tokenize() to accept strings is very simple, only 4 lines. But it requires a lot of documentation changes.

Then we can get rid of the undocumented generate_tokens(). Note that the stdlib and tools use only generate_tokens(); none uses tokenize(). Of course, it would be better if tokenize() worked with the iterator protocol.

Here is a preliminary patch. I will be thankful for the help with the documentation and for the discussion.

msg178473 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-12-29 05:23
See also issue9969.
msg252300 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 03:27
I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

However, I wonder if a separate function would be better for the text-mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the ENCODING token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?
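(The distinction described here is observable today — a small sketch: bytes-mode tokenize() emits an ENCODING token first, while text-mode generate_tokens() does not.)

```python
import io
import tokenize

src = "x = 1\n"

btoks = list(tokenize.tokenize(io.BytesIO(src.encode("utf-8")).readline))
stoks = list(tokenize.generate_tokens(io.StringIO(src).readline))

print(btoks[0].type == tokenize.ENCODING)  # True: bytes mode starts with ENCODING
print(stoks[0].type == tokenize.ENCODING)  # False: text mode has no ENCODING token
```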
msg252305 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-10-05 06:01
Thank you for your review Martin.

Here is a rebased patch that addresses Martin's comments.

I agree that having untokenize() change its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps another issue.
msg252309 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 07:35
I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions.
msg313591 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-03-11 09:22
> Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API.
msg316982 - (view) Author: Matthias Bussonnier (mbussonn) * Date: 2018-05-17 20:28
> Why not just bless the existing generate_tokens() function as a public API, 

Yes please, or just make the private `_tokenize` public under another name. The `tokenize.tokenize` method tries to magically detect the encoding, which may be unnecessary.
msg317004 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-18 07:01
The old generate_tokens() was renamed to tokenize() in issue719888 because the latter is a better name. Is "generate_tokens" considered a good name now?
msg317010 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 07:48
I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released.

If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API.
msg317011 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-18 07:59
My concern is that we will have two functions with dissimilar names (tokenize() and generate_tokens()) that do virtually the same thing but accept different types of input (bytes or str), and a single function untokenize() that produces a different type of result depending on its input. This doesn't look like good design to me.
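(A sketch of the asymmetry in question: untokenize() returns bytes when the token stream starts with an ENCODING token, and str when it does not.)

```python
import io
import tokenize

src = "x = 1\n"

# Bytes path: tokenize() emits an ENCODING token, so untokenize() returns bytes.
btoks = tokenize.tokenize(io.BytesIO(src.encode("utf-8")).readline)
round_tripped_bytes = tokenize.untokenize(btoks)
print(type(round_tripped_bytes))  # <class 'bytes'>

# Text path: generate_tokens() has no ENCODING token, so untokenize() returns str.
stoks = tokenize.generate_tokens(io.StringIO(src).readline)
round_tripped_str = tokenize.untokenize(stoks)
print(type(round_tripped_str))  # <class 'str'>
```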
msg317018 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 08:21
I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests.

This and issue 9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle.
msg317020 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-05-18 08:53
Don’t forget about updating __all__.
msg317021 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 08:56
Thanks - I had forgotten it, just fixed it now.
msg317912 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-28 20:09
The tests on PR #6957 are passing now, if anyone has time to have a look. :-)
msg318775 - (view) Author: Carol Willing (willingc) * (Python committer) Date: 2018-06-05 17:26
New changeset c56b17bd8c7a3fd03859822246633d2c9586f8bd by Carol Willing (Thomas Kluyver) in branch 'master':
bpo-12486: Document tokenize.generate_tokens() as public API (#6957)
https://github.com/python/cpython/commit/c56b17bd8c7a3fd03859822246633d2c9586f8bd
msg318778 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-06-05 17:51
Thanks Carol :-)
History
Date User Action Args
2018-06-05 17:51:27takluyversetmessages: + msg318778
2018-06-05 17:30:56willingcsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-06-05 17:26:41willingcsetnosy: + willingc
messages: + msg318775
2018-05-28 20:09:06takluyversetmessages: + msg317912
2018-05-18 08:56:08takluyversetmessages: + msg317021
2018-05-18 08:53:49martin.pantersetmessages: + msg317020
2018-05-18 08:21:22takluyversetmessages: + msg317018
2018-05-18 07:59:47serhiy.storchakasetmessages: + msg317011
2018-05-18 07:48:49takluyversetmessages: + msg317010
2018-05-18 07:04:20serhiy.storchakasetnosy: + barry, mark.dickinson, trent, michael.foord

versions: + Python 3.8, - Python 3.6
2018-05-18 07:01:30serhiy.storchakasetmessages: + msg317004
2018-05-17 20:40:11takluyversetpull_requests: + pull_request6616
2018-05-17 20:28:39mbussonnsetnosy: + mbussonn
messages: + msg316982
2018-03-11 09:22:55takluyversetnosy: + takluyver
messages: + msg313591
2015-10-05 07:35:53martin.pantersetmessages: + msg252309
2015-10-05 06:04:47serhiy.storchakasetfiles: + tokenize_str_2.diff
2015-10-05 06:01:59serhiy.storchakasetmessages: + msg252305
versions: + Python 3.6, - Python 3.4
2015-10-05 03:27:56martin.pantersetnosy: + martin.panter

messages: + msg252300
stage: patch review
2013-02-04 17:06:59r.david.murraylinkissue17125 superseder
2012-12-29 05:23:22meador.ingesetnosy: + meador.inge
messages: + msg178473
2012-10-15 20:48:44serhiy.storchakasetfiles: + tokenize_str.diff
versions: + Python 3.4, - Python 3.3
nosy: + serhiy.storchaka

messages: + msg173001

keywords: + patch
2012-10-14 04:15:32eric.snowsetnosy: terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, eric.snow, petri.lehtinen
2011-07-09 20:53:49eric.snowsetnosy: + eric.snow
2011-07-09 09:03:46vstinnersetmessages: + msg140055
2011-07-09 05:34:16terry.reedysetnosy: + terry.reedy
messages: + msg140050
2011-07-08 17:49:51petri.lehtinensetnosy: + petri.lehtinen
2011-07-04 16:23:57eric.araujosetnosy: + vstinner, eric.araujo

type: enhancement
versions: + Python 3.3
2011-07-04 03:58:16Devin Jeanpierrecreate