classification
Title: tokenize module should have a unicode API
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Devin Jeanpierre, barry, eric.araujo, eric.snow, mark.dickinson, martin.panter, mbussonn, meador.inge, michael.foord, petri.lehtinen, serhiy.storchaka, takluyver, terry.reedy, trent, vstinner, willingc
Priority: normal Keywords: patch

Created on 2011-07-04 03:58 by Devin Jeanpierre, last changed 2018-06-05 17:51 by takluyver. This issue is now closed.

Files
File name Uploaded Description Edit
tokenize_str.diff serhiy.storchaka, 2012-10-15 20:48 review
tokenize_str_2.diff serhiy.storchaka, 2015-10-05 06:04 review
Pull Requests
URL Status Linked Edit
PR 6957 merged takluyver, 2018-05-17 20:40
Messages (19)
msg139733 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 03:58
tokenize only deals with bytes. Users might want to deal with unicode source (for example, if Python source is embedded in a document with an already-known encoding).

The naive approach might be something like:

  def my_readline():
      return my_oldreadline().encode('utf-8')

But this doesn't work for Python source that declares its encoding, which might be something other than utf-8. The only safe ways are either to manually add a coding line yourself (there are lots of ways; I picked a dumb one):

  def my_readline_safe(was_read=[]):
      if not was_read:
          was_read.append(True)
          return b'# coding: utf-8\n'
      return my_oldreadline().encode('utf-8')
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)

Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

    tokenstream = tokenize.utokenize(my_oldreadline)
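(For reference, a minimal sketch of what this looks like with the API that was eventually blessed by this issue: generate_tokens() accepts a readline callable that yields str, so no encoding juggling is needed. The source string here is illustrative.)

```python
import io
import tokenize

source = "x = 1\n"
# generate_tokens() takes a readline callable returning str,
# so no manual encoding or coding-cookie handling is required.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tok.string for tok in tokens if tok.type == tokenize.NAME])  # ['x']
```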
msg140050 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-07-09 05:34
Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like object. The more recent iterator protocol would let the source be anything. A modern tokenize function should accept an iterable of strings. This would include, but not be limited to, a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does several things bundled together.
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer, be unbundled and exposed to users. I agree with this request. Any user who starts with actual Py3 code would benefit.

(Compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ...The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token,". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2.
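(The four bundled steps can be reproduced by hand with public pieces of the module — tokenize.detect_encoding() and generate_tokens() are real APIs; the byte string here is an illustrative stand-in for a file opened in binary mode.)

```python
import io
import tokenize

raw = b"# coding: utf-8\ns = 'caf\xc3\xa9'\n"
buf = io.BytesIO(raw)

# Step 2: detect the encoding from the coding cookie (or BOM).
encoding, first_lines = tokenize.detect_encoding(buf.readline)

# Steps 1+3: read the remaining bytes and decode everything to str.
text = b"".join(first_lines + [buf.read()]).decode(encoding)

# Step 4: actually tokenize the decoded text.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
print(encoding)  # utf-8
```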
msg140055 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-09 09:03
The compiler has a PyCF_SOURCE_IS_UTF8 flag: see compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE.

Patching tokenize to support Unicode is simple: use the PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8.

Rewriting the parser to work directly on Unicode is much more complex, and I don't think that we need that.
msg173001 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-15 20:48
The patch to allow tokenize() to accept strings is very simple, only 4 lines. But it requires a lot of documentation changes.

Then we can get rid of the undocumented generate_tokens(). Note that the stdlib and tools use only generate_tokens(); none uses tokenize(). Of course, it would be better if tokenize() worked with the iterator protocol.

Here is a preliminary patch. I will be thankful for the help with the documentation and for the discussion.

msg178473 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2012-12-29 05:23
See also issue9969.
msg252300 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 03:27
I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

However, I wonder if a separate function would be better for the text-mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the ENCODING token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?
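(The distinction described here is observable today — a small sketch: bytes-mode tokenize() emits an ENCODING token first, while text-mode generate_tokens() does not.)

```python
import io
import tokenize

src = "x = 1\n"

btoks = list(tokenize.tokenize(io.BytesIO(src.encode("utf-8")).readline))
stoks = list(tokenize.generate_tokens(io.StringIO(src).readline))

print(btoks[0].type == tokenize.ENCODING)  # True: bytes mode starts with ENCODING
print(stoks[0].type == tokenize.ENCODING)  # False: text mode has no ENCODING token
```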
msg252305 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-10-05 06:01
Thank you for your review Martin.

Here is a rebased patch that addresses Martin's comments.

I agree that having untokenize() change its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps another issue.
msg252309 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 07:35
I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions.
msg313591 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-03-11 09:22
> Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API.
msg316982 - (view) Author: Matthias Bussonnier (mbussonn) * Date: 2018-05-17 20:28
> Why not just bless the existing generate_tokens() function as a public API, 

Yes please, or just make the private `_tokenize` public under another name. The `tokenize.tokenize` method tries to magically detect the encoding, which may be unnecessary.
msg317004 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-18 07:01
The old generate_tokens() was renamed to tokenize() in issue719888 because the latter is a better name. Is "generate_tokens" considered a good name now?
msg317010 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 07:48
I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released.

If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API.
msg317011 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-18 07:59
My concern is that we will have two functions with dissimilar names (tokenize() and generate_tokens()) that do virtually the same thing but accept different types of input (bytes or str), and a single function untokenize() that produces a different type of result depending on its input. This doesn't look like good design to me.
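(A sketch of the asymmetry in question: untokenize() returns bytes when the token stream starts with an ENCODING token, and str when it does not.)

```python
import io
import tokenize

src = "x = 1\n"

# Bytes path: tokenize() emits an ENCODING token, so untokenize() returns bytes.
btoks = tokenize.tokenize(io.BytesIO(src.encode("utf-8")).readline)
round_tripped_bytes = tokenize.untokenize(btoks)
print(type(round_tripped_bytes))  # <class 'bytes'>

# Text path: generate_tokens() has no ENCODING token, so untokenize() returns str.
stoks = tokenize.generate_tokens(io.StringIO(src).readline)
round_tripped_str = tokenize.untokenize(stoks)
print(type(round_tripped_str))  # <class 'str'>
```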
msg317018 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 08:21
I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests.

This and issue 9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle.
msg317020 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-05-18 08:53
Don’t forget about updating __all__.
msg317021 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-18 08:56
Thanks - I had forgotten it, just fixed it now.
msg317912 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-05-28 20:09
The tests on PR #6957 are passing now, if anyone has time to have a look. :-)
msg318775 - (view) Author: Carol Willing (willingc) * (Python committer) Date: 2018-06-05 17:26
New changeset c56b17bd8c7a3fd03859822246633d2c9586f8bd by Carol Willing (Thomas Kluyver) in branch 'master':
bpo-12486: Document tokenize.generate_tokens() as public API (#6957)
https://github.com/python/cpython/commit/c56b17bd8c7a3fd03859822246633d2c9586f8bd
msg318778 - (view) Author: Thomas Kluyver (takluyver) * Date: 2018-06-05 17:51
Thanks Carol :-)
History
Date User Action Args
2018-06-05 17:51:27takluyversetmessages: + msg318778
2018-06-05 17:30:56willingcsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2018-06-05 17:26:41willingcsetnosy: + willingc
messages: + msg318775
2018-05-28 20:09:06takluyversetmessages: + msg317912
2018-05-18 08:56:08takluyversetmessages: + msg317021
2018-05-18 08:53:49martin.pantersetmessages: + msg317020
2018-05-18 08:21:22takluyversetmessages: + msg317018
2018-05-18 07:59:47serhiy.storchakasetmessages: + msg317011
2018-05-18 07:48:49takluyversetmessages: + msg317010
2018-05-18 07:04:20serhiy.storchakasetnosy: + barry, mark.dickinson, trent, michael.foord

versions: + Python 3.8, - Python 3.6
2018-05-18 07:01:30serhiy.storchakasetmessages: + msg317004
2018-05-17 20:40:11takluyversetpull_requests: + pull_request6616
2018-05-17 20:28:39mbussonnsetnosy: + mbussonn
messages: + msg316982
2018-03-11 09:22:55takluyversetnosy: + takluyver
messages: + msg313591
2015-10-05 07:35:53martin.pantersetmessages: + msg252309
2015-10-05 06:04:47serhiy.storchakasetfiles: + tokenize_str_2.diff
2015-10-05 06:01:59serhiy.storchakasetmessages: + msg252305
versions: + Python 3.6, - Python 3.4
2015-10-05 03:27:56martin.pantersetnosy: + martin.panter

messages: + msg252300
stage: patch review
2013-02-04 17:06:59r.david.murraylinkissue17125 superseder
2012-12-29 05:23:22meador.ingesetnosy: + meador.inge
messages: + msg178473
2012-10-15 20:48:44serhiy.storchakasetfiles: + tokenize_str.diff
versions: + Python 3.4, - Python 3.3
nosy: + serhiy.storchaka

messages: + msg173001

keywords: + patch
2012-10-14 04:15:32eric.snowsetnosy: terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, eric.snow, petri.lehtinen
2011-07-09 20:53:49eric.snowsetnosy: + eric.snow
2011-07-09 09:03:46vstinnersetmessages: + msg140055
2011-07-09 05:34:16terry.reedysetnosy: + terry.reedy
messages: + msg140050
2011-07-08 17:49:51petri.lehtinensetnosy: + petri.lehtinen
2011-07-04 16:23:57eric.araujosetnosy: + vstinner, eric.araujo

type: enhancement
versions: + Python 3.3
2011-07-04 03:58:16Devin Jeanpierrecreate