tokenize module should have a unicode API #56695

ssbr · 2011-07-04T03:58:17Z

BPO	12486
Nosy	@warsaw, @terryjreedy, @mdickinson, @vstinner, @ssbr, @tpn, @merwok, @voidspace, @meadori, @ericsnowcurrently, @takluyver, @akheron, @vadmium, @serhiy-storchaka, @willingc, @Carreau
PRs	bpo-12486: Document tokenize.generate_tokens() as public API #6957
Files	tokenize_str.diff tokenize_str_2.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-06-05.17:30:56.930>
created_at = <Date 2011-07-04.03:58:16.983>
labels = ['3.8', 'type-feature', 'library']
title = 'tokenize module should have a unicode API'
updated_at = <Date 2018-06-05.17:51:27.955>
user = 'https://github.com/ssbr'

bugs.python.org fields:

activity = <Date 2018-06-05.17:51:27.955>
actor = 'takluyver'
assignee = 'none'
closed = True
closed_date = <Date 2018-06-05.17:30:56.930>
closer = 'willingc'
components = ['Library (Lib)']
creation = <Date 2011-07-04.03:58:16.983>
creator = 'Devin Jeanpierre'
dependencies = []
files = ['27587', '40679']
hgrepos = []
issue_num = 12486
keywords = ['patch']
message_count = 19.0
messages = ['139733', '140050', '140055', '173001', '178473', '252300', '252305', '252309', '313591', '316982', '317004', '317010', '317011', '317018', '317020', '317021', '317912', '318775', '318778']
nosy_count = 16.0
nosy_names = ['barry', 'terry.reedy', 'mark.dickinson', 'vstinner', 'Devin Jeanpierre', 'trent', 'eric.araujo', 'michael.foord', 'meador.inge', 'eric.snow', 'takluyver', 'petri.lehtinen', 'martin.panter', 'serhiy.storchaka', 'willingc', 'mbussonn']
pr_nums = ['6957']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12486'
versions = ['Python 3.8']

ssbr · 2011-07-04T03:58:16Z

tokenize only deals with bytes. Users might want to deal with unicode source (for example, if python source is embedded into a document with an already-known encoding).

The naive approach might be something like:

  def my_readline():
      return my_oldreadline().encode('utf-8')

But this doesn't work for python source that declares its encoding, which might be something other than utf-8. The only safe ways are to either manually add a coding line yourself (there are lots of ways, I picked a dumb one):

  def my_readline_safe(was_read=[]):
      if not was_read:
          was_read.append(True)can 
          return b'# coding: utf-8'
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)

Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

    tokenstream = tokenize.utokenize(my_oldreadline)

terryjreedy · 2011-07-09T05:34:15Z

Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like. The more recent iterator protocol would lets the source be anything. A modern tokenize function should accept an iterable of strings. This would include but not be limited to a file opened in text mode.

A related problem is that 'tokenize' is a convenience function that does several things bundled together.

Read lines as bytes from a file-like source.
Detect encoding.
Decode lines to strings.
Actually tokenize the strings to tokens.

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer be unbundled and exposed to users. I agree with this request. Any user that starts with actual Py3 code would benefit.

(Compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ...The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token,". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2.

vstinner · 2011-07-09T09:03:46Z

The compiler has a PyCF_SOURCE_IS_UTF8 flag: see compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE.

Patch tokenize to support Unicode is simple: use PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8.

Rewrite the parser to work directly on Unicode is much more complex and I don't think that we need that.

serhiy-storchaka · 2012-10-15T20:48:44Z

Patch to allow tokenize() accepts string is very simple, only 4 lines. But it requires a lot of documentation changes.

Then we can get rid of undocumented generate_tokens(). Note, stdlib an tools use only generate_tokens(), none uses tokenize(). Of course, it will be better if tokenize() will work with iterator protocol.

Here is a preliminary patch. I will be thankful for the help with the documentation and for the discussion.

Of course, it will be better if tokenize() will work with iterator protocol.

meadori · 2012-12-29T05:23:22Z

See also bpo-9969.

vadmium · 2015-10-05T03:27:56Z

I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

However I wonder if a separate function would be better for the text mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() changes its output type depending on the ENCODING token seems like bad design to me.

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?

serhiy-storchaka · 2015-10-05T06:01:59Z

Thank you for your review Martin.

Here is rebased patch that addresses Matin's comments.

I agree that having untokenize() changes its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps other issue.

vadmium · 2015-10-05T07:35:54Z

I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions.

takluyver · 2018-03-11T09:22:55Z

Why not just bless the existing generate_tokens() function as a public API

We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API.

Carreau · 2018-05-17T20:28:39Z

Why not just bless the existing generate_tokens() function as a public API,

Yes please, or just make the private _tokenize public under another name. The tokenize.tokenize method try to magically detect encoding which may be unnecessary.

serhiy-storchaka · 2018-05-18T07:01:30Z

The old generate_tokens() was renamed to tokenize() in bpo-719888 because the latter is a better name. Is "generate_tokens" considered a good name now?

takluyver · 2018-05-18T07:48:49Z

I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released.

If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API.

serhiy-storchaka · 2018-05-18T07:59:47Z

My concern is that we will have two functions with non-similar names (tokenize() and generate_tokens()) that does virtually the same, but accept different types of input (bytes or str), and the single function untokenize() that produces different type of result depending on the value of input. This doesn't look like a good design to me.

takluyver · 2018-05-18T08:21:23Z

I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests.

This and bpo-9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle.

vadmium · 2018-05-18T08:53:49Z

Don’t forget about updating __all__.

takluyver · 2018-05-18T08:56:09Z

Thanks - I had forgotten it, just fixed it now.

takluyver · 2018-05-28T20:09:07Z

The tests on PR bpo-6957 are passing now, if anyone has time to have a look. :-)

willingc · 2018-06-05T17:26:42Z

New changeset c56b17b by Carol Willing (Thomas Kluyver) in branch 'master':
bpo-12486: Document tokenize.generate_tokens() as public API (bpo-6957)
c56b17b

takluyver · 2018-06-05T17:51:28Z

Thanks Carol :-)

ssbr mannequin added the stdlib Python modules in the Lib dir label Jul 4, 2011

merwok added the type-feature A feature request or enhancement label Jul 4, 2011

serhiy-storchaka added the 3.8 only security fixes label May 18, 2018

willingc closed this as completed Jun 5, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tokenize module should have a unicode API #56695

tokenize module should have a unicode API #56695

ssbr mannequin commented Jul 4, 2011

ssbr mannequin commented Jul 4, 2011

terryjreedy commented Jul 9, 2011

vstinner commented Jul 9, 2011

serhiy-storchaka commented Oct 15, 2012

meadori commented Dec 29, 2012

vadmium commented Oct 5, 2015

serhiy-storchaka commented Oct 5, 2015

vadmium commented Oct 5, 2015

takluyver mannequin commented Mar 11, 2018

Carreau mannequin commented May 17, 2018

serhiy-storchaka commented May 18, 2018

takluyver mannequin commented May 18, 2018

serhiy-storchaka commented May 18, 2018

takluyver mannequin commented May 18, 2018

vadmium commented May 18, 2018

takluyver mannequin commented May 18, 2018

takluyver mannequin commented May 28, 2018

willingc commented Jun 5, 2018

takluyver mannequin commented Jun 5, 2018

tokenize module should have a unicode API #56695

tokenize module should have a unicode API #56695

Comments

ssbr mannequin commented Jul 4, 2011

ssbr mannequin commented Jul 4, 2011

terryjreedy commented Jul 9, 2011

vstinner commented Jul 9, 2011

serhiy-storchaka commented Oct 15, 2012

meadori commented Dec 29, 2012

vadmium commented Oct 5, 2015

serhiy-storchaka commented Oct 5, 2015

vadmium commented Oct 5, 2015

takluyver mannequin commented Mar 11, 2018

Carreau mannequin commented May 17, 2018

serhiy-storchaka commented May 18, 2018

takluyver mannequin commented May 18, 2018

serhiy-storchaka commented May 18, 2018

takluyver mannequin commented May 18, 2018

vadmium commented May 18, 2018

takluyver mannequin commented May 18, 2018

takluyver mannequin commented May 28, 2018

willingc commented Jun 5, 2018

takluyver mannequin commented Jun 5, 2018