Author terry.reedy
Recipients Devin Jeanpierre, eric.araujo, petri.lehtinen, terry.reedy, vstinner
Date 2011-07-09.05:34:15
Message-id <1310189656.58.0.933304419863.issue12486@psf.upfronthosting.co.za>
In-reply-to
Content
Hmm. Python 3 code is Unicode: "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code", but it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which also affects 2.7/3.2). Another doc issue is given below.

A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like object. The more recent iterator protocol would let the source be anything. A modern tokenize function should accept an iterable of strings. This would include, but not be limited to, a file opened in text mode.
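A minimal sketch of what accepting an iterable of strings could look like, built on the existing tokenize.generate_tokens (which takes a readline-style callable returning str). The name tokenize_lines is hypothetical and is not part of the tokenize module; the adapter simply returns an empty string once the iterable is exhausted, which is how the readline protocol signals end of input.

```python
import tokenize


def tokenize_lines(lines):
    """Tokenize any iterable of str lines (hypothetical helper, not a stdlib API)."""
    line_iter = iter(lines)
    # Adapt the iterator protocol to the readline protocol:
    # return the next line, or "" when the iterable runs out.
    return tokenize.generate_tokens(lambda: next(line_iter, ""))


# The source here is a plain list of strings, not a file.
tokens = list(tokenize_lines(["x = 1\n", "y = x + 2\n"]))
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]
```

A file opened in text mode is itself an iterable of strings, so it could be passed to such a helper unchanged.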

A related problem is that 'tokenize' is a convenience function that does several things bundled together.
1. Read lines as bytes from a file-like source.
2. Detect encoding.
3. Decode lines to strings.
4. Actually tokenize the strings to tokens.
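
The four steps above can be sketched separately with functions that already exist in the tokenize module. This is only an illustration under stated assumptions, not how tokenize.tokenize is implemented internally: the sample source is made up, and step 4 uses tokenize.generate_tokens as a stand-in for an unbundled string tokenizer.

```python
import io
import tokenize

# Made-up sample: UTF-8 encoded source held in a file-like bytes object.
source = io.BytesIO("name = 'café'\nprint(name)\n".encode("utf-8"))

# Steps 1 and 2: read the first line(s) as bytes and detect the encoding.
# detect_encoding returns the encoding plus the lines it already consumed.
encoding, consumed = tokenize.detect_encoding(source.readline)

# Step 1, continued: read the remaining lines as bytes.
raw_lines = consumed + source.readlines()

# Step 3: decode the byte lines to strings.
text_lines = [line.decode(encoding) for line in raw_lines]

# Step 4: tokenize the strings. "" signals end of input to the tokenizer.
line_iter = iter(text_lines)
tokens = list(tokenize.generate_tokens(lambda: next(line_iter, "")))
```

A user who already has decoded text only needs step 4, which is the point of the request.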

I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer, be unbundled and exposed to users. I agree with this request. Any user who starts with actual Py3 code would benefit.

(compile() is another function that bundles a tokenizer.)

Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ...The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token,". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2.
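
A quick way to check the guess about untokenize(): a small sketch, assuming a current CPython 3 release, that compares the return type for a token stream starting with an ENCODING token (from tokenize.tokenize) with one that has no such token (from tokenize.generate_tokens). The sample source is made up.

```python
import io
import tokenize

# Token stream from bytes input: the first token is ENCODING.
byte_tokens = list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
bytes_result = tokenize.untokenize(byte_tokens)

# Token stream from str input: no ENCODING token is produced.
str_tokens = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
str_result = tokenize.untokenize(str_tokens)

# Compare the types of the two results.
result_types = (type(bytes_result), type(str_result))
```

With an ENCODING token present the result is bytes; without one it is a str, so "a single string" in the doc entry only describes the second case.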
History
Date User Action Args
2011-07-09 05:34:16 terry.reedy set recipients: + terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, petri.lehtinen
2011-07-09 05:34:16 terry.reedy set messageid: <1310189656.58.0.933304419863.issue12486@psf.upfronthosting.co.za>
2011-07-09 05:34:16 terry.reedy link issue12486 messages
2011-07-09 05:34:15 terry.reedy create