Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokenize module should have a unicode API #56695

Closed
ssbr mannequin opened this issue Jul 4, 2011 · 19 comments
Closed

tokenize module should have a unicode API #56695

ssbr mannequin opened this issue Jul 4, 2011 · 19 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@ssbr
Copy link
Mannequin

ssbr mannequin commented Jul 4, 2011

BPO 12486
Nosy @warsaw, @terryjreedy, @mdickinson, @vstinner, @ssbr, @tpn, @merwok, @voidspace, @meadori, @ericsnowcurrently, @takluyver, @akheron, @vadmium, @serhiy-storchaka, @willingc, @Carreau
PRs
  • bpo-12486: Document tokenize.generate_tokens() as public API #6957
  • Files
  • tokenize_str.diff
  • tokenize_str_2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-06-05.17:30:56.930>
    created_at = <Date 2011-07-04.03:58:16.983>
    labels = ['3.8', 'type-feature', 'library']
    title = 'tokenize module should have a unicode API'
    updated_at = <Date 2018-06-05.17:51:27.955>
    user = 'https://github.com/ssbr'

    bugs.python.org fields:

    activity = <Date 2018-06-05.17:51:27.955>
    actor = 'takluyver'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-06-05.17:30:56.930>
    closer = 'willingc'
    components = ['Library (Lib)']
    creation = <Date 2011-07-04.03:58:16.983>
    creator = 'Devin Jeanpierre'
    dependencies = []
    files = ['27587', '40679']
    hgrepos = []
    issue_num = 12486
    keywords = ['patch']
    message_count = 19.0
    messages = ['139733', '140050', '140055', '173001', '178473', '252300', '252305', '252309', '313591', '316982', '317004', '317010', '317011', '317018', '317020', '317021', '317912', '318775', '318778']
    nosy_count = 16.0
    nosy_names = ['barry', 'terry.reedy', 'mark.dickinson', 'vstinner', 'Devin Jeanpierre', 'trent', 'eric.araujo', 'michael.foord', 'meador.inge', 'eric.snow', 'takluyver', 'petri.lehtinen', 'martin.panter', 'serhiy.storchaka', 'willingc', 'mbussonn']
    pr_nums = ['6957']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue12486'
    versions = ['Python 3.8']

    @ssbr
    Copy link
    Mannequin Author

    ssbr mannequin commented Jul 4, 2011

    tokenize only deals with bytes. Users might want to deal with unicode source (for example, if python source is embedded into a document with an already-known encoding).

    The naive approach might be something like:

      def my_readline():
          return my_oldreadline().encode('utf-8')

    But this doesn't work for python source that declares its encoding, which might be something other than utf-8. The only safe ways are to either manually add a coding line yourself (there are lots of ways, I picked a dumb one):

      def my_readline_safe(was_read=[]):
          if not was_read:
              was_read.append(True)can 
              return b'# coding: utf-8'
          return my_oldreadline().encode('utf-8')
    
      tokenstream = tokenize.tokenize(my_readline_safe)

    Or to use the same my_readline as before (no added coding line), but instead of passing it to tokenize.tokenize, you could pass it to the undocumented _tokenize function:

        tokenstream = tokenize._tokenize(my_readline, 'utf-8')

    Or, ideally, you'd just pass the original readline that produces unicode into a utokenize function:

        tokenstream = tokenize.utokenize(my_oldreadline)

    @ssbr ssbr mannequin added the stdlib Python modules in the Lib dir label Jul 4, 2011
    @merwok merwok added the type-feature A feature request or enhancement label Jul 4, 2011
    @terryjreedy
    Copy link
    Member

    Hmm. Python 3 code is unicode. "Python reads program text as Unicode code points." The tokenize module purports to provide "a lexical scanner for Python source code". But it seems not to do that. Instead it provides a scanner for Python code encoded as bytes, which is something different. So this is at least a doc update issue (which affects 2.7/3.2 also). Another doc issue is given below.

    A deeper problem is that tokenize uses the semi-obsolete readline protocol, which probably dates to 1.0 and which expects the source to be a file or file-like. The more recent iterator protocol would lets the source be anything. A modern tokenize function should accept an iterable of strings. This would include but not be limited to a file opened in text mode.

    A related problem is that 'tokenize' is a convenience function that does several things bundled together.

    1. Read lines as bytes from a file-like source.
    2. Detect encoding.
    3. Decode lines to strings.
    4. Actually tokenize the strings to tokens.

    I understand this feature request to be a request that function 4, the actual Python 3 code tokenizer be unbundled and exposed to users. I agree with this request. Any user that starts with actual Py3 code would benefit.

    (Compile() is another function that bundles a tokenizer.)

    Back to the current doc and another doc problem. The entry for untokenize() says "Converts tokens back into Python source code. ...The reconstructed script is returned as a single string." That would be nice if true, but I am going to guess it is not, as the entry continues "It returns bytes, encoded using the ENCODING token,". In Py3, string != bytes, so this seems an incomplete doc conversion from Py2.

    @vstinner
    Copy link
    Member

    vstinner commented Jul 9, 2011

    The compiler has a PyCF_SOURCE_IS_UTF8 flag: see compile() builtin. The parser has a flag to ignore the coding cookie: PyPARSE_IGNORE_COOKIE.

    Patch tokenize to support Unicode is simple: use PyCF_SOURCE_IS_UTF8 and/or PyPARSE_IGNORE_COOKIE flags and encode the strings to UTF-8.

    Rewrite the parser to work directly on Unicode is much more complex and I don't think that we need that.

    @serhiy-storchaka
    Copy link
    Member

    Patch to allow tokenize() accepts string is very simple, only 4 lines. But it requires a lot of documentation changes.

    Then we can get rid of undocumented generate_tokens(). Note, stdlib an tools use only generate_tokens(), none uses tokenize(). Of course, it will be better if tokenize() will work with iterator protocol.

    Here is a preliminary patch. I will be thankful for the help with the documentation and for the discussion.

    Of course, it will be better if tokenize() will work with iterator protocol.

    @meadori
    Copy link
    Member

    meadori commented Dec 29, 2012

    See also bpo-9969.

    @vadmium
    Copy link
    Member

    vadmium commented Oct 5, 2015

    I agree it would be very useful to be able to tokenize arbitrary text without worrying about encoding tokens. I left some suggestions for the documentation changes. Also some test cases for it would be good.

    However I wonder if a separate function would be better for the text mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() changes its output type depending on the ENCODING token seems like bad design to me.

    Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?

    @serhiy-storchaka
    Copy link
    Member

    Thank you for your review Martin.

    Here is rebased patch that addresses Matin's comments.

    I agree that having untokenize() changes its output type depending on the ENCODING token is bad design and we should change this. But this is perhaps other issue.

    @vadmium
    Copy link
    Member

    vadmium commented Oct 5, 2015

    I didn’t notice that this dual untokenize() behaviour already existed. Taking that into account weakens my argument for having separate text and bytes tokenize() functions.

    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented Mar 11, 2018

    Why not just bless the existing generate_tokens() function as a public API

    We're actually using generate_tokens() from IPython - we wanted a way to tokenize unicode strings, and although it's undocumented, it's been there for a number of releases and does what we want. So +1 to promoting it to a public API.

    In fact, at the moment, IPython has its own copy of tokenize to fix one or two old issues. I'm trying to get rid of that and use the stdlib module again, which is how I came to notice that we're using an undocumented API.

    @Carreau
    Copy link
    Mannequin

    Carreau mannequin commented May 17, 2018

    Why not just bless the existing generate_tokens() function as a public API,

    Yes please, or just make the private _tokenize public under another name. The tokenize.tokenize method try to magically detect encoding which may be unnecessary.

    @serhiy-storchaka
    Copy link
    Member

    The old generate_tokens() was renamed to tokenize() in bpo-719888 because the latter is a better name. Is "generate_tokens" considered a good name now?

    @serhiy-storchaka serhiy-storchaka added the 3.8 only security fixes label May 18, 2018
    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented May 18, 2018

    I wouldn't say it's a good name, but I think the advantage of documenting an existing name outweighs that. We can start (or continue) using generate_tokens() right away, whereas a new name presumably wouldn't be available until Python 3.8 comes out. And we usually don't require a new Python version until a couple of years after it is released.

    If we want to add better names or clearer APIs on top of this, great. But I don't want that discussion to hold up the simple step of committing to keep the existing API.

    @serhiy-storchaka
    Copy link
    Member

    My concern is that we will have two functions with non-similar names (tokenize() and generate_tokens()) that does virtually the same, but accept different types of input (bytes or str), and the single function untokenize() that produces different type of result depending on the value of input. This doesn't look like a good design to me.

    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented May 18, 2018

    I agree, it's not a good design, but it's what's already there; I just want to ensure that it won't be removed without a deprecation cycle. My PR makes no changes to behaviour, only to documentation and tests.

    This and bpo-9969 have both been around for several years. A new tokenize API is clearly not at the top of anyone's priority list - and that's fine. I'd rather have *some* unicode API today than a promise of a nice unicode API in the future. And it doesn't preclude adding a better API later, it just means that the existing API would have to have a deprecation cycle.

    @vadmium
    Copy link
    Member

    vadmium commented May 18, 2018

    Don’t forget about updating __all__.

    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented May 18, 2018

    Thanks - I had forgotten it, just fixed it now.

    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented May 28, 2018

    The tests on PR bpo-6957 are passing now, if anyone has time to have a look. :-)

    @willingc
    Copy link
    Contributor

    willingc commented Jun 5, 2018

    New changeset c56b17b by Carol Willing (Thomas Kluyver) in branch 'master':
    bpo-12486: Document tokenize.generate_tokens() as public API (bpo-6957)
    c56b17b

    @willingc willingc closed this as completed Jun 5, 2018
    @takluyver
    Copy link
    Mannequin

    takluyver mannequin commented Jun 5, 2018

    Thanks Carol :-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants