Author Devin Jeanpierre
Recipients Devin Jeanpierre
Date 2011-07-04.03:58:16
Message-id <1309751897.67.0.332272921906.issue12486@psf.upfronthosting.co.za>
In-reply-to
Content
tokenize only accepts bytes. Users might want to tokenize unicode source instead (for example, when Python source is embedded in a document whose encoding is already known and which has already been decoded).

The naive approach might be something like:

  def my_readline():
      return my_oldreadline().encode('utf-8')

But this doesn't work for Python source that declares its own encoding, which might be something other than UTF-8: tokenize honors the declaration and decodes the UTF-8 bytes with the wrong codec. The only safe options are to add a coding line yourself (there are lots of ways to do this; I picked a dumb one):

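  import tokenize

  def my_readline_safe(was_read=[]):
      # The mutable default argument persists across calls, so it
      # records whether the coding line has already been returned.
      if not was_read:
          was_read.append(True)
          # readline must return whole lines, newline included.
          return b'# coding: utf-8\n'
      return my_oldreadline().encode('utf-8')

  tokenstream = tokenize.tokenize(my_readline_safe)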
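
To make the difference concrete, here is a minimal, self-contained sketch comparing the naive reader with one that emits a coding line first. The sample source and the helper names (SOURCE, string_tokens, naive_readline_for, safe_readline_for) are hypothetical and exist only for illustration; io.StringIO stands in for whatever produces the unicode lines.

```python
import io
import tokenize

# Hypothetical unicode source that declares a non-UTF-8 encoding.
SOURCE = "# coding: latin-1\nname = '\u00e9'\n"

def string_tokens(readline):
    """Return the text of every STRING token produced from readline."""
    return [tok.string for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]

def naive_readline_for(text):
    """Encode each unicode line as UTF-8, with no added coding line."""
    unicode_readline = io.StringIO(text).readline
    return lambda: unicode_readline().encode('utf-8')

def safe_readline_for(text):
    """Like naive_readline_for, but emit a UTF-8 coding line first."""
    def lines():
        yield b'# coding: utf-8\n'
        for line in io.StringIO(text):
            yield line.encode('utf-8')
    it = lines()
    # Return b'' once the lines run out, which signals end of input.
    return lambda: next(it, b'')

naive = string_tokens(naive_readline_for(SOURCE))
safe = string_tokens(safe_readline_for(SOURCE))
```

In this sketch, naive holds the string literal with its accented character mis-decoded as two Latin-1 characters, while safe holds the original literal. One cost of the added coding line is that every reported row number is shifted down by one.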

Or you can use the same my_readline as before (no added coding line) and, instead of passing it to tokenize.tokenize, pass it to the undocumented _tokenize function along with the encoding. Since _tokenize is private, nothing guarantees it will keep this signature:

    tokenstream = tokenize._tokenize(my_readline, 'utf-8')

Or, ideally, tokenize would provide a utokenize function, and you would pass it the original readline that produces unicode directly:

    tokenstream = tokenize.utokenize(my_oldreadline)
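
For readers on a current Python 3, a rough sketch of this last option: the documented tokenize.generate_tokens accepts a readline that returns str, which is close to what utokenize describes here. The sample source and the io.StringIO reader are illustrative assumptions, not part of the original report.

```python
import io
import tokenize

# Hypothetical unicode source; its coding comment names an encoding
# that was never actually used to encode these already-decoded lines.
source = "# coding: latin-1\nname = '\u00e9'\n"
my_oldreadline = io.StringIO(source).readline

# generate_tokens reads str lines, so no encoding detection is involved
# and no ENCODING token is emitted.
tokens = list(tokenize.generate_tokens(my_oldreadline))
strings = [tok.string for tok in tokens if tok.type == tokenize.STRING]
```

Because the lines never pass through bytes, the declared encoding in the comment has no effect on the resulting token text.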
History
Date                 User              Action  Args
2011-07-04 03:58:17  Devin Jeanpierre  set     recipients: + Devin Jeanpierre
2011-07-04 03:58:17  Devin Jeanpierre  set     messageid: <1309751897.67.0.332272921906.issue12486@psf.upfronthosting.co.za>
2011-07-04 03:58:16  Devin Jeanpierre  link    issue12486 messages
2011-07-04 03:58:16  Devin Jeanpierre  create