
Author r.david.murray
Recipients Devin Jeanpierre, benjamin.peterson, petri.lehtinen, r.david.murray, tim.peters
Date 2011-06-24.14:16:25
SpamBayes Score 6.087747e-11
Marked as misclassified No
Message-id <1308924985.95.0.24293192739.issue11909@psf.upfronthosting.co.za>
In-reply-to
Content
I agree that having a unicode API for tokenize seems to make sense, and that would indeed require a separate issue.

That's a good point about doctest not otherwise supporting coding cookies.  Those really only apply to source files, so no doctest fragment ought to contain a coding cookie at the start, and your patch ought to be fine.  But I'm not familiar with the doctest internals, so having some tests to prove everything works would be great.

Your code could use the tokenize sniffer to make sure the fragment reads as utf-8 and raise an error otherwise.  But using a unicode interface to tokenize would probably be cleaner, since I suspect it would mimic what doctest otherwise does (ignore coding cookies).  But I don't *know* that, so your checking it would be appreciated.
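For reference, a minimal sketch of the sniffer approach described above, using tokenize.detect_encoding (the helper name check_fragment_is_utf8 is hypothetical, not anything in the patch):

```python
import io
import tokenize

def check_fragment_is_utf8(fragment_bytes):
    # Hypothetical helper: detect_encoding reads up to two lines looking
    # for a PEP 263 coding cookie or a UTF-8 BOM, defaulting to utf-8
    # when neither is present.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(fragment_bytes).readline)
    if encoding not in ("utf-8", "utf-8-sig"):
        raise ValueError(
            "doctest fragment declares a non-utf-8 encoding: %s" % encoding)
    return encoding
```

A fragment with no cookie passes (detect_encoding defaults to utf-8), while one starting with, say, a latin-1 cookie would raise.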
History
Date User Action Args
2011-06-24 14:16:26  r.david.murray  set  recipients: + r.david.murray, tim.peters, benjamin.peterson, Devin Jeanpierre, petri.lehtinen
2011-06-24 14:16:25  r.david.murray  set  messageid: <1308924985.95.0.24293192739.issue11909@psf.upfronthosting.co.za>
2011-06-24 14:16:25  r.david.murray  link  issue11909 messages
2011-06-24 14:16:25  r.david.murray  create