Author martin.panter
Recipients Devin Jeanpierre, eric.araujo, eric.snow, martin.panter, meador.inge, petri.lehtinen, serhiy.storchaka, terry.reedy, vstinner
Date 2015-10-05.03:27:56
Message-id <1444015676.89.0.631468630308.issue12486@psf.upfronthosting.co.za>
In-reply-to
Content
I agree it would be very useful to be able to tokenize arbitrary text without worrying about ENCODING tokens. I left some suggestions for the documentation changes. Some test cases for it would also be good.

However, I wonder if a separate function would be better for the text mode tokenization. It would make it clearer when an ENCODING token is expected and when it isn’t, and would avoid any confusion about what happens when readline() returns a byte string one time and a text string another time. Also, having untokenize() change its output type depending on the ENCODING token seems like bad design to me.
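To illustrate the untokenize() concern, here is a minimal sketch using only the standard tokenize module. The sample source string and the variable names are made up for the illustration; the point is only that the return type follows the presence of the leading ENCODING token.

```python
import io
import tokenize

# Hypothetical sample input, chosen only for illustration.
source = "x = 1\n"

# Byte mode: tokenize.tokenize() wants a readline() that returns bytes,
# and the first token it yields is ENCODING.
byte_tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

# Same tokens with the leading ENCODING token dropped.
tokens_without_encoding = byte_tokens[1:]

# untokenize() returns bytes when it sees an ENCODING token first,
# and str when it does not.
with_encoding = tokenize.untokenize(byte_tokens)
without_encoding = tokenize.untokenize(tokens_without_encoding)

print(type(with_encoding).__name__)
print(type(without_encoding).__name__)
```

So the same call gives a different type depending on one token at the start of the stream, which is the part that looks fragile to me.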

Why not just bless the existing generate_tokens() function as a public API, perhaps renaming it to something clearer like tokenize_text() or tokenize_text_lines() at the same time?
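As a rough sketch of what that would look like from the caller's side, assuming a Python 3 interpreter where tokenize.generate_tokens() accepts a readline() returning text strings (the sample source and variable names below are hypothetical):

```python
import io
import tokenize

# Hypothetical sample input, chosen only for illustration.
source = "x = 1\n"

# Byte mode: readline() returns bytes, and an ENCODING token comes first.
byte_tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

# Text mode: readline() returns str, and no ENCODING token is produced.
text_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

byte_types = [tokenize.tok_name[tok.type] for tok in byte_tokens]
text_types = [tokenize.tok_name[tok.type] for tok in text_tokens]

print(byte_types)
print(text_types)
```

With two separate entry points, the caller knows up front whether to expect an ENCODING token, rather than having it depend on what readline() happens to return.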
History
Date User Action Args
2015-10-05 03:27:57 martin.panter set recipients: + martin.panter, terry.reedy, vstinner, Devin Jeanpierre, eric.araujo, meador.inge, eric.snow, petri.lehtinen, serhiy.storchaka
2015-10-05 03:27:56 martin.panter set messageid: <1444015676.89.0.631468630308.issue12486@psf.upfronthosting.co.za>
2015-10-05 03:27:56 martin.panter link issue12486 messages
2015-10-05 03:27:56 martin.panter create