classification
Title: Doctest sees directives in strings when it should only see them in comments
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Devin Jeanpierre, benjamin.peterson, petri.lehtinen, r.david.murray, tim.peters
Priority: normal Keywords: patch

Created on 2011-04-22 19:31 by Devin Jeanpierre, last changed 2011-07-04 03:45 by Devin Jeanpierre.

Files
File name Uploaded Description
comments.diff Devin Jeanpierre, 2011-04-22 19:31 patch to tip
comments2.diff Devin Jeanpierre, 2011-07-03 23:15
comments3.diff Devin Jeanpierre, 2011-07-04 03:45
Messages (8)
msg134278 - Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-04-22 19:31
From the doctest source:

'Option directives are comments starting with "doctest:".  Warning: this may give false positives for string-literals that contain the string "#doctest:".  Eliminating these false positives would require actually parsing the string; but we limit them by ignoring any line containing "#doctest:" that is *followed* by a quote mark.'

This isn't a huge deal, but it's a bit annoying. Besides being confusing, it contradicts the doctest documentation, which states:

'Doctest directives are expressed as a special Python comment following an example’s source code'

No mention is made of this corner case where the regexp breaks.
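To illustrate the heuristic and where it fails (a sketch using a simplified stand-in for the directive pattern, not necessarily the exact regex in Lib/doctest.py):

```python
import re

# Simplified stand-in for doctest's directive pattern (assumption: the
# real regex differs in detail).  The [^\n'"]* part is the quote-mark
# heuristic described in the source comment above.
directive = re.compile(r'#\s*doctest:\s*([^\n\'"]*)$', re.MULTILINE)

# Suppressed: a quote mark follows the directive text on the same line.
print(directive.search('s = "#doctest: +SKIP"\n'))  # None

# False positive: inside a triple-quoted string, the line containing
# "#doctest:" has no trailing quote mark, so the heuristic lets it through.
print(directive.search('s = """\n#doctest: +SKIP\n"""\n'))  # a match
```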

As suggested by the comment in the source, the patched version parses the source with the tokenize module and runs a modified directive regex on each comment token to find directives.
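The approach can be sketched as follows (a rough illustration of the idea, not the patch code itself; find_directive_comments is a hypothetical helper):

```python
import io
import tokenize

def find_directive_comments(source):
    """Yield (lineno, text) for each comment token containing 'doctest:'."""
    # tokenize.tokenize wants a bytes-returning readline, so encode the
    # fragment as UTF-8 (tokenize will sniff any coding cookie from it).
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.COMMENT and 'doctest:' in tok.string:
            yield tok.start[0], tok.string

src = 'x = "#doctest: +SKIP"  # doctest: +ELLIPSIS\n'
# Only the real comment is reported; the string literal is ignored.
print(list(find_directive_comments(src)))
```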
msg138780 - Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-06-21 11:10
The patch looks good to me. It passes the old doctest tests and adds a new test case for what it's fixing.
msg138886 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 01:45
For the most part the patch looks good to me, too.  My one concern is the encoding.  tokenize detects the encoding... is it possible for a doctest fragment to be detected as some encoding other than UTF-8?
msg138893 - Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-06-24 08:40
You're right, and good catch. If a doctest starts with a "#coding:XXX" line, this would break.
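This is easy to see with tokenize's encoding sniffer (a minimal demonstration; the exact normalized encoding name it returns is an implementation detail):

```python
import io
import tokenize

# A fragment that starts with a coding cookie: the bytes interface
# honors the cookie, so the fragment is no longer read as UTF-8.
fragment = b'# coding: latin-1\nx = 1\n'
encoding, _ = tokenize.detect_encoding(io.BytesIO(fragment).readline)
print(encoding)  # something other than 'utf-8'
```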

One option is to replace the call to tokenize.tokenize with a call to tokenize._tokenize, passing 'utf-8' as the encoding. Downside: that's a private and undocumented API. The alternative is to manually prepend a coding line that specifies UTF-8, so that any coding line in the doctest would be ignored.

My preferred option would be to add the ability to read unicode to the tokenize API, and then use that. I can file a separate ticket if that sounds good, since it's probably useful to others too.
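For what it's worth, tokenize's str-based entry point already behaves this way: tokenize.generate_tokens takes a readline returning str, skips encoding detection entirely, and treats a coding cookie as an ordinary comment. A sketch:

```python
import io
import tokenize

# With the str interface there is no encoding sniffing: the coding
# cookie is tokenized as just another comment.
fragment = '# coding: latin-1\ns = "text"\n'
comments = [tok.string
            for tok in tokenize.generate_tokens(io.StringIO(fragment).readline)
            if tok.type == tokenize.COMMENT]
print(comments)  # the cookie appears as an ordinary comment token
```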

One other thing to be worried about -- I'm not sure how doctest would treat tests with leading "coding:XXX" lines. I'd hope it ignores them; if it doesn't, this is more complicated and the above approach wouldn't work.

I'll see if I have the time to play around with this (and add more test cases to the patch, correspondingly) this weekend.
msg138948 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 14:16
I agree that having a unicode API for tokenize seems to make sense, and that would indeed require a separate issue.

That's a good point about doctest not otherwise supporting coding cookies.  Those really only apply to source files, so no doctest fragment ought to contain a coding cookie at the start, and your patch ought to be fine.  But I'm not familiar with the doctest internals, so having some tests to prove everything is fine would be great.

Your code could use the tokenize sniffer to make sure the fragment reads as UTF-8 and throw an error otherwise.  But a unicode interface to tokenize would probably be cleaner, since I suspect it would mimic what doctest otherwise does (ignore coding cookies).  I don't *know* the latter, though, so your checking it would be appreciated.
msg139715 - Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-03 23:15
Updated the patch to the newest revision and to use the _tokenize function; it now includes a test case verifying that the encoding directive is ignored during the tokenization (and every other) step.

I'll file a tokenize bug separately.
msg139721 - Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 00:40
Erp, I forgot to run this against the rest of the tests. Disregard; I'll fix it up a bit later.
msg139732 - Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 03:45
Updated.
History
Date User Action Args
2011-07-04 03:45:34 Devin Jeanpierre set files: + comments3.diff; messages: + msg139732
2011-07-04 00:40:33 Devin Jeanpierre set messages: + msg139721
2011-07-03 23:15:50 Devin Jeanpierre set files: + comments2.diff; messages: + msg139715
2011-06-24 14:16:25 r.david.murray set messages: + msg138948
2011-06-24 08:40:44 Devin Jeanpierre set messages: + msg138893
2011-06-24 01:45:14 r.david.murray set nosy: + r.david.murray, benjamin.peterson; messages: + msg138886
2011-06-21 11:10:42 petri.lehtinen set nosy: + tim.peters, petri.lehtinen; messages: + msg138780
2011-04-22 19:47:07 r.david.murray set stage: patch review; type: enhancement; versions: + Python 3.3
2011-04-22 19:31:22 Devin Jeanpierre create