Issue 2811: doctest doesn't treat unicode literals as specified by the file declared encoding

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47060

classification

Title:	doctest doesn't treat unicode literals as specified by the file declared encoding
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	eric.araujo, kmtracey, neves, terry.reedy
Priority:	normal	Keywords:

Created on 2008-05-10 18:47 by neves, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
doctesteerror.py	neves, 2008-05-10 18:47	A small test that reproduces the error. The output is different running the function from doctest and from the program.

Messages (3)
msg66559 - (view)	Author: Paulo Eduardo Neves (neves)	Date: 2008-05-10 18:47
Doctest doesn't obey the specified file encoding for unicode literals.

I've put the minimum test case that demonstrates the error in the attached file. The program has # -*- coding: utf-8 -*- as its first line and is saved in this encoding. My computer environment is configured as iso8859-1. Doctest ignores the file encoding specification and interprets the u'á' as u'Ã¡' (the utf-8 text decoded as iso8859-1).

I've reproduced this error in Python 2.5 on Linux and Windows.

This is the output of the program, which runs the function normalize from inside doctest and directly from Python. The two show different results.

**********************************************************************
File "doctesteerror.py", line 7, in __main__.normalize
Failed example:
    normalize(u'Ã¡')
Expected:
    u'b'
Got:
    u'\xc3\xa1'
**********************************************************************
1 items had failures:
   1 of 1 in __main__.normalize
***Test Failed*** 1 failures.
without doctest ===>>> b
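The mismatch reported above can be reproduced without doctest. The following is a minimal sketch in Python 3 syntax (the report itself used Python 2.5); the name ACCENTED and this standalone normalize are illustrative assumptions modelled on the attached test, not the attached file itself. It shows that the UTF-8 bytes of á, read as iso8859-1, become the two characters u'\xc3\xa1', which the translation table no longer matches.

```python
# Sketch in Python 3 syntax; names here are illustrative assumptions.
ACCENTED = "\u00e1"  # the character á


def normalize(s):
    # Same mapping as in the report: replace á with b.
    return s.translate({ord(ACCENTED): "b"})


utf8_bytes = ACCENTED.encode("utf-8")     # two bytes: 0xC3 0xA1
misread = utf8_bytes.decode("iso8859-1")  # each byte becomes one character

print(repr(normalize(ACCENTED)))  # 'b'
print(ascii(misread))             # '\xc3\xa1'
print(ascii(normalize(misread)))  # '\xc3\xa1' (unchanged: no á to replace)
```

The last line corresponds to the "Got" value in the doctest failure: the function receives two characters instead of one, so nothing is translated.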
msg70907 - (view)	Author: Karen Tracey (kmtracey)	Date: 2008-08-08 16:14
I believe the problem is in your test file, not doctest. The enclosing doctest string is not specified as a unicode literal, so the file encoding specification ultimately has no effect on it. At least that is how I read the documentation regarding the effect of the encoding declaration ("The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts.")

If you change the test file so that the string enclosing the test is a unicode literal, then the test passes:

user@gutsy:~/tmp$ cat test_iso-8859-15.py
# -*- coding: utf-8 -*-
import doctest

def normalize(s):
    u"""
    >>> normalize(u'Ã¡')
    u'b'
    """
    return s.translate({ord(u'Ã¡'): u'b'})

doctest.testmod()
print 'without doctest ===>>>', normalize(u'Ã¡')

user@gutsy:~/tmp$ python test_iso-8859-15.py
without doctest ===>>> b

-----

There is a problem with this, though: doctest will now be unable to correctly report errors when there are output mismatches involving unicode strings with non-ASCII chars. For example, if you add an 'x' to the front of your unicode literal to be normalized, you'll get this when you try to run it:

user@gutsy:~/tmp$ python test_iso-8859-15.py
Traceback (most recent call last):
  File "test_iso-8859-15.py", line 12, in <module>
    doctest.testmod()
  File "/usr/lib/python2.5/doctest.py", line 1799, in testmod
    runner.run(test)
  File "/usr/lib/python2.5/doctest.py", line 1345, in run
    return self.__run(test, compileflags, out)
  File "/usr/lib/python2.5/doctest.py", line 1261, in __run
    self.report_failure(out, test, example, got)
  File "/usr/lib/python2.5/doctest.py", line 1125, in report_failure
    self._checker.output_difference(example, got, self.optionflags))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 149: ordinal not in range(128)
user@gutsy:~/tmp$

This issue is reported in #1293741, but there is no fix or guidance offered there on how to work around the problem.

I'd appreciate feedback on whether what I've said here is correct. I'm currently trying to diagnose and fix problems with the use of unicode literals in some tests, and this is as far as I've got. That is, I think I need to specify the enclosing strings as unicode literals, but then I run into #1293741. If the conclusion I've reached is correct, then figuring out a fix for that problem should be where I focus my efforts. If, however, I shouldn't be specifying the enclosing string as a unicode literal, then attempting to fix the problem as described here would perhaps be more useful. Though I do not see how the doctest code could know the file's encoding specification.
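The UnicodeEncodeError in the traceback above is the generic failure that occurs when text containing á is pushed through the ASCII codec. Below is a minimal sketch in Python 3 syntax, assuming a made-up failure message (the variable names are hypothetical and this is not doctest's internal code). The backslashreplace step is only one possible way to make such a message ASCII-safe; it is not something proposed in this thread or in #1293741.

```python
# Hypothetical failure-report text containing a non-ASCII character (á).
message = "Got: u'x\u00e1'"

try:
    message.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    # Same class of error as in the traceback: á is outside range(128).
    error = exc

print(error.encoding, "-", error.reason)

# One possible ASCII-safe rendering: escape what the codec cannot encode.
safe = message.encode("ascii", "backslashreplace").decode("ascii")
print(safe)
```

Escaping keeps the report printable on an ASCII-only stream at the cost of showing \xe1 instead of the character itself.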
msg112723 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-08-03 23:53
In 3.1.2, where the docstring is unicode, the doctest of normalize works fine, as Karen said. I think she is right: without the encoding being explicitly passed to doctest, it cannot affect how the sub-interpreter used by doctest compiles the strings as code. I am therefore closing this as invalid (or won't fix, or out-of-date). In any case, it strikes me as a feature request, and test modules, especially, should not be enhanced in bugfix releases. If one thinks of it as a bug, the bug was fixed in 3.0 but cannot be backported.
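As a rough illustration of the Python 3 behaviour described above, here is a sketch of the same normalize doctest in Python 3, where every docstring is already unicode and source files default to UTF-8. The helper run_doctests is a hypothetical convenience wrapper around doctest.DocTestFinder and doctest.DocTestRunner, not part of the original test file.

```python
import doctest


def normalize(s):
    """
    >>> normalize('á')
    'b'
    """
    return s.translate({ord('á'): 'b'})


def run_doctests(func):
    # Hypothetical helper: collect and run only the doctests attached to func.
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    globs = {func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=globs):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_doctests(normalize)
print(results.failed, results.attempted)  # 0 1
```

Because the docstring and the literal inside it are both text, no declared encoding has to be handed to doctest for the example to compile as intended.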

History
Date	User	Action	Args
2022-04-11 14:56:34	admin	set	github: 47060
2010-08-03 23:53:08	terry.reedy	set	status: open -> closed; type: behavior -> enhancement; versions: + Python 2.7, - Python 2.5; nosy: + terry.reedy; messages: + msg112723; resolution: not a bug
2009-11-27 23:59:38	eric.araujo	set	nosy: + eric.araujo
2008-08-08 16:14:24	kmtracey	set	nosy: + kmtracey; messages: + msg70907
2008-05-10 18:47:09	neves	create