classification
Title: doctest in python2.7 can't handle non-ascii characters
Type: behavior
Stage: resolved
Components: Library (Lib), Tests, Unicode
Versions: Python 2.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, eric.araujo, ezio.melotti, flox, hugo, vstinner
Priority: normal
Keywords:

Created on 2010-07-29 01:24 by hugo, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name       Uploaded                  Description
ascii.txt       hugo, 2010-07-29 01:24    not working doctest file
non-ascii.txt   hugo, 2010-07-29 01:24    working doctest file
example.py      hugo, 2010-07-29 01:25    runner file
Messages (4)
msg111881 - (view) Author: Hugo Lopes Tavares (hugo) * Date: 2010-07-29 01:24
My test suite fails under Python 2.7. It passes completely under Python 2.4, 2.5, 2.6 and 3.2a0, so I suspect a flaw in doctest.

Taking a look at the code, there is the following in doctest.py:1331:

            source = example.source.encode('ascii', 'backslashreplace')

The problem is that my doctest file contains non-ASCII characters, and that line fails on them:

hugo@hugo-laptop:~/issue$ python2.7 example.py 
non-ascii.txt
Doctest: non-ascii.txt ... ok
ascii.txt
Doctest: ascii.txt ... ERROR

======================================================================
ERROR: ascii.txt
Doctest: ascii.txt
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/doctest.py", line 2148, in runTest
    test, out=new.write, clear_globs=False)
  File "/usr/local/lib/python2.7/doctest.py", line 1382, in run
    return self.__run(test, compileflags, out)
  File "/usr/local/lib/python2.7/doctest.py", line 1272, in __run
    got += _exception_traceback(exc_info)
  File "/usr/local/lib/python2.7/doctest.py", line 244, in _exception_traceback
    traceback.print_exception(exc_type, exc_val, exc_tb, file=excout)
  File "/usr/local/lib/python2.7/traceback.py", line 125, in print_exception
    print_tb(tb, limit, file)
  File "/usr/local/lib/python2.7/traceback.py", line 69, in print_tb
    line = linecache.getline(filename, lineno, f.f_globals)
  File "/usr/local/lib/python2.7/linecache.py", line 14, in getline
    lines = getlines(filename, module_globals)
  File "/usr/local/lib/python2.7/doctest.py", line 1331, in __patched_linecache_getlines
    source = example.source.encode('ascii', 'backslashreplace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

----------------------------------------------------------------------
Ran 2 tests in 0.006s

FAILED (errors=1)
hugo@hugo-laptop:~/issue$ 


Looking more closely at doctest.py in Python 2.6 and 2.7, I noticed another inconsistency with filenames in both versions (I was lucky that the first filename I tried doesn't match the regex):

    __LINECACHE_FILENAME_RE = re.compile(r'<doctest '
                                         r'(?P<name>[\w\.]+)'
                                         r'\[(?P<examplenum>\d+)\]>$')

Here <name> is the file name, but file names can contain more than word characters and dots. Maybe the pattern should be slightly different, like:

    __LINECACHE_FILENAME_RE = re.compile(r'<doctest '
                                         r'(?P<name>.+?)'
                                         r'\[(?P<examplenum>\d+)\]>$', re.UNICODE)

That would allow for other kinds of names. But this is not the main problem, anyway.
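
To illustrate the filename point, here is a small sketch comparing the two patterns. The synthetic filenames follow the `<doctest NAME[N]>` shape the regex expects; the names ORIGINAL_RE, PROPOSED_RE and parse are made up for this example and do not appear in doctest.py.

```python
import re

# Pattern as quoted from doctest.py: name limited to word characters and dots.
ORIGINAL_RE = re.compile(r'<doctest '
                         r'(?P<name>[\w\.]+)'
                         r'\[(?P<examplenum>\d+)\]>$')

# Pattern as proposed above: any characters, matched lazily.
PROPOSED_RE = re.compile(r'<doctest '
                         r'(?P<name>.+?)'
                         r'\[(?P<examplenum>\d+)\]>$', re.UNICODE)


def parse(pattern, filename):
    """Return (name, examplenum) if the pattern matches, else None."""
    m = pattern.match(filename)
    return (m.group('name'), int(m.group('examplenum'))) if m else None


for filename in ('<doctest ascii.txt[0]>', '<doctest non-ascii.txt[0]>'):
    print(filename, parse(ORIGINAL_RE, filename), parse(PROPOSED_RE, filename))
```

The hyphen in non-ascii.txt is enough to make the original pattern fail. That would be consistent with that file never reaching the patched code path and therefore running fine.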

To solve my problem, I propose reverting that first snippet to how it was in Python 2.6. The diff would be:

--- /usr/local/lib/python2.7/doctest.py	2010-07-28 22:07:01.272234398 -0300
+++ doctest.py	2010-07-28 22:20:42.000000000 -0300
@@ -1328,8 +1328,7 @@
         m = self.__LINECACHE_FILENAME_RE.match(filename)
         if m and m.group('name') == self.test.name:
             example = self.test.examples[int(m.group('examplenum'))]
-            source = example.source.encode('ascii', 'backslashreplace')
-            return source.splitlines(True)
+            return example.source.splitlines(True)
         else:
             return self.save_linecache_getlines(filename, module_globals)
msg111882 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-29 01:32
This change was introduced in r79307 (see #7667).
The error seems to be raised because example.source is a str rather than unicode, so it gets decoded implicitly before being encoded with ascii+backslashreplace. I don't know if example.source is always supposed to be str or if the type might be different in some situations.
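
A rough sketch of that implicit step, written for Python 3, where the decode has to be spelled out (in Python 2, calling .encode() on a byte str triggered it silently). The helper names py2_style_encode and encode_if_text are hypothetical, and the guard in the second helper is only one possible way to cope with both types, not necessarily what ends up in doctest.py.

```python
SOURCE_TEXT = "print('caf\u00e9')\n"        # text (Python 2: unicode)
SOURCE_BYTES = SOURCE_TEXT.encode("utf-8")  # bytes (Python 2: str)


def py2_style_encode(byte_string):
    """Emulate Python 2's str.encode(): implicit ASCII decode, then encode."""
    text = byte_string.decode("ascii")  # this is the step that fails
    return text.encode("ascii", "backslashreplace")


def encode_if_text(source):
    """Hypothetical guard: escape text, pass bytes through untouched."""
    if isinstance(source, str):
        return source.encode("ascii", "backslashreplace")
    return source


try:
    py2_style_encode(SOURCE_BYTES)
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
print(encode_if_text(SOURCE_TEXT))
print(encode_if_text(SOURCE_BYTES))
```

The failure comes from the decode, not the encode, which matches the UnicodeDecodeError in the traceback even though the failing line calls .encode().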
msg112233 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-08-01 00:02
Adding the release manager to nosy so that he can confirm this bugfix can make it into the next 2.7 release before more effort is spent on it.
msg118719 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-10-14 21:48
Fixed in 2.7 with r85496 and r85501. Thank you.

(In 3.2, only tests: r85495 and r85500.)
History
Date                  User          Action  Args
2022-04-11 14:57:04   admin         set     github: 53655
2010-10-14 21:48:31   flox          set     status: open -> closed
                                            resolution: fixed
                                            messages: + msg118719
                                            stage: test needed -> resolved
2010-08-01 00:02:58   eric.araujo   set     nosy: + eric.araujo, benjamin.peterson
                                            messages: + msg112233
2010-08-01 00:01:34   eric.araujo   set     nosy: + vstinner
2010-07-29 01:32:03   ezio.melotti  set     nosy: + ezio.melotti, flox
                                            messages: + msg111882
                                            components: + Library (Lib), Tests, Unicode
                                            stage: test needed
2010-07-29 01:25:20   hugo          set     files: + example.py
2010-07-29 01:24:59   hugo          set     files: + non-ascii.txt
2010-07-29 01:24:06   hugo          create