Author eric.snow
Recipients brett.cannon, eric.snow, loewis
Date 2012-04-20.05:17:16

The behavior of tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() differs unexpectedly, and this has a bearing on the current work on imports.

When a file has no encoding indicator (see PEP 263), the source encoding falls back to UTF-8 (see PEP 3120).  The tokenize module (Lib/ handles this through "detect_encoding()".  The CPython internal tokenizer (Python/tokenizer.c) does so through "PyTokenizer_FindEncodingFilename()".  Both check the first two lines of the file, per PEP 263.
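
As a rough illustration of the lookup described above, the following sketch feeds small in-memory sources to tokenize.detect_encoding().  The sample sources and the helper name "detect" are made up for illustration; only tokenize.detect_encoding() and io.BytesIO come from the standard library.

```python
import io
import tokenize


def detect(source_bytes):
    """Return the encoding name detect_encoding() reports for source_bytes."""
    # detect_encoding() wants a readline callable that yields bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# No coding cookie: the PEP 3120 default applies.
print(detect(b"x = 1\n"))  # utf-8

# PEP 263 cookie on the first line (the name is normalized).
print(detect(b"# -*- coding: latin-1 -*-\nx = 1\n"))  # iso-8859-1

# PEP 263 cookie on the second line, after a shebang.
print(detect(b"#!/usr/bin/env python\n# coding: latin-1\nx = 1\n"))  # iso-8859-1

# A cookie on the third line is too late and is ignored.
print(detect(b"#!/usr/bin/env python\n\n# coding: latin-1\n"))  # utf-8
```

Only the first two lines are consulted, so a cookie anywhere later has no effect on the reported encoding.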

When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss.  However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.
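
The situation can also be probed without a CPython checkout.  The sketch below is an assumption-laden stand-in: "bad_source" is a hypothetical sample (no coding cookie, one byte that is not valid UTF-8), and "probe" is an illustrative helper.  Since the outcome of detect_encoding() on such input may depend on the Python version in use, the helper reports what happens rather than assuming either result.

```python
import io
import tokenize


def probe(source_bytes):
    """Report what detect_encoding() does with source_bytes."""
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    except SyntaxError as exc:
        return "SyntaxError: {}".format(exc)
    return "encoding: {}".format(encoding)


# Hypothetical sample: no coding cookie, and 0xF6 is not valid UTF-8.
bad_source = b"print('\xf6')\n"

# Either an encoding name or a SyntaxError message is printed.
print(probe(bad_source))

# Whatever detection reports, decoding these bytes as UTF-8 fails.
try:
    bad_source.decode("utf-8")
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)
```

If the first print shows "encoding: utf-8", detection succeeded even though the bytes cannot be decoded with that encoding, which is the behavior described above.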

The 'badsyntax_pep3120' test module (Lib/test/ is one file that demonstrates this discrepancy.  I'll use it in the following example.


For tokenize.detect_encoding():

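  import tokenize
  # detect_encoding() needs a readline that returns bytes, so open in binary mode
  with open("cpython/Lib/test/", "rb") as f:
      enc, lines = tokenize.detect_encoding(f.readline)  # returns (encoding, lines read)
  print(enc)  # "utf-8" (no SyntaxError)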

For PyTokenizer_FindEncodingFilename():

I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().

  import _tokenizer
  enc = _tokenizer.detect_encoding("cpython/Lib/test/")  # raises SyntaxError
  print(enc)  # never reached


Some relevant, related notes:

The discrepancies extend further too.  The following code raises a UnicodeDecodeError, rather than a SyntaxError:


In 3.1 (C-based import machinery, Python/import.c), the following import raises a SyntaxError during encoding detection.  In the current repo tip (importlib-based import machinery, Lib/importlib/, it raises a SyntaxError much later, during compilation.

  import test.badsyntax_pep3120

importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...
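
To make the consequence concrete, here is a minimal sketch of how a loader built on tokenize.detect_encoding() might turn source bytes into text.  It is not the importlib code itself; "decode_source" and the sample bytes are illustrative assumptions.  The point is that detection and decoding are separate steps, so a problem that detection lets through can only surface afterwards.

```python
import io
import tokenize


def decode_source(source_bytes):
    """Decode source bytes using the encoding detect_encoding() reports."""
    # Step 1: encoding detection (reads at most the first two lines).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    # Step 2: decoding of the whole file; undecodable bytes fail here.
    text = source_bytes.decode(encoding)
    # Step 3: universal-newline translation of the decoded text.
    newline_decoder = io.IncrementalNewlineDecoder(None, translate=True)
    return newline_decoder.decode(text, final=True)


# Hypothetical sample with a latin-1 cookie and Windows line endings.
sample = b"# coding: latin-1\r\nname = '\xe9'\r\n"
print(repr(decode_source(sample)))
```

With the C tokenizer, by contrast, PyTokenizer_FindEncodingFilename() rejects the file in what corresponds to step 1.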

Date                 User       Action  Args
2012-04-20 05:17:18  eric.snow  set     recipients: + eric.snow, loewis, brett.cannon
2012-04-20 05:17:18  eric.snow  set     messageid: <>
2012-04-20 05:17:17  eric.snow  link    issue14629 messages
2012-04-20 05:17:17  eric.snow  create