classification
Title: tokenize.open(): open a file with encoding detected from a coding cookie
Type: enhancement Stage: patch review
Components: Library (Lib), Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, eric.araujo, georg.brandl, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2010-11-06 10:49 by vstinner, last changed 2010-11-09 01:12 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
open_python.patch vstinner, 2010-11-06 10:49 review
tokenize_open-2.patch vstinner, 2010-11-07 09:24
Messages (15)
msg120600 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-06 10:49
In Python3, the following pattern becomes common:

        with open(fullname, 'rb') as fp:
            coding, line = tokenize.detect_encoding(fp.readline)
        with open(fullname, 'r', encoding=coding) as fp:
            ...

This opens the file twice, which is unnecessary: it is possible to reuse the raw buffer to create a text file. And I don't like the detect_encoding() API: passing the readline function is not intuitive.

I propose to create a tokenize.open_python() function with a very simple API: just one argument, the filename. This function calls detect_encoding() and only opens the file once.

The attached patch adds the function with a unit test and a documentation update. It also patches the functions currently using detect_encoding().

open_python() only supports read mode. I suppose that is enough.
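[A rough sketch of the proposed helper, not the attached patch: detect the encoding from the coding cookie, then rewind and wrap the same binary buffer in a text layer, so the file is only opened once. The rewind detail is illustrative.]

```python
# Sketch of the proposed open_python(): reuse the binary buffer that
# detect_encoding() already read from, instead of reopening the file.
import io
from tokenize import detect_encoding

def open_python(filename):
    """Open a Python script in read mode with the detected encoding."""
    buffer = io.open(filename, 'rb')
    encoding, _ = detect_encoding(buffer.readline)
    buffer.seek(0)  # rewind past the lines consumed by detect_encoding()
    text = io.TextIOWrapper(buffer, encoding, line_buffering=True)
    text.mode = 'r'
    return text
```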
msg120616 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 14:31
Looks good.

In the test, do you have to remove the TESTFN file manually?  Isn’t that handled by unittest or regrtest?

I think the name of the function could be better but I don’t have a proposal.
msg120628 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-06 15:42
> In the test, do you have to remove the TESTFN file manually?  Isn’t that
> handled by unittest or regrtest?

The test uses TESTFN+'.py', not TESTFN ;-)

> I think the name of the function could be better but I don’t have a
> proposal.

Do you prefer a name like tokenize.open()?
msg120629 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 15:44
> The test uses TESTFN+'.py', not TESTFN ;-)
Ah, right.  You can use addCleanup then.

> Do you prefer a name like tokenize.open()?
Hm, tokenize being “Tokenization help for Python programs”, I think this name is rather good.  Or tokenize.open_source.
msg120632 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-06 16:00
+1 on adding the function.

Note that it is useful for opening any text file with an encoding cookie, not only python source code, so "tokenize.open()" sounds attractive.

Once we are at it, I would like to factor out and document code that extracts the cookie from a decoded string.  This functionality is useful for tools that produce annotated source code and want to preserve source encoding in their output.  See issue10329.

Also, to answer Victor's comment on issue10329, msg120601: while using this function simplifies dealing with source files in the trace module, it does not solve all the issues: the source file will still be opened separately in linecache.getlines() and find_executable_linenos(), and find_executable_linenos() still does not respect the __loader__ interface.  See issue 10342.
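[The cookie-extraction logic he suggests factoring out boils down to the PEP 263 regular expression. A simplified sketch — the name extract_cookie is hypothetical, and unlike detect_encoding() this ignores the rule that line 2 only counts when line 1 is blank or a comment:]

```python
import re

# PEP 263 coding-cookie pattern, close to the one tokenize uses
# internally; the cookie must appear on one of the first two lines.
COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def extract_cookie(source):
    """Return the encoding named by a coding cookie in *source*, or None.

    *source* is an already decoded string, as produced by tools that
    annotate source code and want to preserve its encoding on output.
    """
    for line in source.splitlines()[:2]:
        match = COOKIE_RE.match(line)
        if match:
            return match.group(1)
    return None
```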
msg120635 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 16:15
> Note that it is useful for opening any text file with an encoding
> cookie, not only python source code,
This is cool.  doctest could benefit from that, I think.

> so "tokenize.open()" sounds attractive.
Agreed, even though the docstring of tokenize does talk about Python code.
msg120637 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-06 16:25
On Sat, Nov 6, 2010 at 12:15 PM, Éric Araujo <report@bugs.python.org> wrote:
..
>> so "tokenize.open()" sounds attractive.
> Agreed, even though the docstring of tokenize does talk about Python code.

I still like "tokenize.open()" more, but for the sake of completeness
of the discussion, what about open(.., encoding="fromcookie")?  (I know,
tokenize.detect_encoding() also looks for a UTF-8 BOM, so
"fromcookie" can be improved.)
msg120638 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-06 16:32
I don’t know if it would be okay to depend on tokenize in io.  Anyway, I tend to prefer tokenize.open over builtins.open(..., encoding='fromcookie').
msg120643 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-11-06 18:00
+1 for the feature.  However, tokenize.open() sounds a bit unspecific.  But I don't really have better suggestions; open_cookie() is wrong too, since it doesn't open cookies :)
msg120661 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-07 09:04
> what about open(.., encoding="fromcookie")?

Please don't do that. It reminds me of the magical str.encode() method of 
Python 2. I prefer a simple API with limited features: if you want to open a 
Python script, use tokenize.open(); if you want to open an XML document, use 
any XML library, etc. It also reminds me of the discussion about detecting BOMs 
in text files: issue #7651. There was no consensus, and so the issue is still 
open.

For this particular issue, I prefer a specific function tokenize.<choose a 
function name>.
msg120662 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-07 09:24
On Saturday 06 November 2010 17:00:15 you wrote:
> Note that it is useful for opening any text file with an encoding cookie,
> not only python source code, so "tokenize.open()" sounds attractive.

Ok, the new patch (tokenize_open-2.patch) uses the tokenize.open() name and adds a 
test for a BOM without a coding cookie (testing the utf-8-sig encoding).
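[A check in that spirit might look like this — a sketch with a hypothetical helper name, not the attached patch:]

```python
# Sketch of the added check: a UTF-8 BOM without a coding cookie
# should make tokenize.open() pick the utf-8-sig encoding.
import os
import tempfile
import tokenize

def check_bom_without_cookie():
    fd, path = tempfile.mkstemp(suffix='.py')
    os.close(fd)
    try:
        # 'utf-8-sig' writes the BOM; no coding cookie follows.
        with open(path, 'w', encoding='utf-8-sig') as fp:
            fp.write('print("hello")\n')
        with tokenize.open(path) as fp:
            assert fp.encoding == 'utf-8-sig'
            assert fp.read() == 'print("hello")\n'  # BOM is stripped
    finally:
        os.unlink(path)
```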
msg120755 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-08 15:28
On Sun, Nov 7, 2010 at 4:24 AM, STINNER Victor <report@bugs.python.org> wrote:
..
> Ok, the new patch (tokenize_open-2.patch) uses tokenize.open() name and adds a
> test for BOM without coding cookie (test utf-8-sig encoding).

Here are my comments on the new patch:

1. A nit-pick: tokenize.py is already inconsistent in this respect,
but I believe it is conventional not to start a docstring with an empty
line:

+def open(filename):
+    """Open a Python script in read mode with the right encoding.

instead of

+def open(filename):
+    """
+    Open a Python script in read mode with the right encoding.

Also, I would prefer a more neutral description that will not imply
that the method is only useful for Python source code.  For example:

"""Detect encoding and open a file in read mode with the right encoding.

(optionally a more detailed explanation or a reference to detect_encoding.)
"""

2. Another nit-pick: detect_encoding() returns a list of 0-2 lines in
the second item, not a single line, so it is better to write

+    encoding, lines = detect_encoding(buffer.readline)

or even

+    encoding, _ = detect_encoding(buffer.readline)

instead of

+    encoding, line = detect_encoding(buffer.readline)

3. In test_open(self), I think you should use self.assertEqual()
instead of a plain assert.

4. Last nit-pick (feel free to ignore as prejudice to %-formatting :-):

instead of

+                print("# coding: %s" % encoding, file=fp)

you can write

+                print("# coding: " + encoding, file=fp)
msg120756 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-08 15:34
Changing the title to make the latest choice of function name more visible.
msg120758 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-08 15:35
1. Agreed on both.  Extraneous blank lines seem to be favored by people who need to work around bugs in their editor.

2. Yes, “_” is recommended for a discarded object.

3. Agreed.

4. Even better: print('# coding:', encoding, file=fp).  No intermediate string!
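[Putting the review points together, the revised test could look roughly like this — a sketch with a stand-in temporary file name, not the committed test:]

```python
# Sketch of the test with the review comments applied: addCleanup()
# for file removal, assertEqual() instead of a bare assert, and
# print() with multiple arguments for the cookie line.
import os
import tokenize
import unittest

TESTFN = 'tokenize_open_testfile'  # stand-in for test.support.TESTFN

class OpenTest(unittest.TestCase):
    def test_open(self):
        filename = TESTFN + '.py'
        self.addCleanup(os.unlink, filename)
        with open(filename, 'w', encoding='utf-8') as fp:
            print('# coding:', 'utf-8', file=fp)
            print('text = "café"', file=fp)
        with tokenize.open(filename) as fp:
            self.assertEqual(fp.encoding, 'utf-8')
            self.assertEqual(fp.mode, 'r')
```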
msg120831 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-09 01:12
Committed to Python 3.2 (r86346 + r86347). Thanks for your reviews.
History
Date User Action Args
2010-11-09 01:12:02vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg120831
2010-11-08 15:35:39eric.araujosetmessages: + msg120758
title: tokenize.open(): open a Python file with the right encoding -> tokenize.open(): open a file with encoding detected from a coding cookie
2010-11-08 15:34:35belopolskysettype: enhancement
title: tokenize.open_python(): open a Python file with the right encoding -> tokenize.open(): open a Python file with the right encoding
messages: + msg120756
stage: patch review
2010-11-08 15:28:13belopolskysetmessages: + msg120755
2010-11-07 09:24:45vstinnersetfiles: + tokenize_open-2.patch

messages: + msg120662
2010-11-07 09:04:47vstinnersetmessages: + msg120661
2010-11-06 18:00:59georg.brandlsetnosy: + georg.brandl
messages: + msg120643
2010-11-06 16:32:00eric.araujosetnosy: + pitrou
messages: + msg120638
2010-11-06 16:25:31belopolskysetmessages: + msg120637
2010-11-06 16:15:12eric.araujosetmessages: + msg120635
2010-11-06 16:00:12belopolskysetnosy: + belopolsky
messages: + msg120632
2010-11-06 15:44:13eric.araujosetmessages: + msg120629
2010-11-06 15:42:20vstinnersetmessages: + msg120628
2010-11-06 14:31:04eric.araujosetnosy: + eric.araujo
messages: + msg120616
2010-11-06 10:49:38vstinnercreate