msg120600 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-11-06 10:49 |
In Python 3, the following pattern becomes common:
    with open(fullname, 'rb') as fp:
        coding, line = tokenize.detect_encoding(fp.readline)
    with open(fullname, 'r', encoding=coding) as fp:
        ...
It opens the file twice, which is unnecessary: it's possible to reuse the raw buffer to create a text file. And I don't like the detect_encoding() API: passing the readline function is not intuitive.
I propose to add a tokenize.open_python() function with a very simple API: just one argument, the filename. The function calls detect_encoding() and opens the file only once.
The attached patch adds the function with a unit test and updates the documentation. It also patches the functions that currently use detect_encoding().
open_python() only supports read mode. I suppose that is enough.
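The double open can be avoided by handing the raw buffer that detect_encoding() already consumed to a TextIOWrapper. A minimal sketch of the idea (the name open_python matches the proposal; the exact error handling is illustrative, not the final patch):

```python
import io
import tokenize

def open_python(filename):
    """Open a Python script in read mode with the detected encoding."""
    buffer = open(filename, 'rb')    # builtins.open, binary mode
    try:
        encoding, _ = tokenize.detect_encoding(buffer.readline)
        buffer.seek(0)               # rewind past the lines probed for the cookie
        return io.TextIOWrapper(buffer, encoding, line_buffering=True)
    except Exception:
        buffer.close()
        raise
```

Wrapping the existing buffer is what makes a single open() call sufficient: the file is never reopened, only re-decoded from the start.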
|
msg120616 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-06 14:31 |
Looks good.
In the test, do you have to remove the TESTFN file manually? Isn’t that handled by unittest or regrtest?
I think the name of the function could be better but I don’t have a proposal.
|
msg120628 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-11-06 15:42 |
> In the test, do you have to remove the TESTFN file manually? Isn’t that
> handled by unittest or regrtest?
The test uses TESTFN+'.py', not TESTFN ;-)
> I think the name of the function could be better but I don’t have a
> proposal.
Do you prefer a name like tokenize.open()?
|
msg120629 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-06 15:44 |
> The test uses TESTFN+'.py', not TESTFN ;-)
Ah, right. You can use addCleanup then.
> Do you prefer a name like tokenize.open()?
Hm, tokenize being “Tokenization help for Python programs”, I think this name is rather good. Or tokenize.open_source.
|
msg120632 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-06 16:00 |
+1 on adding the function.
Note that it is useful for opening any text file with an encoding cookie, not only python source code, so "tokenize.open()" sounds attractive.
Once we are at it, I would like to factor out and document code that extracts the cookie from a decoded string. This functionality is useful for tools that produce annotated source code and want to preserve source encoding in their output. See issue10329.
Also, to answer Victor's comment on issue10329 (msg120601): while using this function simplifies dealing with source files in the trace module, it does not solve all the issues: the source file will still be opened separately in linecache.getlines() and find_executable_linenos(), and find_executable_linenos() still does not respect the __loader__ interface. See issue10342.
|
msg120635 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-06 16:15 |
> Note that it is useful for opening any text file with an encoding
> cookie, not only python source code,
This is cool. doctest could benefit from that, I think.
> so "tokenize.open()" sounds attractive.
Agreed, even though the docstring of tokenize does talk about Python code.
|
msg120637 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-06 16:25 |
On Sat, Nov 6, 2010 at 12:15 PM, Éric Araujo <report@bugs.python.org> wrote:
..
>> so "tokenize.open()" sounds attractive.
> Agreed, even though the docstring of tokenize does talk about Python code.
I still like "tokenize.open()" more, but for the sake of completeness
of discussion, what about open(.., encoding="fromcookie")? (I know,
tokenize.detect_encoding() also looks for a UTF-8 BOM, so
"fromcookie" could be improved.)
|
msg120638 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-06 16:32 |
I don’t know if it would be okay to depend on tokenize in io. Anyway, I tend to prefer tokenize.open over builtins.open(..., encoding='fromcookie').
|
msg120643 - (view) |
Author: Georg Brandl (georg.brandl) * |
Date: 2010-11-06 18:00 |
+1 for the feature. However, tokenize.open() sounds a bit unspecific. But I don't really have better suggestions; open_cookie() is wrong too, since it doesn't open cookies :)
|
msg120661 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-11-07 09:04 |
> what about open(.., encoding="fromcookie")?
Please don't do that. It reminds me of the magical str.encode() method of
Python 2. I prefer a simple API with limited features: if you want to open a
Python script, use tokenize.open(); if you want to open an XML document, use
any XML library; etc. It also reminds me of the discussion about detecting a BOM
in text files: issue #7651. There was no consensus, so the issue is still
open.
For this particular issue, I prefer a specific function, tokenize.<choose a
function name>.
|
msg120662 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-11-07 09:24 |
On Saturday 06 November 2010 17:00:15 you wrote:
> Note that it is useful for opening any text file with an encoding cookie,
> not only python source code, so "tokenize.open()" sounds attractive.
Ok, the new patch (tokenize_open-2.patch) uses the tokenize.open() name and adds a
test for a BOM without a coding cookie (testing the utf-8-sig encoding).
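For reference, the behavior that the new test exercises: detect_encoding() reports utf-8-sig for input that starts with a UTF-8 BOM and carries no coding cookie. A quick sketch using an in-memory buffer rather than the patch's on-disk test file:

```python
import io
import tokenize

# A UTF-8 BOM but no "# coding:" cookie on the first lines.
data = b'\xef\xbb\xbfprint("hello")\n'
encoding, lines = tokenize.detect_encoding(io.BytesIO(data).readline)
# detect_encoding() strips the BOM from the returned lines and
# reports the encoding as 'utf-8-sig'.
```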
|
msg120755 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-08 15:28 |
On Sun, Nov 7, 2010 at 4:24 AM, STINNER Victor <report@bugs.python.org> wrote:
..
> Ok, the new patch (tokenize_open-2.patch) uses tokenize.open() name and adds a
> test for BOM without coding cookie (test utf-8-sig encoding).
Here are my comments on the new patch:
1. A nit-pick: tokenize.py is already inconsistent in this respect,
but I believe it is conventional not to start a docstring with an empty
line:
+def open(filename):
+ """Open a Python script in read mode with the right encoding.
instead of
+def open(filename):
+ """
+ Open a Python script in read mode with the right encoding.
Also, I would prefer a more neutral description that will not imply
that the method is only useful for Python source code. For example:
"""Detect encoding and open a file in read mode with the right encoding.
(optionally a more detailed explanation or a reference to detect_encoding.)
"""
2. Another nit-pick: detect_encoding() returns a list of 0-2 lines in
the second item, not a single line, so it is better to write
+ encoding, lines = detect_encoding(buffer.readline)
or even
+ encoding, _ = detect_encoding(buffer.readline)
instead of
+ encoding, line = detect_encoding(buffer.readline)
3. In test_open(self), I think you should use self.assertEqual()
instead of a plain assert.
4. Last nit-pick (feel free to ignore as prejudice to %-formatting :-):
instead of
+ print("# coding: %s" % encoding, file=fp)
you can write
+ print("# coding: " + encoding, file=fp)
|
msg120756 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-08 15:34 |
Changing the title to make the latest choice of function name more visible.
|
msg120758 - (view) |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-08 15:35 |
1. Agreed on both. Extraneous blank lines seem to be favored by people who need to work around bugs in their editor.
2. Yes, “_” is recommended for a discarded object.
3. Agreed.
4. Even better: print('# coding:', encoding, file=fp). No intermediary string!
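All three spellings write the same bytes; a quick sketch comparing them on in-memory streams:

```python
import io

encoding = 'utf-8'
a, b, c = io.StringIO(), io.StringIO(), io.StringIO()
print("# coding: %s" % encoding, file=a)   # %-formatting
print("# coding: " + encoding, file=b)     # string concatenation
print("# coding:", encoding, file=c)       # let print() insert the space
```

The third form avoids building an intermediary string at all; print() joins the arguments with its default separator.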
|
msg120831 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-11-09 01:12 |
Committed to Python 3.2 (r86346 + r86347). Thanks for your reviews.
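For readers landing here later: the committed function is available since Python 3.2 as tokenize.open(). A small usage example (the file path and contents are illustrative):

```python
import os
import tempfile
import tokenize

# Write a script whose coding cookie declares cp1252.
path = os.path.join(tempfile.mkdtemp(), 'demo.py')
with open(path, 'w', encoding='cp1252') as fp:
    fp.write('# -*- coding: cp1252 -*-\ns = "caf\xe9"\n')

# tokenize.open() detects the cookie and returns a decoded text stream.
with tokenize.open(path) as fp:
    assert fp.encoding == 'cp1252'
    source = fp.read()
```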
|
Date | User | Action | Args
2022-04-11 14:57:08 | admin | set | github: 54544
2010-11-09 01:12:02 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg120831
2010-11-08 15:35:39 | eric.araujo | set | messages: + msg120758; title: tokenize.open(): open a Python file with the right encoding -> tokenize.open(): open a file with encoding detected from a coding cookie
2010-11-08 15:34:35 | belopolsky | set | type: enhancement; title: tokenize.open_python(): open a Python file with the right encoding -> tokenize.open(): open a Python file with the right encoding; messages: + msg120756; stage: patch review
2010-11-08 15:28:13 | belopolsky | set | messages: + msg120755
2010-11-07 09:24:45 | vstinner | set | files: + tokenize_open-2.patch; messages: + msg120662
2010-11-07 09:04:47 | vstinner | set | messages: + msg120661
2010-11-06 18:00:59 | georg.brandl | set | nosy: + georg.brandl; messages: + msg120643
2010-11-06 16:32:00 | eric.araujo | set | nosy: + pitrou; messages: + msg120638
2010-11-06 16:25:31 | belopolsky | set | messages: + msg120637
2010-11-06 16:15:12 | eric.araujo | set | messages: + msg120635
2010-11-06 16:00:12 | belopolsky | set | nosy: + belopolsky; messages: + msg120632
2010-11-06 15:44:13 | eric.araujo | set | messages: + msg120629
2010-11-06 15:42:20 | vstinner | set | messages: + msg120628
2010-11-06 14:31:04 | eric.araujo | set | nosy: + eric.araujo; messages: + msg120616
2010-11-06 10:49:38 | vstinner | create |