
classification
Title: Tools/scripts/reindent.py fails on non-UTF-8 encodings
Type: behavior
Stage: resolved
Components: Demos and Tools
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: belopolsky, christian.heimes, eric.araujo, flox, georg.brandl, iritkatriel, serhiy.storchaka, tim.peters, vstinner
Priority: normal
Keywords: needs review, patch

Created on 2010-10-15 16:54 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name           Uploaded
reindent.diff       belopolsky, 2010-10-15 16:54
reindent_coding.py  vstinner, 2011-07-07 23:25
Messages (13)
msg118804 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-10-15 16:54
Tools/scripts/reindent.py -d Lib/test/encoded_modules/module_koi8_r.py
Traceback (most recent call last):
  File "Tools/scripts/reindent.py", line 310, in <module>
    main()
  File "Tools/scripts/reindent.py", line 93, in main
    check(arg)
  File "Tools/scripts/reindent.py", line 114, in check
    r = Reindenter(f)
  File "Tools/scripts/reindent.py", line 162, in __init__
    self.raw = f.readlines()
  File "Lib/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 59: invalid continuation byte

Attached patch fixes this issue.
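
The traceback above comes from decoding a KOI8-R source file as UTF-8. As a minimal sketch of one way to avoid that (not necessarily what reindent.diff does), the declared source encoding can be looked up with tokenize.detect_encoding() before the file is opened as text. The helper name read_source and the temporary sample file are illustrative assumptions.

```python
import os
import tempfile
import tokenize


def read_source(path):
    """Return (lines, encoding) for a Python source file.

    The encoding is taken from the BOM or the coding cookie; UTF-8 is
    the default when neither is present.
    """
    with open(path, "rb") as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
    with open(path, encoding=encoding) as f:
        return f.readlines(), encoding


# Illustrative KOI8-R module written to a temporary file.
source = "# -*- coding: koi8-r -*-\nname = 'Питон'\n"
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(source.encode("koi8-r"))

try:
    lines, encoding = read_source(tmp.name)
finally:
    os.unlink(tmp.name)
```

Writing the result back with the same encoding value keeps the coding cookie and the file contents consistent.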
msg118810 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-15 17:45
+1.
msg118812 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-10-15 17:53
LGTM.
msg119026 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-10-18 14:48
Committed in r85695.  Leaving open to discuss whether anything can/should be done for the case when reindent acts as an stdin to stdout filter.  Also, what is the policy on backporting Tools' bug fixes?
msg119276 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-21 11:44
When working as a filter, reindent should use sys.{stdin,stdout}.encoding (defaulting to sys.getdefaultencoding()) for reading and writing, respectively.  Detecting encoding on streams is not worth it IMO.  People can set PYTHONIOENCODING for baroque needs.
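
A rough sketch of the fallback described in this message, assuming a hypothetical helper named stream_encoding: take the encoding from the stream when it has one, otherwise fall back to sys.getdefaultencoding(). This only illustrates the proposal and is not code from reindent.py.

```python
import io
import sys


def stream_encoding(stream):
    """Encoding of a text stream, or the interpreter default if unset."""
    return getattr(stream, "encoding", None) or sys.getdefaultencoding()


# A text wrapper created with an explicit encoding reports that encoding.
latin1_stream = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
explicit = stream_encoding(latin1_stream)

# io.StringIO has no encoding (the attribute is None), so the default is used.
fallback = stream_encoding(io.StringIO())
```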
msg139967 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-07 10:50
> Leaving open to discuss whether anything can/should be done
> for the case when reindent acts as an stdin

sys.stdin.buffer and sys.stdout.buffer should be used with tokenize.detect_encoding(). We could first read all of stdin into a BytesIO object so that we can rewind after detect_encoding(). Something like:

import io
import sys
import tokenize

# Buffer all of stdin so it can be rewound after encoding detection.
content = sys.stdin.buffer.read()
raw = io.BytesIO(content)
buffer = io.BufferedReader(raw)
# detect_encoding() reads at most two lines looking for a BOM or coding cookie.
encoding, _ = tokenize.detect_encoding(buffer.readline)
buffer.seek(0)
text = io.TextIOWrapper(buffer, encoding)
# use text
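
For the output side, a self-contained sketch under the same idea: decode the bytes with the detected encoding, transform the text, and re-encode with that same encoding so the coding cookie stays valid. The names filter_bytes and strip_trailing_spaces are hypothetical; the transform merely stands in for the real reindentation step, and this is not the content of reindent_coding.py.

```python
import io
import tokenize


def filter_bytes(data, transform):
    """Decode source bytes with their declared encoding, apply
    transform to the text, and re-encode with the same encoding."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    # newline="" leaves the original line endings untranslated.
    text = io.TextIOWrapper(io.BytesIO(data), encoding, newline="").read()
    return transform(text).encode(encoding)


def strip_trailing_spaces(text):
    """Placeholder transform: drop trailing whitespace on every line."""
    return "".join(line.rstrip() + "\n" for line in text.splitlines())


data = "# -*- coding: iso-8859-1 -*-\ns = 'café'   \n".encode("iso-8859-1")
result = filter_bytes(data, strip_trailing_spaces)
```

In an actual filter, data would come from sys.stdin.buffer.read() and result would be passed to sys.stdout.buffer.write().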
msg140001 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-07 23:25
reindent_coding.py: patch fixing reindent.py when using pipes (stdin and stdout).
msg140003 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-07-07 23:43
This is a lot more code than what I’d have expected.

What is your opinion on my previous message?
msg140005 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-07-07 23:47
> When working as a filter, reindent should use sys.{stdin,stdout}.encoding
> (defaulting to sys.getdefaultencoding()) for reading and writing,
> respectively.

It just doesn't work: you cannot read an ISO-8859-1 file as UTF-8 (which is what happens if your locale encoding is UTF-8).
msg140021 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-07-08 11:19
Even with PYTHONIOENCODING?
msg315607 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-22 11:15
I concur with Éric. Standard input and output are text streams in Python 3. The user can control their encoding by setting locale or PYTHONIOENCODING.
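
As a small illustration of the PYTHONIOENCODING route, the sketch below starts a child interpreter with that variable set to ISO-8859-1 and feeds it a single non-ASCII byte on standard input. The child script and variable names are illustrative assumptions; the point is only that the child's text-mode stdin decodes the byte with the requested encoding.

```python
import os
import subprocess
import sys

# Child process: print the code point of the first character read from
# its text-mode standard input.
child_code = "import sys; print(ord(sys.stdin.read(1)))"

# Ask the child to treat its standard streams as ISO-8859-1.
env = dict(os.environ, PYTHONIOENCODING="iso-8859-1")

proc = subprocess.run(
    [sys.executable, "-c", child_code],
    input=b"\xe9",  # "é" encoded as ISO-8859-1
    stdout=subprocess.PIPE,
    env=env,
    check=True,
)
code_point = int(proc.stdout.decode("ascii").strip())
```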

I think this issue can be closed now unless somebody wants to backport the fix to 2.7.
msg377111 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-09-18 12:04
Since there won't be a Python 2.7 backport, should this issue be closed?
msg377114 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-09-18 12:58
> Committed in r85695.  Leaving open to discuss whether anything can/should be done for the case when reindent acts as an stdin to stdout filter.  Also, what is the policy on backporting Tools' bug fixes?

This is the commit:

commit 4a98e3b6d06e5477e5d62f18e85056cbb7253f98
Author: Alexander Belopolsky <alexander.belopolsky@gmail.com>
Date:   Mon Oct 18 14:43:38 2010 +0000

    Issue #10117: Tools/scripts/reindent.py now accepts source files that
    use encoding other than ASCII or UTF-8.  Source encoding is preserved
    when reindented code is written to a file.


> Since there won't be a python 2.7 backport, should this issue be closed?

Right, the 2.7 branch is closed. I'm closing the issue.
History
Date                 User              Action  Args
2022-04-11 14:57:07  admin             set     github: 54326
2020-09-18 12:58:23  vstinner          set     status: open -> closed; resolution: fixed; messages: + msg377114; stage: resolved
2020-09-18 12:04:38  iritkatriel       set     status: pending -> open; nosy: + iritkatriel; messages: + msg377111
2018-04-22 11:15:28  serhiy.storchaka  set     status: open -> pending; messages: + msg315607
2012-10-13 23:02:01  serhiy.storchaka  set     nosy: + serhiy.storchaka
2011-07-08 11:19:44  eric.araujo       set     messages: + msg140021
2011-07-07 23:47:27  vstinner          set     messages: + msg140005
2011-07-07 23:43:16  eric.araujo       set     messages: + msg140003
2011-07-07 23:25:02  vstinner          set     files: + reindent_coding.py; messages: + msg140001; versions: + Python 3.3
2011-07-07 10:50:01  vstinner          set     nosy: + vstinner; messages: + msg139967
2010-10-21 11:44:18  eric.araujo       set     messages: + msg119276
2010-10-18 14:48:10  belopolsky        set     messages: + msg119026
2010-10-15 17:53:44  georg.brandl      set     nosy: + georg.brandl; messages: + msg118812
2010-10-15 17:45:26  eric.araujo       set     nosy: + eric.araujo; messages: + msg118810
2010-10-15 16:56:41  belopolsky        set     nosy: + tim.peters, christian.heimes, flox
2010-10-15 16:54:39  belopolsky        create