msg113849 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-08-13 23:06 |
For example:
$ ./python.exe Tools/scripts/ Modules/_heapqmodule.c
Traceback (most recent call last):
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 173: invalid continuation byte
I am not sure what the relevant C standard has to say about using non-ASCII characters in comments, but the checking tool should not fail with a traceback in such a situation.
msg114198 - (view) |
Author: PCManticore (Claudiu.Popa) *  |
Date: 2010-08-18 06:01 |
It seems that untabify opens the file with the builtin function open, which makes the call error-prone when it encounters a non-ASCII character. The proper handling would be to use open from the codecs library, passing the encoding as an argument,
e.g., codecs.open(filename, mode, 'utf-8') instead of simply open(filename, mode).
msg114948 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-08-25 23:26 |
The builtin open in 3.2 is similar to codecs.open. If you read the error message closely, you’ll see that the decoding that failed did try to use UTF-8.
The cause of the problem here is that the bytes used for the ç in François’ name are not valid UTF-8; I can fix that. This does not change the original purpose of this report: untabify should not die.
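The failure can be reproduced outside untabify. This is a minimal sketch, assuming the name was stored as Latin-1; the sample bytes are illustrative and not copied from the C file:

```python
# Illustrative sample: "François" stored as Latin-1, where ç is the single byte 0xE7.
latin1_bytes = "Fran\u00e7ois".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xE7 announces a three-byte UTF-8 sequence, but the next byte is a plain "o",
    # so the decoder reports an invalid continuation byte.
    failure = (exc.start, exc.reason)

# The same name stored as UTF-8 decodes without error.
utf8_bytes = "Fran\u00e7ois".encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")
```

So the decoder was already using UTF-8; what was wrong was the bytes in the file.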
msg115517 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-03 22:14 |
Fixed encoding error in r84472 through r84474.
This bug should be reassessed and retitled. If untabify fails because a file has an incorrect encoding, is it really a problem in untabify? This is a developer’s tool, so getting a traceback here seems okay to me. Alexander, please close if you agree.
msg115527 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-09-03 22:47 |
> If untabify fails because a file has an incorrect encoding, is it really
> a problem in untabify? This is a developer’s tool, so getting a
> traceback here seems okay to me.
I disagree. I think we should use this opportunity to clarify the preferred encoding for C language source files in python and make untabify produce a meaningful diagnostic in case of encoding errors.
As a matter of policy, I see two possibilities:
1. Restrict C sources to 7-bit ASCII. (A pedantic reading of the ANSI C standard would probably suggest an even more restricted character set, but practically, I don't think 7-bit ASCII in C comments is likely to cause problems for any tools.)
2. Require UTF-8 encoding for non-ASCII characters. Given that this is the default for python source code, it is likely that tools that are used for python development can handle UTF-8.
My vote is for #1. Display of non-ascii characters is still not universally supported and they are likely to be clobbered when diffs are copied in e-mails etc.
msg115534 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-03 22:58 |
I agree about the need to define the encoding for comments. My vote goes to #2, since I wouldn’t want to see names of authors/contributors mangled in the source. I would reconsider if a specification explicitly forbade that.
I repeat that the title of this bug is misleading: untabify does not fail with non-ASCII bytes, it failed because of invalid bytes.
msg115540 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-09-03 23:18 |
> I wouldn’t want to see names of authors/contributors mangled
> in the source.
This is a reason to write names in ASCII. While Latin-1 is a grey area because most of its characters look familiar to English-speaking developers, I don't think you will easily recognize my name if I write it in Cyrillic and even if you do, chances are you would not be able to search for it. On the other hand, everyone who uses e-mail is likely to have a preferred ASCII spelling of his/her name.
msg115548 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-04 00:25 |
>> I wouldn’t want to see names of authors/contributors mangled
>> in the source.
> This is a reason to write names in ASCII.
Oh, sorry, by “mangled” I meant “forced into ASCII”. I was not speaking about mojibake.
> While Latin-1 is a grey area because most of [its] characters look familiar
> to English-speaking developers,
I don’t think there is an argument for Latin-1. Also, Latin-1 does not have characters but bytes, which are displayed as characters by good editors, like UTF-8 bytes are. The discussion is about ASCII versus UTF-8 in my opinion, let Latin-1 rest in peace.
> I don't think you will easily recognize my name if I write it in Cyrillic
> and even if you do, chances are you would not be able to search for it.
Not such a good example, since I’ve seen your name in the thread about Misc/ACKS sorting and could recognize it, but I get your idea :)
To search, I would use the “search for word under cursor” functionality.
> On the other hand, everyone who uses e-mail is likely to have a preferred
> ASCII spelling of his/her name.
Well, some languages have rules to handle constrained environments: German may use oe for ö, and Italian E' for È. In French, however, there is no such workaround. Leaving accents out of words is a spelling error, nothing more or less. When I’m forced to change my name because of broken old tools I really feel the programmers behind the tool could do better. (I happen to have an ASCII-compatible nickname, which I prefer using to the ASCII-maimed version of my name where I can.)
I feel 2010 is very late to accept that we live in a wide world and that people should be able to just use their names with computer systems.
By the way, you still haven’t retitled this bug to address my other remark :)
msg115571 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-09-04 13:19 |
Other C files converted from latin-1 to utf-8 with r84485.
msg115824 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-09-07 23:44 |
From IRC:
Me: UTF-8 was not strictly valid in ANSI C comments, so it is a bug in untabify to assume UTF-8 in C files.
Merwok: Works for me.
I am lowering the priority because it looks like untabify does not fail on the current code base. I'll follow up on python-dev to find out whether ASCII or UTF-8 should be enforced by untabify.
msg115828 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-08 00:08 |
Why would it be the job of untabify to report invalid non-ASCII characters in C files?
msg115830 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-09-08 00:29 |
On Tue, Sep 7, 2010 at 8:08 PM, Éric Araujo wrote:
> Why would it be the job of untabify to report invalid non-ASCII characters in C files?
Since untabify works by loading C code as text, it has to assume some
encoding. Failing with uncaught decode error (as it currently does
on non UTF-8 source) is not very user friendly. For example, the
diagnostic does not report the position of the offending character and
does not explain how to fix the source.
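A friendlier diagnostic could look roughly like the following. This is only a sketch: the function name describe_decode_error and the message wording are hypothetical, and nothing here is taken from untabify itself.

```python
def describe_decode_error(data, filename, encoding="utf-8"):
    """Return None if data decodes cleanly, else a one-line diagnostic.

    Hypothetical helper for illustration; not part of untabify.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Translate the byte offset of the failure into a 1-based line and column.
        line = data.count(b"\n", 0, exc.start) + 1
        column = exc.start - (data.rfind(b"\n", 0, exc.start) + 1) + 1
        return ("%s:%d:%d: byte 0x%02x is not valid %s (%s); "
                "re-save the file as %s or replace the character"
                % (filename, line, column, data[exc.start], encoding,
                   exc.reason, encoding))
    return None


# Illustrative input: a C comment containing a Latin-1 encoded ç.
sample = b"/* ok */\n/* Fran\xe7ois */\n"
message = describe_decode_error(sample, "example.c")
```

The message names the file, the line and column of the offending byte, and a suggested fix, which are the pieces the bare traceback lacks.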
msg115831 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-08 00:31 |
My real question was: Shouldn’t this be a VCS hook instead of untabify’s job? (or in addition to untabify if you insist)
msg115837 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-09-08 01:11 |
On Tue, Sep 7, 2010 at 8:31 PM, Éric Araujo wrote:
> My real question was: Shouldn’t this be a VCS hook instead of untabify’s job? (or in addition to untabify if you insist)
Yes, VCS hook makes sense (and may almost eliminate the need to
handle invalid bytestreams in untabify). The hard question is still
the same, though: are non-ascii characters allowed in python C code?
My answer is "no".
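If the answer is "no", the check a hook would run is small. The sketch below assumes the hook already has the file contents as bytes; how a particular VCS supplies them is not shown, and the name find_non_ascii is hypothetical.

```python
def find_non_ascii(data):
    """Return (line, column, byte) triples for every byte above 0x7F.

    Hypothetical check for illustration; line and column are 1-based.
    """
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((lineno, col, byte))
    return hits


# Illustrative input: a UTF-8 encoded ç (bytes 0xC3 0xA7) inside a C comment.
hits = find_non_ascii(b"int x;\n/* \xc3\xa7 */\n")
```

A hook built on this would reject the commit whenever the returned list is non-empty, regardless of whether the bytes happen to be valid UTF-8.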
msg115838 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-09-08 01:13 |
I agree with your reply (that’s what I meant with “works for me”, the question about untabify vs. hooks only occurred to me after our IRC exchange).
msg122923 - (view) |
Author: Alexander Belopolsky (belopolsky) *  |
Date: 2010-11-30 17:32 |
Committed revision 86893, which makes untabify respect the encoding cookie in the files it processes. I don't think there is anything else that needs to be done here.
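For readers unfamiliar with encoding cookies, the sketch below shows one way to honour them using tokenize.detect_encoding from the standard library. It is an assumption that this resembles the committed change; the thread does not say how revision 86893 is implemented, and read_with_cookie is a hypothetical name.

```python
import io
import tokenize


def read_with_cookie(data):
    """Decode source bytes using the PEP 263 cookie, defaulting to UTF-8.

    Hypothetical helper for illustration.
    """
    # detect_encoding inspects at most the first two lines for a coding cookie.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding, data.decode(encoding)


# Illustrative input: a Python source file that declares Latin-1.
source = b"# -*- coding: latin-1 -*-\nname = 'Fran\xe7ois'\n"
encoding, text = read_with_cookie(source)
```

A file without a cookie, which is the usual case for C sources, falls back to UTF-8 under this approach.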
Date | User | Action | Args |
2022-04-11 14:57:05 | admin | set | github: 53807 |
2010-12-30 22:14:16 | georg.brandl | unlink | issue7962 dependencies |
2010-11-30 17:32:22 | belopolsky | set | status: open -> closed; resolution: fixed; messages: + msg122923; stage: resolved |
2010-09-08 01:13:13 | eric.araujo | set | messages: + msg115838 |
2010-09-08 01:11:42 | belopolsky | set | messages: + msg115837 |
2010-09-08 00:31:53 | eric.araujo | set | messages: + msg115831 |
2010-09-08 00:29:52 | belopolsky | set | messages: + msg115830 |
2010-09-08 00:08:29 | eric.araujo | set | messages: + msg115828 |
2010-09-07 23:44:29 | belopolsky | set | priority: normal -> low; assignee: belopolsky; messages: + msg115824 |
2010-09-04 13:19:45 | flox | set | nosy: + flox; messages: + msg115571; components: + Unicode |
2010-09-04 00:25:48 | eric.araujo | set | messages: + msg115548 |
2010-09-03 23:18:08 | belopolsky | set | messages: + msg115540 |
2010-09-03 22:58:57 | eric.araujo | set | messages: + msg115534 |
2010-09-03 22:47:05 | belopolsky | set | messages: + msg115527 |
2010-09-03 22:14:10 | eric.araujo | set | nosy: + pitrou; messages: + msg115517 |
2010-08-25 23:26:23 | eric.araujo | set | messages: + msg114948 |
2010-08-18 06:01:53 | Claudiu.Popa | set | nosy: + Claudiu.Popa; messages: + msg114198 |
2010-08-13 23:08:48 | belopolsky | set | nosy: + eric.araujo |
2010-08-13 23:06:43 | belopolsky | link | issue7962 dependencies |
2010-08-13 23:06:12 | belopolsky | create | |