Issue21927
Created on 2014-07-06 14:39 by jaraco, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (16) | |||
---|---|---|---|
msg222406 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-06 14:39 | |
Consider this simple example in Powershell (Windows 8.1): C:\Users\jaraco> cat .\print-input.py import sys print(next(sys.stdin)) C:\Users\jaraco> echo foo | .\print-input.py foo The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001. I suspect that Python should be detecting/stripping and possibly honoring the BOM when decoding input on stdin. This issue is present in Python 3.4.0 and Python 3.4.1. I have not tested other Python versions. |
|||
msg222559 - (view) | Author: Ezio Melotti (ezio.melotti) * | Date: 2014-07-08 13:24 | |
I would argue that adding the BOM is a Powershell issue, and I'm not sure Python should do anything about it. There are probably cases where people expect the BOM to be received by Python, so stripping it is probably not an option. As for detecting, it should happen automatically only if sys.stdin.encoding is set to 'utf-8-sig', but, by default, Python 3 uses 'UTF-8'. |
|||
msg222683 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-10 18:23 | |
I'm not sure what you're suggesting. Are you suggesting that Powershell is wrong here and that Powershell's attempt here to provide more detail about content encoding is wrong? Or are you suggesting that every client that reads from stdin should detect that it's running in Powershell or otherwise handle the BOM individually? From my perspective, Powershell is innovating here and providing additional detail about the encoding of the content, but since Python is responsible for the content decoding (especially Python 3), it should honor that detail. I did some tests and determined that 'utf-8-sig' will honor the bom if present and ignore it if missing. Is there any reason Python shouldn't simply use that encoding for decoding stdin? |
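The behaviour of 'utf-8-sig' described above can be checked with a few lines. This is a minimal sketch using in-memory byte strings rather than a real pipe; the variable names are illustrative only.

```python
import codecs

# Bytes as a producer might send them: once with a UTF-8 BOM, once without.
payload_with_bom = codecs.BOM_UTF8 + "foo\n".encode("utf-8")
payload_without_bom = "foo\n".encode("utf-8")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
plain = payload_with_bom.decode("utf-8")

# utf-8-sig drops one leading BOM if present and changes nothing otherwise.
sig_with = payload_with_bom.decode("utf-8-sig")
sig_without = payload_without_bom.decode("utf-8-sig")

print(ascii(plain))        # '\ufefffoo\n'
print(ascii(sig_with))     # 'foo\n'
print(ascii(sig_without))  # 'foo\n'
```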
|||
msg222740 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-11 13:01 | |
I've tested it and setting PYTHONIOENCODING='utf-8-sig' starts to get there. It causes Python to consume the BOM on stdin, but it also causes stdout to print a spurious non-printable character in the output: C:\Users\jaraco> echo foo | ./print-input foo There is a non-printable character before foo. I've included it in this message. In Powershell, it's rendered with a square before foo: □foo Using PowerShell under ConEmu, it appears as a space: foo In cmd.exe, I see this: C:\Users\jaraco>python -c "print('foo')" ∩╗┐foo The space before the 'foo' apparently isn't a space at all. Indeed, the input is being processed as desired, but the output now is not. C:\Users\jaraco> python -c "print('bar')" bar (the non-printable character appears there too) If I copy that text to the clipboard, I find that character is actually a \ufeff (zero-width no-break space, aka byte order mark). So by setting the environment variable to use utf-8-sig for input, it simultaneously changes the output to also use utf-8-sig. So it appears as if setting the environment variable would work for my purposes except that I only want to alter the input encoding and not the output encoding. I think my goal is pretty basic - read text from standard input and write text to standard output on the primary shell included with the most popular operating system. I contend that goal should be easily achieved and straightforward on Python out of the box. What does everyone think of the proposal that Python should simply default to utf-8-sig instead of utf-8 for stdin encoding? |
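One possible way to change only the input side, sketched here under the assumption that the piped bytes really are UTF-8, is to wrap the binary stdin in a 'utf-8-sig' reader and leave stdout alone. `bom_tolerant_reader` is a hypothetical helper name, not something from the standard library, and the pipe is simulated with `io.BytesIO`.

```python
import codecs
import io

def bom_tolerant_reader(binary_stream):
    """Hypothetical helper: decode as UTF-8, dropping one leading BOM if present."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8-sig")

# Simulate what the script would see on a pipe that starts with a BOM.
simulated_pipe = io.BytesIO(codecs.BOM_UTF8 + b"foo\n")
first_line = bom_tolerant_reader(simulated_pipe).readline()
print(ascii(first_line))  # 'foo\n'

# Why PYTHONIOENCODING=utf-8-sig affects output: the same codec, used for
# encoding, writes a BOM, while plain utf-8 does not.
encoded_sig = "foo".encode("utf-8-sig")
encoded_plain = "foo".encode("utf-8")
print(encoded_sig)    # b'\xef\xbb\xbffoo'
print(encoded_plain)  # b'foo'
```

In a real script the same wrapper could be applied as `sys.stdin = bom_tolerant_reader(sys.stdin.buffer)` before any input is read. On Python 3.7 and later, `sys.stdin.reconfigure(encoding="utf-8-sig")` is an alternative with the same caveat about calling it before reading.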
|||
msg222743 - (view) | Author: STINNER Victor (vstinner) * | Date: 2014-07-11 14:04 | |
> The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001.
How do you change the console encoding? Using the chcp command? I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?
I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and ANSI code page 1252:
- by default, the stdin encoding is cp850 (OEM code page) and os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a BOM.
- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.
If I change the console encoding using the command "chcp 65001":
- by default, the stdin encoding = os.device_encoding(0) = "cp65001". sys.stdin.readline() does not contain a BOM.
- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) = "cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM
Note: The UTF-8 BOM is only written once, before the first character.
So the UTF-8 BOM is only written in one case, under these conditions:
- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe, even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001
--
It looks like PowerShell decodes the output of the producer program (echo, type, ...) and then encodes the output to the consumer program (ex: python). It's possible to change the encoding of the encoder by setting the $OutputEncoding variable.
Example to encode to UTF-8 without the BOM: $OutputEncoding = New-Object System.Text.UTF8Encoding($False)
Example to encode to UTF-8 with the BOM: $OutputEncoding = [System.Text.Encoding]::UTF8
Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even if the console encoding is cp850. If you set the console encoding to 65001 (chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get... two UTF-8 BOMs... yeah!
I tried different producer programs: [MS-DOS] echo "abc", [PowerShell] write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content document.txt, python -c "print('abc')". It doesn't look like using a different program changes anything. The UTF-8 BOM is added somewhere by PowerShell between the producer and the consumer programs.
To show the console input and output encodings in PowerShell, type "[console]::InputEncoding" and "[console]::OutputEncoding".
See also: http://stackoverflow.com/questions/22349139/utf8-output-from-powershell |
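The values compared in this message can be collected in one place when reproducing the tests. The following is a small diagnostic sketch; `describe_stdin_encoding` is a made-up name, and the results depend entirely on the platform, the console code page, and whether stdin is a pipe.

```python
import locale
import os
import sys

def describe_stdin_encoding():
    """Gather the encodings discussed above for file descriptor 0 (stdin)."""
    try:
        # Returns an encoding name when fd 0 is a terminal, otherwise None.
        device = os.device_encoding(0)
    except OSError:
        # fd 0 may be closed or invalid in some environments.
        device = None
    return {
        "device_encoding": device,
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdin_encoding": getattr(sys.stdin, "encoding", None),
    }

info = describe_stdin_encoding()
for name, value in info.items():
    print(f"{name}: {value}")
```

Running it once interactively and once with input piped in (for example `echo abc | python script.py`) shows whether `device_encoding` falls back to None in the piped case.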
|||
msg222748 - (view) | Author: STINNER Victor (vstinner) * | Date: 2014-07-11 14:27 | |
See also issues #1602 (Windows console) and #16587 (stdin, _setmode() and wprintf). I tried msvcrt.setmode(0, 0x40000): set stdin mode to _O_U8TEXT. In this mode, echo "abc"|python -c "import sys; print(ascii(sys.stdin.read()))" displays "\xff\xfea\x00b\x00c\x00\n\x00" which is "abc" encoded to UTF-16 (little endian with the BOM), b'\xff\xfe' is the Unicode BOM U+FEFF (u'\uFEFF') encoded to UTF-16-LE. U+FEFF encoded to UTF-8 gives b'\xef\xbb\xbf'. So it looks like it's not an issue of the stdin mode. I tried all modes and I always get the Unicode BOM. |
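The byte sequences quoted above can be verified directly. A short sketch, using only the standard `codecs` constants and the bytes shown in the message:

```python
import codecs

bom = "\ufeff"  # U+FEFF, the character used as a byte order mark

utf16le_bytes = bom.encode("utf-16-le")
utf8_bytes = bom.encode("utf-8")
print(utf16le_bytes)  # b'\xff\xfe'
print(utf8_bytes)     # b'\xef\xbb\xbf'

# The raw bytes observed in _O_U8TEXT mode, decoded as UTF-16-LE.
# The utf-16-le codec does not strip the BOM, so it stays visible as U+FEFF.
observed = b"\xff\xfea\x00b\x00c\x00\n\x00"
decoded = observed.decode("utf-16-le")
print(ascii(decoded))  # '\ufeffabc\n'
```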
|||
msg222761 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-11 16:15 | |
I get different results than @haypo when testing Powershell on Windows 8.1 with Python 3.4.1:
C:\Users\jaraco> chcp 1252
Active code page: 1252
C:\Users\jaraco> $env:PYTHONIOENCODING=''
> How do you change the console encoding? Using the chcp command?
Yes. I recently discovered that if I use chcp 65001 in my Powershell profile, I can finally see Unicode characters output from my Python programs!
> I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?
C:\Users\jaraco> echo foo | python -c "import sys; print(sys.stdin.readline())"
foo
C:\Users\jaraco> python -c "import sys; print(sys.stdin.encoding)"
cp1252
C:\Users\jaraco> chcp 65001
C:\Users\jaraco> echo foo | python -c "import locale, os; print(os.device_encoding(0), locale.getpreferredencoding(False))"
None cp1252
It seems as if something may have changed in Powershell between Windows 7 and 8.1, because my results are inconsistent with your findings. There's a lot more to digest from your response, so I'm going to have to revisit this later. |
|||
msg223129 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2014-07-15 17:40 | |
I find it amusing that the complaint is that Python isn't detecting the BOM and using the info when powershell produces it, but when python produces the BOM, it is powershell that isn't detecting it and using the information. So it looks like there's a bug here in powershell no matter how you look at it ;) |
|||
msg223157 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-15 23:12 | |
I agree there appears to be an inconsistency in how Powershell handles pipes between child processes and between itself and child processes. I'm not complaining about Python, but rather trying to find the best practice here. I'm currently using PYTHONIOENCODING='utf-8-sig' and I've been mostly satisfied with the results. I get the spurious BOM appearing on output, but at least as you say that doesn't seem like a Python problem (at least that has been identified). |
|||
msg223175 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2014-07-16 06:07 | |
> - when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.
What if you echo non-ASCII characters? How are they encoded? Perhaps Python should detect when it is run under PowerShell in a pipe and set the stdin (and/or stdout and stderr) encoding to CP65001. |
|||
msg223189 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2014-07-16 11:43 | |
Here I use the british pound symbol to attempt to answer that question. I've disabled the environment variable PYTHONIOENCODING and not set any code page or loaded any other Powershell profile settings.
PS C:\Users\jaraco> echo £
£
PS C:\Users\jaraco> chcp
Active code page: 437
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.read()))"
'?\n'
PS C:\Users\jaraco> chcp 65001
Active code page: 65001
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.read()))"
'?\n'
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.buffer.read()))"
b'?\r\n'
Curiously, it appears as if powershell is actually receiving a question mark from the pipe. |
|||
msg223192 - (view) | Author: STINNER Victor (vstinner) * | Date: 2014-07-16 12:07 | |
Please use ascii() instead of repr() in your test to identify who replaces characters with question marks. |
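The reason for preferring ascii() here can be sketched as follows: repr() leaves non-ASCII characters in its result, so the console's output encoding may itself turn them into question marks when printing, while ascii() escapes them before anything is written. The sample strings are illustrative.

```python
received_pound = "\xa3\n"     # what stdin would hold if the pound sign arrived intact
received_question = "?\n"     # what stdin would hold if it was already replaced upstream

as_repr = repr(received_pound)    # still contains the non-ASCII character itself
as_ascii = ascii(received_pound)  # contains only ASCII escape sequences

print(as_repr.isascii())          # False
print(as_ascii.isascii())         # True
print(as_ascii)                   # '\xa3\n'
print(ascii(received_question))   # '?\n'
```

With ascii(), a `'?\n'` in the output can only mean that the question mark was already present in the data Python read.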
|||
msg223194 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2014-07-16 12:30 | |
Bytes repr doesn't contain non-ASCII characters, therefore Python is actually receiving a question mark from the pipe. What are the results of the following commands? py -3 -c "import sys; sys.stdout.buffer.write(bytes(range(128, 256)))" py -3 -c "import sys; sys.stdout.buffer.write(bytes(range(128, 256)))" | py -3 -c "import sys; b = sys.stdin.buffer.read(); print(len(b), b)" |
|||
msg223238 - (view) | Author: Eryk Sun (eryksun) * | Date: 2014-07-16 17:43 | |
> PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.buffer.read()))" > b'?\r\n' > Curiously, it appears as if powershell is actually receiving > a question mark from the pipe. PowerShell calls ReadConsoleW to read the console input buffer, i.e. it reads "£" as a wide character from the command line. The default encoding when writing to the pipe should be ASCII [*]. If that's the case it explains the question mark that Python reads from stdin. It's the default replacement character (WC_DEFAULTCHAR) used by WideCharToMultiByte. [*] http://blogs.msdn.com/b/powershell/archive/2006/12/11/outputencoding-to-the-rescue.aspx You can change PowerShell's output encoding to match the console: $OutputEncoding = [Console]::OutputEncoding If the console codepage is 65001, the above is equivalent to setting $OutputEncoding = [System.Text.Encoding]::UTF8 http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8 As Victor mentioned, this setting always writes a BOM, and under codepage 65001 it actually writes 2 BOMs (at least in PowerShell 2). Victor also mentioned that you can avoid the BOM by passing $False to the constructor: $OutputEncoding = New-Object System.Text.UTF8Encoding($False) http://msdn.microsoft.com/en-us/library/system.text.utf8encoding There's still a BOM under codepage 65001, but maybe that's fixed in PowerShell 3. I avoid setting the console to codepage 65001 anyway. ReadFile/WriteFile incorrectly return the number of characters read/written instead of the number of bytes because the call is actually handled by ReadConsoleA/WriteConsoleA. Maybe that's finally fixed in Windows 8. |
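The substitution described above has a close analogue in Python's own codecs. This sketch uses Python's 'replace' error handler as a stand-in; it is not a call to WideCharToMultiByte and only illustrates the same kind of lossy conversion.

```python
pound = "\xa3"  # the pound sign, U+00A3

# An ASCII encoder has no byte for U+00A3, so a replacement '?' is emitted.
as_ascii = pound.encode("ascii", errors="replace")

# Encodings that do cover the character keep it intact.
as_cp1252 = pound.encode("cp1252")
as_utf8 = pound.encode("utf-8")

print(as_ascii)   # b'?'
print(as_cp1252)  # b'\xa3'
print(as_utf8)    # b'\xc2\xa3'
```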
|||
msg378486 - (view) | Author: Eryk Sun (eryksun) * | Date: 2020-10-12 12:08 | |
I'm closing this as a third-party issue with older versions of PowerShell. Newer versions of PowerShell set the output encoding to UTF-8 without a BOM preamble. For example:
PS C:\> $PSVersionTable.PSVersion
Major Minor Patch PreReleaseLabel BuildLabel
----- ----- ----- --------------- ----------
7 0 3
PS C:\> $OutputEncoding.EncodingName
Unicode (UTF-8)
PS C:\> echo ¡¢£¤¥ | py -3 -X utf8 -c "print(ascii(input()))"
'\xa1\xa2\xa3\xa4\xa5'
It's still possible to manually set the output encoding to include a BOM preamble. For example:
PS C:\> $OutputEncoding = [System.Text.Encoding]::UTF8
PS C:\> $OutputEncoding.GetPreamble()
239 187 191
PS C:\> echo ¡¢£¤¥ | py -3 -X utf8 -c "print(ascii(input()))"
'\ufeff\xa1\xa2\xa3\xa4\xa5'
I don't know what would be appropriate for Python's I/O stack in terms of detecting and handling a UTF-8 preamble on any type of file (console/terminal, pipe, disk), i.e. using the "utf-8-sig" encoding instead of "utf-8", as opposed to just letting scripts detect and handle an initial BOM character (U+FEFF) however they see fit. But that discussion needs a new issue if people are interested in supporting new behavior. |
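For the "let scripts handle it" option mentioned above, one possible approach is to drop a single leading U+FEFF from the first line of input. This is a sketch, not an established recipe; `strip_initial_bom` is a hypothetical helper, and the input is simulated with `io.StringIO` using the same characters as the example.

```python
import io

BOM = "\ufeff"

def strip_initial_bom(lines):
    """Yield lines unchanged, except that one leading U+FEFF is removed from the first."""
    first = True
    for line in lines:
        if first:
            first = False
            if line.startswith(BOM):
                line = line[len(BOM):]
        yield line

# Simulated text stream that begins with a decoded BOM character.
simulated_stdin = io.StringIO("\ufeff\xa1\xa2\xa3\xa4\xa5\nsecond line\n")
cleaned = list(strip_initial_bom(simulated_stdin))
print(ascii(cleaned[0]))  # '\xa1\xa2\xa3\xa4\xa5\n'
print(ascii(cleaned[1]))  # 'second line\n'
```

In a script this would be used as `for line in strip_initial_bom(sys.stdin): ...`, which leaves any U+FEFF that appears later in the stream untouched.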
|||
msg378503 - (view) | Author: Jason R. Coombs (jaraco) * | Date: 2020-10-12 15:29 | |
Thanks Eryk for following up. Glad to hear the issue has been fixed upstream! |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:05 | admin | set | github: 66126 |
2020-10-12 15:29:06 | jaraco | set | messages: + msg378503 |
2020-10-12 12:08:08 | eryksun | set | status: open -> closed resolution: third party messages: + msg378486 stage: resolved |
2014-07-16 17:43:08 | eryksun | set | nosy: + eryksun messages: + msg223238 |
2014-07-16 12:30:54 | serhiy.storchaka | set | messages: + msg223194 |
2014-07-16 12:07:05 | vstinner | set | messages: + msg223192 |
2014-07-16 11:43:31 | jaraco | set | messages: + msg223189 |
2014-07-16 06:07:10 | serhiy.storchaka | set | nosy: + serhiy.storchaka messages: + msg223175 |
2014-07-15 23:12:26 | jaraco | set | messages: + msg223157 |
2014-07-15 17:40:19 | r.david.murray | set | nosy: + r.david.murray messages: + msg223129 |
2014-07-11 16:15:06 | jaraco | set | messages: + msg222761 |
2014-07-11 14:27:46 | vstinner | set | messages: + msg222748 |
2014-07-11 14:04:50 | vstinner | set | messages: + msg222743 |
2014-07-11 13:01:26 | jaraco | set | messages: + msg222740 |
2014-07-10 18:23:09 | jaraco | set | messages: + msg222683 |
2014-07-08 13:24:22 | ezio.melotti | set | nosy: + lemburg, loewis type: behavior messages: + msg222559 |
2014-07-06 14:39:22 | jaraco | create |