classification
Title: BOM appears in stdin when using Powershell
Type: behavior
Stage:
Components: Unicode, Windows
Versions: Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, ezio.melotti, haypo, jason.coombs, lemburg, loewis, r.david.murray, serhiy.storchaka
Priority: normal
Keywords:

Created on 2014-07-06 14:39 by jason.coombs, last changed 2014-07-16 17:43 by eryksun.

Messages (14)
msg222406 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-06 14:39
Consider this simple example in Powershell (Windows 8.1):

C:\Users\jaraco> cat .\print-input.py
import sys
print(next(sys.stdin))

C:\Users\jaraco> echo foo | .\print-input.py
foo

The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001.

I suspect that Python should be detecting/stripping and possibly honoring the BOM when decoding input on stdin.

This issue is present in Python 3.4.0 and Python 3.4.1. I have not tested other Python versions.
msg222559 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-07-08 13:24
I would argue that adding the BOM is a Powershell issue, and I'm not sure Python should do anything about it.
There are probably cases where people expect the BOM to be received by Python, so stripping it is probably not an option.
As for detecting, it should happen automatically only if sys.stdin.encoding is set to 'utf-8-sig', but, by default, Python 3 uses 'UTF-8'.
msg222683 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-10 18:23
I'm not sure what you're suggesting. Are you suggesting that Powershell is wrong here and that Powershell's attempt here to provide more detail about content encoding is wrong? Or are you suggesting that every client that reads from stdin should detect that it's running in Powershell or otherwise handle the BOM individually?

From my perspective, Powershell is innovating here and providing additional detail about the encoding of the content, but since Python is responsible for the content decoding (especially Python 3), it should honor that detail.

I did some tests and determined that 'utf-8-sig' will honor the bom if present and ignore it if missing. Is there any reason Python shouldn't simply use that encoding for decoding stdin?
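
A minimal sketch of the 'utf-8-sig' behavior described above. The sample bytes are made up for illustration and are not taken from the tests mentioned in this message:

```python
import codecs

# Sample inputs: the same line with and without a leading UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + b"foo\n"
without_bom = b"foo\n"

# Plain utf-8 keeps the BOM as U+FEFF at the start of the decoded text.
plain = with_bom.decode("utf-8")

# utf-8-sig drops one leading BOM if present and changes nothing otherwise.
sig_with_bom = with_bom.decode("utf-8-sig")
sig_without_bom = without_bom.decode("utf-8-sig")

print(ascii(plain), ascii(sig_with_bom), ascii(sig_without_bom))
```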
msg222740 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-11 13:01
I've tested it and setting PYTHONIOENCODING='utf-8-sig' starts to get there. It causes Python to consume the BOM on stdin, but it also causes stdout to print a spurious non-printable character in the output:

C:\Users\jaraco> echo foo | ./print-input
foo

There is a non-printable character before foo. I've included it in this message. In Powershell, it's rendered with a square before foo:

□foo

Using PowerShell under ConEmu, it appears as a space:

 foo

In cmd.exe, I see this:

C:\Users\jaraco>python -c "print('foo')"
foo


The space before the 'foo' apparently isn't a space at all.

Indeed, the input is being processed as desired, but the output now is not.

C:\Users\jaraco> python -c "print('bar')"
bar

(the non-printable character appears there too)

If I copy that text to the clipboard, I find that character is actually a \ufeff (zero-width no-break space, aka byte order mark). So by setting the environment variable to use utf-8-sig for input, it simultaneously changes the output to also use utf-8-sig. 

So it appears as if setting the environment variable would work for my purposes except that I only want to alter the input encoding and not the output encoding.
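
One possible workaround, sketched under the assumption that the script itself may be changed: re-wrap only the binary stdin stream with 'utf-8-sig' and leave stdout alone. The helper name text_stdin is hypothetical, and a BytesIO object stands in for the PowerShell pipe so the example is self-contained:

```python
import io

def text_stdin(binary_stdin, encoding="utf-8-sig"):
    """Hypothetical helper: decode a binary input stream, consuming a leading BOM."""
    return io.TextIOWrapper(binary_stdin, encoding=encoding)

# In a real script this would be: stdin = text_stdin(sys.stdin.buffer)
# Here a BytesIO plays the role of the pipe; the bytes are illustrative.
fake_pipe = io.BytesIO(b"\xef\xbb\xbffoo\r\n")
stdin = text_stdin(fake_pipe)

# Universal newlines are on by default, so "\r\n" is read as "\n".
first_line = next(stdin)
print(ascii(first_line))
```

Because sys.stdout is never touched, output keeps whatever encoding Python selected for it and no BOM is written.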

I think my goal is pretty basic - read text from standard input and write text to standard output on the primary shell included with the most popular operating system. I contend that goal should be easily achieved and straightforward on Python out of the box.

What does everyone think of the proposal that Python should simply default to utf-8-sig instead of utf-8 for stdin encoding?
msg222743 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-11 14:04
> The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001.

How do you change the console encoding? Using the chcp command?

I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?


I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and ANSI code page 1252:

- by default, the stdin encoding is cp850 (OEM code page) and os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.

If I change the console encoding using the command "chcp 65001":

- by default, the stdin encoding = os.device_encoding(0) = "cp65001".  sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) = "cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM

Note: The UTF-8 BOM is only written once, before the first character.

So the UTF-8 BOM is only written in one case under these conditions:

- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe, even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001
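
The encoding selection observed above can be sketched roughly as follows. This is a simplification that ignores PYTHONIOENCODING, and the function name expected_stdin_encoding is hypothetical:

```python
import locale
import os

def expected_stdin_encoding(fd=0):
    """Hypothetical helper mirroring the fallback described in this message."""
    # os.device_encoding() returns None when fd is not attached to a
    # terminal, for example when stdin is a pipe.
    encoding = os.device_encoding(fd)
    if encoding is None:
        # Fall back to the locale encoding (the ANSI code page on Windows).
        encoding = locale.getpreferredencoding(False)
    return encoding
```

Calling expected_stdin_encoding(0) in a console and again behind a pipe should show the two different results reported in the list above.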

--

It looks like PowerShell decodes the output of the producer program (echo, type, ...) and then encodes the output to the consumer program (ex: python).

It's possible to change the encoding of the encoder by setting the $OutputEncoding variable. Example to encode to UTF-8 without the BOM:

   $OutputEncoding = New-Object System.Text.UTF8Encoding($False)

Example to encode to UTF-8 with the BOM:

   $OutputEncoding = [System.Text.Encoding]::UTF8

Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even if the console encoding is cp850. If you set the console encoding to 65001 (chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get... two UTF-8 BOMs... yeah!
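
A small sketch of what the consumer sees in the double-BOM case. The payload bytes are made up for illustration; only the two leading BOMs correspond to the situation described above:

```python
import codecs

# Two UTF-8 BOMs followed by the payload.
data = codecs.BOM_UTF8 * 2 + b"abc\n"

# Decoded as cp1252, each BOM becomes three visible characters.
as_cp1252 = data.decode("cp1252")

# utf-8-sig strips only the first BOM; the second survives as U+FEFF.
as_utf8_sig = data.decode("utf-8-sig")

# Stripping every leading U+FEFF after decoding covers both cases.
cleaned = data.decode("utf-8").lstrip("\ufeff")

print(ascii(as_cp1252), ascii(as_utf8_sig), ascii(cleaned))
```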

I tried different producer programs: [MS-DOS] echo "abc", [PowerShell] write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content document.txt, python -c "print('abc')". It doesn't look like using a different program changes anything. The UTF-8 BOM is added somewhere by PowerShell between the producer and the consumer programs.

To show the console input and output encodings in PowerShell, type "[console]::InputEncoding" and "[console]::OutputEncoding".

See also:
http://stackoverflow.com/questions/22349139/utf8-output-from-powershell
msg222748 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-11 14:27
See also issues #1602 (Windows console) and #16587 (stdin, _setmode() and wprintf).

I tried msvcrt.setmode(0, 0x40000): set stdin mode to _O_U8TEXT. In this mode, echo "abc"|python -c "import sys; print(ascii(sys.stdin.read()))" displays "\xff\xfea\x00b\x00c\x00\n\x00" which is "abc" encoded to UTF-16 (little endian with the BOM),  b'\xff\xfe' is the Unicode BOM U+FEFF (u'\uFEFF') encoded to UTF-16-LE. U+FEFF encoded to UTF-8 gives b'\xef\xbb\xbf'.
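
The byte sequences quoted above can be checked directly in Python. This is only a verification of the encodings, not a reproduction of the msvcrt experiment:

```python
import codecs

bom = "\ufeff"

# U+FEFF encoded to UTF-16 little endian and to UTF-8.
utf16_bom = bom.encode("utf-16-le")
utf8_bom = bom.encode("utf-8")

# "abc\n" as UTF-16-LE with a leading BOM, matching the bytes shown above.
utf16_abc = codecs.BOM_UTF16_LE + "abc\n".encode("utf-16-le")

print(utf16_bom, utf8_bom, utf16_abc)
```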

So it looks like it's not an issue of the stdin mode. I tried all modes and I always get the Unicode BOM.
msg222761 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-11 16:15
I get different results than @haypo when testing Powershell on Windows 8.1 with Python 3.4.1:

C:\Users\jaraco> chcp 1252
Active code page: 1252
C:\Users\jaraco> $env:PYTHONIOENCODING=''
> How you do change the console encoding? Using the chcp command?

Yes. I recently discovered that if I use chcp 65001 in my Powershell profile, I can finally see Unicode characters output from my Python programs!

> I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?

C:\Users\jaraco> echo foo | python -c "import sys; print(sys.stdin.readline())"
foo

C:\Users\jaraco> python -c "import sys; print(sys.stdin.encoding)"
cp1252

C:\Users\jaraco> chcp 65001

C:\Users\jaraco> echo foo | python -c "import locale, os; print(os.device_encoding(0), locale.getpreferredencoding(False))"
None cp1252

It seems as if something may have changed in Powershell between Windows 7 and 8.1, because my results are inconsistent with your findings.

There's a lot more to digest from your response, so I'm going to have to revisit this later.
msg223129 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-15 17:40
I find it amusing that the complaint is that Python isn't detecting the BOM and using the info when powershell produces it, but when python produces the BOM, it is powershell that isn't detecting it and using the information.  So it looks like there's a bug here in powershell no matter how you look at it ;)
msg223157 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-15 23:12
I agree there appears to be an inconsistency in how Powershell handles pipes between child processes and between itself and child processes.

I'm not complaining about Python, but rather trying to find the best practice here.

I'm currently using PYTHONIOENCODING='utf-8-sig' and I've been mostly satisfied with the results. I get the spurious BOM appearing on output, but at least as you say that doesn't seem like a Python problem (at least that has been identified).
msg223175 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 06:07
> - when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.

What if you echo non-ASCII characters? How are they encoded?

Perhaps Python should detect when it is run under PowerShell in a pipe and set the stdin (and/or stdout and stderr) encoding to CP65001.
msg223189 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-07-16 11:43
Here I use the British pound symbol to attempt to answer that question. I've disabled the environment variable PYTHONIOENCODING and not set any code page or loaded any other Powershell profile settings.

PS C:\Users\jaraco> echo £
£
PS C:\Users\jaraco> chcp
Active code page: 437
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.read()))"
'?\n'
PS C:\Users\jaraco> chcp 65001
Active code page: 65001
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.read()))"
'?\n'
PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.buffer.read()))"
b'?\r\n'

Curiously, it appears as if powershell is actually receiving a question mark from the pipe.
msg223192 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-16 12:07
Please use ascii() instead of repr() in your test to identify who
replaces characters with question marks.
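
A short illustration of why ascii() is the safer choice here. With repr(), a non-ASCII character is printed as-is and may itself be replaced by the console on output; ascii() escapes it, so what was received stays unambiguous. The sample string is made up:

```python
# Pound sign followed by a newline, as it would look if stdin decoded it correctly.
text = "\u00a3\n"

# repr() leaves the pound sign in the result, so printing it depends on
# the output encoding.
as_repr = repr(text)

# ascii() escapes every non-ASCII character, so the result is pure ASCII.
as_ascii = ascii(text)

print(as_ascii)
```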
msg223194 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 12:30
Bytes repr doesn't contain non-ASCII characters, therefore Python is actually receiving a question mark from the pipe.

What are the results of the following commands?

py -3 -c "import sys; sys.stdout.buffer.write(bytes(range(128, 256)))"

py -3 -c "import sys; sys.stdout.buffer.write(bytes(range(128, 256)))" | py -3 -c "import sys; b = sys.stdin.buffer.read(); print(len(b), b)"
msg223238 - (view) Author: Eryk Sun (eryksun) * Date: 2014-07-16 17:43
> PS C:\Users\jaraco> echo £ | py -3 -c "import sys; print(repr(sys.stdin.buffer.read()))"
> b'?\r\n'

> Curiously, it appears as if powershell is actually receiving 
> a question mark from the pipe.

PowerShell calls ReadConsoleW to read the console input buffer, i.e. it reads "£" as a wide character from the command line. The default encoding when writing to the pipe should be ASCII [*]. If that's the case it explains the question mark that Python reads from stdin. It's the default replacement character (WC_DEFAULTCHAR) used by WideCharToMultiByte. 

[*] http://blogs.msdn.com/b/powershell/archive/2006/12/11/outputencoding-to-the-rescue.aspx
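
A rough Python analogy for the replacement described above. It uses Python's own 'replace' error handler rather than WideCharToMultiByte, so it only illustrates the effect, not the Windows code path:

```python
# The pound sign has no ASCII representation.
pound = "\u00a3"

# Encoding to ASCII with replacement yields a question mark, which matches
# the bytes Python read from the pipe in the earlier message.
piped = (pound + "\r\n").encode("ascii", errors="replace")

print(piped)
```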

You can change PowerShell's output encoding to match the console:

    $OutputEncoding = [Console]::OutputEncoding

If the console codepage is 65001, the above is equivalent to setting 

    $OutputEncoding = [System.Text.Encoding]::UTF8

http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8

As Victor mentioned, this setting always writes a BOM, and under codepage 65001 it actually writes 2 BOMs (at least in PowerShell 2). Victor also mentioned that you can avoid the BOM by passing $False to the constructor:

    $OutputEncoding = New-Object System.Text.UTF8Encoding($False)

http://msdn.microsoft.com/en-us/library/system.text.utf8encoding

There's still a BOM under codepage 65001, but maybe that's fixed in PowerShell 3. 

I avoid setting the console to codepage 65001 anyway. ReadFile/WriteFile incorrectly return the number of characters read/written instead of the number of bytes because the call is actually handled by ReadConsoleA/WriteConsoleA. Maybe that's finally fixed in Windows 8.
History
Date                 User              Action  Args
2014-07-16 17:43:08  eryksun           set     nosy: + eryksun; messages: + msg223238
2014-07-16 12:30:54  serhiy.storchaka  set     messages: + msg223194
2014-07-16 12:07:05  haypo             set     messages: + msg223192
2014-07-16 11:43:31  jason.coombs      set     messages: + msg223189
2014-07-16 06:07:10  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg223175
2014-07-15 23:12:26  jason.coombs      set     messages: + msg223157
2014-07-15 17:40:19  r.david.murray    set     nosy: + r.david.murray; messages: + msg223129
2014-07-11 16:15:06  jason.coombs      set     messages: + msg222761
2014-07-11 14:27:46  haypo             set     messages: + msg222748
2014-07-11 14:04:50  haypo             set     messages: + msg222743
2014-07-11 13:01:26  jason.coombs      set     messages: + msg222740
2014-07-10 18:23:09  jason.coombs      set     messages: + msg222683
2014-07-08 13:24:22  ezio.melotti      set     nosy: + lemburg, loewis; type: behavior; messages: + msg222559
2014-07-06 14:39:22  jason.coombs      create