classification
Title: bytes.fromhex should ignore all whitespace
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: martin.panter, ncoghlan, nneonneo, python-dev, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2016-12-10 00:07 by nneonneo, last changed 2017-03-31 16:36 by dstufft. This issue is now closed.

Files
File name Uploaded Description Edit
fromhex.patch nneonneo, 2016-12-10 00:07 patch to use isspace for fromhex review
fromhex.patch nneonneo, 2016-12-17 00:38
fromhex.patch nneonneo, 2016-12-19 15:57 patch to skip all ASCII whitespace in bytes.fromhex review
fromhex.patch nneonneo, 2016-12-19 16:33 patch to skip all ASCII whitespace in bytes.fromhex review
Pull Requests
URL Status Linked Edit
PR 552 closed dstufft, 2017-03-31 16:36
Messages (19)
msg282811 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-10 00:07
bytes.fromhex ignores space characters now (yay!) but still barfs if fed newlines or tabs:

>>> bytes.fromhex('ab\ncd')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: non-hexadecimal number found in fromhex() arg at position 2
>>> bytes.fromhex('ab\tcd')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: non-hexadecimal number found in fromhex() arg at position 2

It's often quite useful to paste blobs of hex into source code (or the REPL) and call ".fromhex" on them. These might include spaces, tabs and/or newlines, and barfing on these other whitespace characters is inconvenient.

I propose that bytes.fromhex should ignore all whitespace. A patch + test is attached.
msg282816 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-10 00:53
Seems a reasonable feature. The documentation would also need updating.

Which specific (whitespace) characters do you propose to ignore? Just ASCII ones, as in bytes.isspace(), or others like b"\xA0" (non-breaking space) and U+2028 (line separator), as in str.isspace()? Perhaps you should add test cases to clarify.
msg282817 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-10 00:54
I used Py_ISSPACE, which uses the .strip() default charset - I think this is a reasonable choice. We don't have to go crazy and support all the Unicode spaces.
msg283449 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-12-16 23:27
I am a bit dubious about this.  There is a tradeoff here between convenience and bug detection.  The patch is not strictly necessary.

>>> bytes.fromhex('ab\ncd'.replace('\n', ''))
b'\xab\xcd'

Bytes (and bytearray) .fromhex already ignores spaces.  Not ignoring newlines, by default, may have been intentional.  Nick, you wrote the doc entry for 'fromhex'.  Do you know anything about the design intention?
msg283456 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-17 00:38
Terry, can you elaborate what you mean by a tradeoff? I feel like such a patch makes .fromhex more consistent with other string methods like .split() and .strip() which implicitly consider all whitespace equivalent.

Martin, I've updated the patch to include documentation updates and more testcases.
msg283457 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-17 00:39
Sorry, I should have clarified that these methods consider *ASCII whitespace* equivalent - just like my proposed patch.
msg283492 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-17 15:19
My recollection is that fromhex() ignores spaces to account for particularly common ways of formatting hex numbers as space separated groups:

    "CAFE F00D"
    "CAFEF00D CAFEF00D CAFEF00D"
    "CA FE F0 0D"
    etc

Those show up even in structured hexadecimal data (like hex editor output and memory dumps in log files).

That recollection is supported by the example and specification of specifically "[0-9a-fA-F ]" in PEP 358 (https://www.python.org/dev/peps/pep-0358/) that guided Georg Brandl's initial implementation in http://bugs.python.org/issue1669379

Generally speaking, the type level string parsers *aren't* permissive when it comes to their whitespace handling - if they allow whitespace at all, it's usually only at the beginning or end, where it gets ignored (hence the PEP for 3.6 to allow underscores in both numeric literals and in the numeric constructors).

=====================
>>> float(" 1.0")
1.0
>>> float("1.0 ")
1.0
>>> float("1 0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '1 0'
>>> float("1. 0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '1. 0'
>>> float("1 .0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '1 .0'
=====================

The general technique to strip whitespace from *any* input data before handing it to a constructor relies on str.translate:

=====================
>>> import string
>>> _del_whitespace = str.maketrans('', '', string.whitespace)
>>> def clean_whitespace(text):
...     return text.translate(_del_whitespace)
... 
>>> clean_whitespace('CA FE\nF0\t0D')
'CAFEF00D'
=====================

(http://stackoverflow.com/questions/3739909/how-to-strip-all-whitespace-from-string also points out `"".join(text.split())` as a clever one liner for a similar outcome)

So I'm inclined to advise *against* making any changes here - the apparent benefit is more a symptom of the fact that there isn't an immediately obvious spelling of "strip all whitespace, including that between other characters, from this string" that can be readily applied as an initial filter on the incoming data.

(However, I do sometimes wonder if it would make sense to offer a "str.stripall()" that defaulted to removing all whitespace, rather than treating this purely as a particular use case for str.translate)
msg283504 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-17 18:24
I see your point, Nick. Can I offer a counterpoint?

Most of the string parsers operate only on relatively short inputs, like numbers. Numbers in particular are rarely written with inner spaces, so it makes sense not to ignore internal whitespaces.

On the other hand, hexadecimal data can be very long, and is often formatted with spaces and newlines. For example, the default output of `xxd -p`, a format quite suitable for copy-paste, looks like this:

cffaedfe07000001030000800200000015000000d8080000858021000000
000019000000480000005f5f504147455a45524f00000000000000000000
000000000000000001000000000000000000000000000000000000000000
000000000000000000000000000019000000180300005f5f544558540000
0000000000000000000000000100000000909d0100000000000000000000

It would be desirable to write something like

blob = bytes.fromhex('''
cffaedfe07000001030000800200000015000000d8080000858021000000
000019000000480000005f5f504147455a45524f00000000000000000000
000000000000000001000000000000000000000000000000000000000000
000000000000000000000000000019000000180300005f5f544558540000
0000000000000000000000000100000000909d0100000000000000000000
''')

and not have to worry about sticking in some whitespace remover, like this:

blob = bytes.fromhex(''.join('''
cffaedfe07000001030000800200000015000000d8080000858021000000
000019000000480000005f5f504147455a45524f00000000000000000000
000000000000000001000000000000000000000000000000000000000000
000000000000000000000000000019000000180300005f5f544558540000
0000000000000000000000000100000000909d0100000000000000000000
'''.split()))

or removing the newlines in the source code, which impacts readability. 

Similar kinds of whitespaced output (sometimes with spaces between octets, words or dwords, sometimes with tabs between 8-16 byte groups, sometimes with newlines between groups, etc.) can be found in the wild and from the "hex" clipboard output from various applications.

We can already have newlines and other whitespace with base64, which is in principle quite similar:

blob = base64.b64decode('''
z/rt/gcAAAEDAACAAgAAABUAAADYCAAAhYAhAAAAAAAZAAAASAAAAF9fUEFHRVpFUk8AAAAAAAAAAAAA
AAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZAAAAGAMAAF9fVEVYVAAA
AAAAAAAAAAAAAAAAAQAAAACQnQEAAAAAAAAAAAAAAAAAkJ0BAAAAAAcAAAAFAAAACQAAAAAAAABfX3Rl
eHQAAAAAAAAAAAAAX19URVhUAAAAAAAAAAAAAAALAAABAAAARCF5AQAAAAAACwAACAAAAAAAAAAAAAAA
''')

so I think it makes sense to support other whitespaces in fromhex. I'm happy to reconsider if there's a strong argument against adding this convenience.
msg283542 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-18 07:07
Ah, the parallel with base64 decoding and embedding encoded data in multi-line string literals is indeed a compelling one - I'd missed that.

Given that rationale, +1 from me.

Perhaps it would make sense to call that out directly in the documentation? Something like a second example saying:

====
Ignoring all ASCII whitespace provides compatibility with common hexdump formats (like the output of ``xxd``), allowing such data to easily be read from a file or included directly in source code as a multiline string literal::

    >>> bytes.fromhex("""
    ... F0F1F2F3F4
    ... FAFBFCFDFE
    """)
    b'\xf0\xf1\xf2\xf4\xfa\xfb\xfc\xfd'

====

And then a versionchanged note for 3.7 to say that this was switched from filtering out specifically space to filtering out any ASCII whitespace.

My other question would be whether or not a separate issue should be filed to update the bytes-to-bytes "hex" codec to be consistent with this change - at the moment, it doesn't allow whitespace *at all* (not even ASCII spaces), while the base64 decoder is consistent with base64.b64decode and allows it.
msg283545 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-18 08:44
Is it worth to ignore also non-ASCII whitespaces for compatibility with string-to-number convertions and base64 decoder?

>>> float('\xa01.0')
1.0
>>> base64.decodebytes(b'\xa0YWJj')
b'abc'

Note that not all spaces are ignored. They shouldn't break hexadecimal pairs.

>>> bytes.fromhex('a b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: non-hexadecimal number found in fromhex() arg at position 1
msg283549 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-18 11:03
As far as I know, non-ASCII newlines and whitespace are not supported in Python source code, so there is not a big need to support it in bytes.fromhex() either. But since bytes.fromhex() accepts Unicode strings, I think non-ASCII whitespace would be okay if it was easy to implement.

Serhiy: Whitespace is not treated specially by the base-64 decoders, it is just treated like any non-alphabet character. See <https://docs.python.org/3/library/base64.html#base64.b64decode> and b64decode(validate=True).

Regarding hex-codec, I doubt many people use it in Python 3. To decode a whole string, binascii.unhexlify() or bytes.fromhex() is probably more obvious, and I think the incremental decoder never worked properly (Issue 20132).
msg283551 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-18 11:42
No, it is not easy to implement. Currently bytes.fromhex() works only with ASCII strings and can just iterate over char*.

If add support of non-ASCII whitespaces, we should add a support of non-ASCII digits, as in float(). This looks excessive.

In case if there is a case for this, we rather would expose _PyUnicode_TransformDecimalAndSpaceToASCII() in Python level.

In general fromhex.patch LGTM, but please add versionchanged directives and document the change in What's New.
msg283571 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-19 00:31
+1 to the above Unicode whitespace discussion - the "ASCII space -> ASCII whitespace" change is relatively straightforward to implement, and clearly beneficial given Robert's point regarding the popularity of multi-line terminal-oriented hexdump formats.

Neither of those points apply to Unicode whitespace (it would be much harder to implement, and we don't have a demonstrated use case for it), so it makes sense to avoid going that far.
msg283627 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-19 15:57
OK, I've attached a new version of the patch with the requested documentation changes (versionchanged and whatsnew).
msg283631 - (view) Author: Robert Xiao (nneonneo) * Date: 2016-12-19 16:33
New patch with proper line lengths in documentation.
msg283632 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-19 16:37
LGTM.
msg283633 - (view) Author: Roundup Robot (python-dev) Date: 2016-12-19 16:52
New changeset fcc09d9ee7d4 by Serhiy Storchaka in branch 'default':
Issue #28927: bytes.fromhex() and bytearray.fromhex() now ignore all ASCII
https://hg.python.org/cpython/rev/fcc09d9ee7d4
msg283634 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-19 16:53
Thank you for your contribution Robert.
msg283639 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-12-19 18:13
I think non-ASCII whitespace and digits are YAGNI until we are convinced otherwise by evidence from the field that people are routinely mixing other decimal digits with 'abcdef' as hex numerals.  Anyone who does try such a thing can write a wrapper that first translates to ascii digits.
History
Date User Action Args
2017-03-31 16:36:25dstufftsetpull_requests: + pull_request994
2016-12-19 18:13:33terry.reedysetmessages: + msg283639
2016-12-19 16:53:57serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg283634

stage: commit review -> resolved
2016-12-19 16:52:12python-devsetnosy: + python-dev
messages: + msg283633
2016-12-19 16:37:46serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg283632
components: + Interpreter Core
stage: patch review -> commit review
2016-12-19 16:33:26nneonneosetfiles: + fromhex.patch

messages: + msg283631
2016-12-19 15:57:23nneonneosetfiles: + fromhex.patch

messages: + msg283627
2016-12-19 00:31:57ncoghlansetmessages: + msg283571
2016-12-18 11:42:06serhiy.storchakasetmessages: + msg283551
2016-12-18 11:03:02martin.pantersetmessages: + msg283549
2016-12-18 08:44:32serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg283545
2016-12-18 07:07:08ncoghlansetmessages: + msg283542
2016-12-17 18:24:09nneonneosetmessages: + msg283504
2016-12-17 15:19:53ncoghlansetmessages: + msg283492
2016-12-17 00:39:17nneonneosetmessages: + msg283457
2016-12-17 00:38:25nneonneosetfiles: + fromhex.patch

messages: + msg283456
2016-12-16 23:27:38terry.reedysetnosy: + terry.reedy, ncoghlan
messages: + msg283449
2016-12-10 00:54:39nneonneosetmessages: + msg282817
2016-12-10 00:53:04martin.pantersetversions: + Python 3.7, - Python 3.6
nosy: + martin.panter

messages: + msg282816

stage: patch review
2016-12-10 00:07:26nneonneocreate