Issue 28927: bytes.fromhex should ignore all whitespace

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73113

classification

Title:	bytes.fromhex should ignore all whitespace
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	martin.panter, ncoghlan, nneonneo, python-dev, serhiy.storchaka, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2016-12-10 00:07 by nneonneo, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fromhex.patch	nneonneo, 2016-12-10 00:07	patch to use isspace for fromhex	review
fromhex.patch	nneonneo, 2016-12-17 00:38
fromhex.patch	nneonneo, 2016-12-19 15:57	patch to skip all ASCII whitespace in bytes.fromhex	review
fromhex.patch	nneonneo, 2016-12-19 16:33	patch to skip all ASCII whitespace in bytes.fromhex	review

Pull Requests
URL	Status	Linked	Edit
PR 552	closed	dstufft, 2017-03-31 16:36

Messages (19)
msg282811 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-10 00:07
bytes.fromhex ignores space characters now (yay!) but still barfs if fed newlines or tabs: >>> bytes.fromhex('ab\ncd') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: non-hexadecimal number found in fromhex() arg at position 2 >>> bytes.fromhex('ab\tcd') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: non-hexadecimal number found in fromhex() arg at position 2 It's often quite useful to paste blobs of hex into source code (or the REPL) and call ".fromhex" on them. These might include spaces, tabs and/or newlines, and barfing on these other whitespace characters is inconvenient. I propose that bytes.fromhex should ignore all whitespace. A patch + test is attached.
msg282816 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-12-10 00:53
Seems a reasonable feature. The documentation would also need updating. Which specific (whitespace) characters do you propose to ignore? Just ASCII ones, as in bytes.isspace(), or others like b"\xA0" (non-breaking space) and U+2028 (line separator), as in str.isspace()? Perhaps you should add test cases to clarify.
msg282817 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-10 00:54
I used Py_ISSPACE, which uses the .strip() default charset - I think this is a reasonable choice. We don't have to go crazy and support all the Unicode spaces.
msg283449 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-12-16 23:27
I am a bit dubious about this. There is a tradeoff here between convenience and bug detection. The patch is not strictly necessary. >>> bytes.fromhex('ab\ncd'.replace('\n', '')) b'\xab\xcd' Bytes (and bytearray) .fromhex already ignores spaces. Not ignoring newlines, by default, may have been intentional. Nick, you wrote the doc entry for 'fromhex'. Do you know anything about the design intention?
msg283456 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-17 00:38
Terry, can you elaborate what you mean by a tradeoff? I feel like such a patch makes .fromhex more consistent with other string methods like .split() and .strip() which implicitly consider all whitespace equivalent. Martin, I've updated the patch to include documentation updates and more testcases.
msg283457 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-17 00:39
Sorry, I should have clarified that these methods consider ASCII whitespace equivalent - just like my proposed patch.
msg283492 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-17 15:19
My recollection is that fromhex() ignores spaces to account for particularly common ways of formatting hex numbers as space separated groups: "CAFE F00D" "CAFEF00D CAFEF00D CAFEF00D" "CA FE F0 0D" etc Those show up even in structured hexadecimal data (like hex editor output and memory dumps in log files). That recollection is supported by the example and specification of specifically "[0-9a-fA-F ]" in PEP 358 (https://www.python.org/dev/peps/pep-0358/) that guided Georg Brandl's initial implementation in http://bugs.python.org/issue1669379 Generally speaking, the type level string parsers aren't permissive when it comes to their whitespace handling - if they allow whitespace at all, it's usually only at the beginning or end, where it gets ignored (hence the PEP for 3.6 to allow underscores in both numeric literals and in the numeric constructors). ===================== >>> float(" 1.0") 1.0 >>> float("1.0 ") 1.0 >>> float("1 0") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: '1 0' >>> float("1. 0") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: '1. 0' >>> float("1 .0") Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: could not convert string to float: '1 .0' ===================== The general technique to strip whitespace from any input data before handing it to a constructor relies on str.translate: ===================== >>> import string >>> _del_whitespace = str.maketrans('', '', string.whitespace) >>> def clean_whitespace(text): ... return text.translate(_del_whitespace) ... >>> clean_whitespace('CA FE\nF0\t0D') 'CAFEF00D' ===================== (http://stackoverflow.com/questions/3739909/how-to-strip-all-whitespace-from-string also points out `"".join(text.split())` as a clever one liner for a similar outcome) So I'm inclined to advise against making any changes here - the apparent benefit is more a symptom of the fact that there isn't an immediately obvious spelling of "strip all whitespace, including that between other characters, from this string" that can be readily applied as an initial filter on the incoming data. (However, I do sometimes wonder if it would make sense to offer a "str.stripall()" that defaulted to removing all whitespace, rather than treating this purely as a particular use case for str.translate)
msg283504 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-17 18:24
I see your point, Nick. Can I offer a counterpoint? Most of the string parsers operate only on relatively short inputs, like numbers. Numbers in particular are rarely written with inner spaces, so it makes sense not to ignore internal whitespaces. On the other hand, hexadecimal data can be very long, and is often formatted with spaces and newlines. For example, the default output of `xxd -p`, a format quite suitable for copy-paste, looks like this: cffaedfe07000001030000800200000015000000d8080000858021000000 000019000000480000005f5f504147455a45524f00000000000000000000 000000000000000001000000000000000000000000000000000000000000 000000000000000000000000000019000000180300005f5f544558540000 0000000000000000000000000100000000909d0100000000000000000000 It would be desirable to write something like blob = bytes.fromhex(''' cffaedfe07000001030000800200000015000000d8080000858021000000 000019000000480000005f5f504147455a45524f00000000000000000000 000000000000000001000000000000000000000000000000000000000000 000000000000000000000000000019000000180300005f5f544558540000 0000000000000000000000000100000000909d0100000000000000000000 ''') and not have to worry about sticking in some whitespace remover, like this: blob = bytes.fromhex(''.join(''' cffaedfe07000001030000800200000015000000d8080000858021000000 000019000000480000005f5f504147455a45524f00000000000000000000 000000000000000001000000000000000000000000000000000000000000 000000000000000000000000000019000000180300005f5f544558540000 0000000000000000000000000100000000909d0100000000000000000000 '''.split())) or removing the newlines in the source code, which impacts readability. Similar kinds of whitespaced output (sometimes with spaces between octets, words or dwords, sometimes with tabs between 8-16 byte groups, sometimes with newlines between groups, etc.) can be found in the wild and from the "hex" clipboard output from various applications. We can already have newlines and other whitespace with base64, which is in principle quite similar: blob = base64.b64decode(''' z/rt/gcAAAEDAACAAgAAABUAAADYCAAAhYAhAAAAAAAZAAAASAAAAF9fUEFHRVpFUk8AAAAAAAAAAAAA AAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZAAAAGAMAAF9fVEVYVAAA AAAAAAAAAAAAAAAAAQAAAACQnQEAAAAAAAAAAAAAAAAAkJ0BAAAAAAcAAAAFAAAACQAAAAAAAABfX3Rl eHQAAAAAAAAAAAAAX19URVhUAAAAAAAAAAAAAAALAAABAAAARCF5AQAAAAAACwAACAAAAAAAAAAAAAAA ''') so I think it makes sense to support other whitespaces in fromhex. I'm happy to reconsider if there's a strong argument against adding this convenience.
msg283542 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-18 07:07
Ah, the parallel with base64 decoding and embedding encoded data in multi-line string literals is indeed a compelling one - I'd missed that. Given that rationale, +1 from me. Perhaps it would make sense to call that out directly in the documentation? Something like a second example saying: ==== Ignoring all ASCII whitespace provides compatibility with common hexdump formats (like the output of ``xxd``), allowing such data to easily be read from a file or included directly in source code as a multiline string literal:: >>> bytes.fromhex(""" ... F0F1F2F3F4 ... FAFBFCFDFE """) b'\xf0\xf1\xf2\xf4\xfa\xfb\xfc\xfd' ==== And then a versionchanged note for 3.7 to say that this was switched from filtering out specifically space to filtering out any ASCII whitespace. My other question would be whether or not a separate issue should be filed to update the bytes-to-bytes "hex" codec to be consistent with this change - at the moment, it doesn't allow whitespace at all (not even ASCII spaces), while the base64 decoder is consistent with base64.b64decode and allows it.
msg283545 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-12-18 08:44
Is it worth to ignore also non-ASCII whitespaces for compatibility with string-to-number convertions and base64 decoder? >>> float('\xa01.0') 1.0 >>> base64.decodebytes(b'\xa0YWJj') b'abc' Note that not all spaces are ignored. They shouldn't break hexadecimal pairs. >>> bytes.fromhex('a b') Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: non-hexadecimal number found in fromhex() arg at position 1
msg283549 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-12-18 11:03
As far as I know, non-ASCII newlines and whitespace are not supported in Python source code, so there is not a big need to support it in bytes.fromhex() either. But since bytes.fromhex() accepts Unicode strings, I think non-ASCII whitespace would be okay if it was easy to implement. Serhiy: Whitespace is not treated specially by the base-64 decoders, it is just treated like any non-alphabet character. See <https://docs.python.org/3/library/base64.html#base64.b64decode> and b64decode(validate=True). Regarding hex-codec, I doubt many people use it in Python 3. To decode a whole string, binascii.unhexlify() or bytes.fromhex() is probably more obvious, and I think the incremental decoder never worked properly (Issue 20132).
msg283551 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-12-18 11:42
No, it is not easy to implement. Currently bytes.fromhex() works only with ASCII strings and can just iterate over char*. If add support of non-ASCII whitespaces, we should add a support of non-ASCII digits, as in float(). This looks excessive. In case if there is a case for this, we rather would expose _PyUnicode_TransformDecimalAndSpaceToASCII() in Python level. In general fromhex.patch LGTM, but please add versionchanged directives and document the change in What's New.
msg283571 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-19 00:31
+1 to the above Unicode whitespace discussion - the "ASCII space -> ASCII whitespace" change is relatively straightforward to implement, and clearly beneficial given Robert's point regarding the popularity of multi-line terminal-oriented hexdump formats. Neither of those points apply to Unicode whitespace (it would be much harder to implement, and we don't have a demonstrated use case for it), so it makes sense to avoid going that far.
msg283627 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-19 15:57
OK, I've attached a new version of the patch with the requested documentation changes (versionchanged and whatsnew).
msg283631 - (view)	Author: Robert Xiao (nneonneo) *	Date: 2016-12-19 16:33
New patch with proper line lengths in documentation.
msg283632 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-12-19 16:37
LGTM.
msg283633 - (view)	Author: Roundup Robot (python-dev)	Date: 2016-12-19 16:52
New changeset fcc09d9ee7d4 by Serhiy Storchaka in branch 'default': Issue #28927: bytes.fromhex() and bytearray.fromhex() now ignore all ASCII https://hg.python.org/cpython/rev/fcc09d9ee7d4
msg283634 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-12-19 16:53
Thank you for your contribution Robert.
msg283639 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-12-19 18:13
I think non-ASCII whitespace and digits are YAGNI until we are convinced otherwise by evidence from the field that people are routinely mixing other decimal digits with 'abcdef' as hex numerals. Anyone who does try such a thing can write a wrapper that first translates to ascii digits.

History
Date	User	Action	Args
2022-04-11 14:58:40	admin	set	github: 73113
2017-03-31 16:36:25	dstufft	set	pull_requests: + pull_request994
2016-12-19 18:13:33	terry.reedy	set	messages: + msg283639
2016-12-19 16:53:57	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg283634 stage: commit review -> resolved
2016-12-19 16:52:12	python-dev	set	nosy: + python-dev messages: + msg283633
2016-12-19 16:37:46	serhiy.storchaka	set	assignee: serhiy.storchaka messages: + msg283632 components: + Interpreter Core stage: patch review -> commit review
2016-12-19 16:33:26	nneonneo	set	files: + fromhex.patch messages: + msg283631
2016-12-19 15:57:23	nneonneo	set	files: + fromhex.patch messages: + msg283627
2016-12-19 00:31:57	ncoghlan	set	messages: + msg283571
2016-12-18 11:42:06	serhiy.storchaka	set	messages: + msg283551
2016-12-18 11:03:02	martin.panter	set	messages: + msg283549
2016-12-18 08:44:32	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg283545
2016-12-18 07:07:08	ncoghlan	set	messages: + msg283542
2016-12-17 18:24:09	nneonneo	set	messages: + msg283504
2016-12-17 15:19:53	ncoghlan	set	messages: + msg283492
2016-12-17 00:39:17	nneonneo	set	messages: + msg283457
2016-12-17 00:38:25	nneonneo	set	files: + fromhex.patch messages: + msg283456
2016-12-16 23:27:38	terry.reedy	set	nosy: + terry.reedy, ncoghlan messages: + msg283449
2016-12-10 00:54:39	nneonneo	set	messages: + msg282817
2016-12-10 00:53:04	martin.panter	set	versions: + Python 3.7, - Python 3.6 nosy: + martin.panter messages: + msg282816 stage: patch review
2016-12-10 00:07:26	nneonneo	create