This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: tokenizer permits invalid hex integer
Type: compile error
Stage:
Components: Interpreter Core
Versions: Python 2.6

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: gvanrossum
Nosy List: MartinRinehart, georg.brandl, gvanrossum, maltehelmert
Priority: normal
Keywords: easy

Created on 2007-12-21 12:35 by MartinRinehart, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
PATCH-1.diff maltehelmert, 2008-01-19 16:58 patch for Python's tokenizer (case 1)
PATCH-3.diff maltehelmert, 2008-01-19 17:00 patch for tokenize.py (case 3)
PATCH-2a.diff maltehelmert, 2008-01-19 17:33
PATCH-2b.diff maltehelmert, 2008-01-19 17:49 patch for builtin int (case 2; cleaner code than PATCH-2a.diff)
PATCH-TESTS.diff maltehelmert, 2008-01-19 18:52 patch to unit tests (grammar, builtin, tokenize)
unnamed MartinRinehart, 2008-01-22 20:55
Messages (9)
msg58943 - Author: Martin Rinehart (MartinRinehart) Date: 2007-12-21 12:35
The tokenizer accepts '0x' as an integer zero. The documentation says:
   hexinteger ::= "0" ("x" | "X") hexdigit+

I stumbled on this while testing a tokenizer I wrote in Python for another
language. I expected an error from "int('0x', 16)", but didn't get one.
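
For anyone who wants to check candidate strings against the documented rule ("0", then "x" or "X", then one or more hex digits), here is a minimal sketch. The names HEXINTEGER and is_hexinteger are made up for illustration and are not part of Python's tokenizer; the sketch also ignores the optional long suffix ("l" or "L") that Python 2 allows on long integer literals.

```python
import re

# Hypothetical helper, not part of CPython: encodes the documented rule
#   hexinteger ::= "0" ("x" | "X") hexdigit+
# The "+" requires at least one hex digit, so a bare "0x" must not match.
HEXINTEGER = re.compile(r"0[xX][0-9a-fA-F]+\Z")


def is_hexinteger(text):
    """Return True if text is a complete hex integer literal per the rule above."""
    return HEXINTEGER.match(text) is not None


if __name__ == "__main__":
    for candidate in ("0x1f", "0XFF", "0x", "0X", "0xg"):
        print(candidate, is_hexinteger(candidate))
```

Under this rule "0x" is rejected because no hex digit follows the prefix, which is the behaviour the report expects from the interpreter.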
msg60196 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 16:56
I can find three places where "0x" is accepted, but probably shouldn't:

1. Python's tokenizer:
>>> 0x
0
>>> 0xL
ValueError: invalid literal for long() with base 16: '0xL'
=> I think these should both be syntax errors.

2. int builtin:
>>> int("0x", 0) == int("0x", 16) == 0
True
>>> long("0x", 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for long() with base 16: '0x'
>>> long("0x", 16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for long()

=> The long behaviour looks right to me, and I think the int behaviour
should match it.

3. tokenize module:
This currently accepts "0x" and "0xL" as single tokens. The obvious fix
would lead to these two being reported as two separate tokens ("0":
NUMBER, "x": NAME; "0": NUMBER, "xL": NAME), as it currently does for
other cases where a name follows a number like "23cats". However, this
is not quite what Python's parser does, which returns an error token
instead. (Fortunately, name after number appears to be a syntax error
everywhere, so it doesn't really affect the behaviour; a syntax error
occurs either way.)
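
To illustrate the splitting described in case 3, here is a rough sketch of a regex-based scanner. It is not the code in tokenize.py: the TOKEN pattern and the scan function are simplified, hypothetical stand-ins that only know about decimal and hex integers (with an optional long suffix) and names, and they do not handle whitespace.

```python
import re

# Hypothetical, heavily simplified patterns; the real tokenize module
# covers far more of the grammar.
TOKEN = re.compile(
    r"""
    (?P<NUMBER>0[xX][0-9a-fA-F]+[lL]?|\d+[lL]?)   # hex needs >= 1 digit
  | (?P<NAME>[A-Za-z_]\w*)
    """,
    re.VERBOSE,
)


def scan(text):
    """Split text into (kind, string) pairs using the patterns above."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for source in ("0x", "0xL", "23cats", "0x1F"):
        print(source, scan(source))
```

Because the hex alternative requires at least one hex digit, "0x" falls back to the decimal alternative and yields a NUMBER "0" followed by a NAME "x", the same way "23cats" splits into "23" and "cats".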
msg60197 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 16:58
Here's a patch that fixes case 1:

>>> 0x
  File "<stdin>", line 1
    0x
     ^
SyntaxError: invalid token
>>> 0xL
  File "<stdin>", line 1
    0xL
     ^
SyntaxError: invalid token
msg60198 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 17:00
And here's a patch that fixes case 3.
msg60201 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 17:33
And here's a patch for case 2 (int) conversion. There is still a slight
inconsistency in error reporting (base 0 vs. base 16) between int and
long, but I'd see this as long's fault:

>>> int("0x", 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 0: '0x'
>>> int("0x", 16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 16: '0x'
>>> long("0x", 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for long() with base 16: '0x'
>>> long("0x", 16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for long() with base 16: '0x'

The patch is not pretty because it duplicates a lot of code, but it's
probably easier to see what was changed that way. I'll add a prettier
patch soon.
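
A small way to check the int behaviour from a script, sketched under the assumption of a Python 3 interpreter (where long no longer exists as a separate type). The helper name literal_error is made up for illustration and is not part of any patch here.

```python
def literal_error(text, base):
    """Return the ValueError message int() raises, or None if parsing succeeds."""
    try:
        int(text, base)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # A bare prefix should be rejected for both base 0 and base 16,
    # while a prefix followed by hex digits should still parse.
    for text, base in (("0x", 0), ("0x", 16), ("0x1f", 0), ("0x1f", 16)):
        print(repr(text), base, literal_error(text, base))
```

The sketch only reports whether an error occurs and what its message is; it makes no claim about the exact wording, which is where int and long differed above.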
msg60204 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 17:49
This is a cleaner version of PATCH-2a.diff in the sense that the
resulting code contains less duplication. The disadvantage is that it
applies more structural changes to PyOS_strtoul, so may be harder to
merge with other changes.
msg60211 - Author: Malte Helmert (maltehelmert) Date: 2008-01-19 18:52
Added tests to test_grammar, test_builtin and test_tokenize.
msg60218 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-19 19:28
Committed patches 1, 2a and 3, and test suite updates, in r60092. This
won't be backported to 2.5, and no doc changes are necessary. Thanks for
your work!
msg61538 - Author: Martin Rinehart (MartinRinehart) Date: 2008-01-22 20:55
re 0x == 0

Thanks!
History
Date User Action Args
2022-04-11 14:56:29 admin set github: 46020
2008-01-22 20:55:53 MartinRinehart set files: + unnamed; messages: + msg61538
2008-01-19 19:29:47 georg.brandl set status: open -> closed; resolution: fixed
2008-01-19 19:28:31 georg.brandl set nosy: + georg.brandl; messages: + msg60218
2008-01-19 18:52:18 maltehelmert set files: + PATCH-TESTS.diff; messages: + msg60211
2008-01-19 17:49:44 maltehelmert set files: + PATCH-2b.diff; messages: + msg60204
2008-01-19 17:33:48 maltehelmert set files: + PATCH-2a.diff; messages: + msg60201
2008-01-19 17:00:12 maltehelmert set files: + PATCH-3.diff; messages: + msg60198
2008-01-19 16:58:52 maltehelmert set files: + PATCH-1.diff; messages: + msg60197
2008-01-19 16:56:05 maltehelmert set nosy: + maltehelmert; messages: + msg60196
2008-01-12 01:36:47 akuchling set keywords: + easy
2007-12-21 17:02:42 gvanrossum set priority: normal; assignee: gvanrossum; nosy: + gvanrossum
2007-12-21 12:35:46 MartinRinehart create