Message 148417 - Python tracker



Author	robodan
Recipients	eric.araujo, eric.smith, niemeyer, robodan
Date	2011-11-26.17:46:36
Message-id	<CAEcGQ+OO9c7Cg--OD-35zCGXWRjnkQ+q=EHQWNcd5nnR7UGUrg@mail.gmail.com>
In-reply-to	<1322321101.44.0.549991421257.issue1521950@psf.upfronthosting.co.za>

Content

> Sure :)  That’s why I suggest using dash for quick tests and rely on the work of other people who did read the POSIX spec.  I’ll have to check it too before committing a patch.

The point of ref_shlex.py is that all shells act the same for common
cases and shlex doesn't match any of them.  The only real split is
that csh-based shells do some things differently than sh-based shells
('2>' vs '&>').

>> shlex uses a series of character strings to drive its parsing:  whitespace, escape, quotes.
>> Add another one: control = '();<>|&'.  If it is unset (by default?), then the behavior is as
>> before.
> So we would need to add a Shlex subclass to the module to provide the new behavior.  I think I prefer a new argument, because we can just extend the existing class and functions instead of adding subtly differing duplicates.

You don't have to do a subclass (although that might have some
advantages).  You could do something like:
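class shlex:
    # 'control' is the proposed new argument; the others already exist
    def __init__(self, instream=None, infile=None, posix=False, control=False):
        ...
        if control is True:
            self.control = '();<>|&'  # default control set
        elif control:
            self.control = control    # let user specify their own control set
        else:
            self.control = ''         # unset: behavior is as before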

>> If it is set, then shlex will output any character in control as a separate token.
> Unless it is part of a quoted segment, right?  (See #7611 for 'foo#bar' vs. 'foo #bar').

Correct, quotes wouldn't change.
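
To make the proposal concrete, here is a minimal standalone sketch of the splitting rule discussed above: every character in the control set becomes its own token unless it sits inside quotes. The helper name `split_with_control` and its parameters are hypothetical and not part of shlex; escapes and comments are deliberately left out to keep it short.

```python
def split_with_control(s, control='();<>|&', quotes='\'"', whitespace=' \t\r\n'):
    """Hypothetical helper (not a shlex API): split on whitespace and emit
    each control character as a separate token unless it is quoted."""
    tokens = []
    current = []      # characters of the token being built
    in_token = False  # True once a token has started (allows empty '' tokens)
    quote = None      # the active quote character, if any
    for ch in s:
        if quote:
            if ch == quote:
                quote = None          # closing quote: leave quoted state
            else:
                current.append(ch)    # control chars are literal inside quotes
        elif ch in quotes:
            quote = ch
            in_token = True
        elif ch in whitespace or ch in control:
            if in_token:              # finish the word in progress
                tokens.append(''.join(current))
                current, in_token = [], False
            if ch in control:         # each control char is its own token
                tokens.append(ch)
        else:
            current.append(ch)
            in_token = True
    if quote:
        raise ValueError('No closing quotation')
    if in_token:
        tokens.append(''.join(current))
    return tokens


print(split_with_control("a&&b 'c|d' >out"))
# ['a', '&', '&', 'b', 'c|d', '>', 'out']
```

Note that '&&' comes out as two '&' tokens here; recombining them is left to a later step.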

>> There might be a shell specific script (or maybe it's left to the user)
>> that decides that certain tokens can be recombined:
> Seems like too much complexity.  I really prefer if we agree on one command parsing behavior (POSIX, i.e. dash) and improve shlex to support that.  People wanting zsh rules can write their own subclass.

shlex is a pretty simple lexer (as lexers go), and I wouldn't want it
to get complicated.  It's easier in the current code structure to
split everything and then re-join as needed.  This also allows you to
select sh vs csh joining rules (e.g. '|&' means different things in sh
vs csh).  Every shell that I've seen follows one of those two flavors
for syntax.
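
The split-then-rejoin idea can be sketched as a small second pass over the token list. Everything here is an assumption for illustration: the function name `rejoin` is hypothetical, and the two operator sets are short samples drawn from the operators mentioned in this thread, not complete sh or csh grammars.

```python
# Illustrative operator sets only; real shells define more operators.
SH_OPERATORS = {'&&', '||', '>>'}
CSH_OPERATORS = {'&&', '||', '>>', '|&'}


def rejoin(tokens, operators):
    """Hypothetical second pass: merge adjacent tokens whose concatenation
    is a known operator for the selected shell flavor."""
    out = []
    for tok in tokens:
        if out and out[-1] + tok in operators:
            out[-1] += tok    # e.g. '&' followed by '&' becomes '&&'
        else:
            out.append(tok)
    return out


tokens = ['a', '|', '&', 'b', '&', '&', 'c']
print(rejoin(tokens, SH_OPERATORS))   # ['a', '|', '&', 'b', '&&', 'c']
print(rejoin(tokens, CSH_OPERATORS))  # ['a', '|&', 'b', '&&', 'c']
```

One limitation of working on a bare token list: it can no longer tell '& &' from '&&', or a quoted '&' from an unquoted one. A real implementation would probably need the lexer to record that information alongside each token.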

>> '&&', '||', '|&', '>>', etc.
> Wouldn’t it be more correct to consider them different tokens?  I don’t have a formal training in CS or programming, so I’m not sure that my definition is correct at all, but in my mind a token is a unit, and thus & and && are two different things.

Ideally, the final tokens have exact meanings.  It's easier to write
handler code for '&&' than for ('&', '&').  This is just a question of
whether the parser joins them together or it's done in a second step.
The current code doesn't do much lookahead, so it's hard for the lexer
to produce things like '&&' directly.

-Dan

History
Date	User	Action	Args
2011-11-26 17:46:37	robodan	set	recipients: + robodan, niemeyer, eric.smith, eric.araujo
2011-11-26 17:46:37	robodan	link	issue1521950 messages
2011-11-26 17:46:36	robodan	create