Author vinay.sajip
Recipients eric.araujo, eric.smith, niemeyer, robodan, vinay.sajip
Date 2012-02-21.17:27:35
Message-id <1329845256.46.0.643149543593.issue1521950@psf.upfronthosting.co.za>
Content
I updated the patch to reflect Éric's comments on Rietveld, but there are also some other changes:

Previously, when punctuation chars were set, wordchars was augmented only with '-'. That was incomplete, so wordchars is now augmented with '~-./*?=', which allows for wildcards, filename characters and argument flags.
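
As a rough illustration of the effect, the sketch below assumes the behaviour described here matches the `punctuation_chars` argument that is available on `shlex.shlex` from Python 3.6 onwards; the command line used is only an example. With the augmented wordchars, flags, wildcards and paths stay whole, while shell operators come back as separate tokens.

```python
import shlex

# Assumption: a Python version (3.6+) where shlex accepts punctuation_chars.
# punctuation_chars=True selects the default punctuation set '();<>|&'.
lexer = shlex.shlex("ls -l *.py && cat ~/notes.txt|wc", punctuation_chars=True)

# '-l', '*.py' and '~/notes.txt' survive as single tokens because
# '~-./*?=' are treated as word characters; '&&' and '|' are split off.
tokens = list(lexer)
print(tokens)
```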

I added a token_type attribute whose value is 'a' for alphanumeric tokens and 'c' for punctuation tokens. The token type is already tracked internally; the patch just exposes it. Callers need it to disambiguate punctuation tokens, because two logically separate punctuation tokens may be returned as a single token if they are not separated by whitespace in the source being tokenised.
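
The ambiguity can be sketched with a small stand-alone tokeniser. This is not the patch's code: `tokenize` and `PUNCTUATION` are hypothetical names used only for illustration, and quoting, escapes and comments are ignored. It only shows why a caller would want an 'a'/'c' type alongside each token.

```python
# Hypothetical stand-in for the lexer; only the 'a'/'c' idea is from the patch.
PUNCTUATION = set("();<>|&")

def tokenize(source):
    """Yield (token, token_type) pairs: 'a' for words, 'c' for punctuation runs."""
    token, kind = "", None
    for ch in source:
        if ch.isspace():
            new_kind = None
        elif ch in PUNCTUATION:
            new_kind = "c"
        else:
            new_kind = "a"
        # A change of character class ends the token collected so far.
        if new_kind != kind and token:
            yield token, kind
            token = ""
        if new_kind is not None:
            token += ch
        kind = new_kind
    if token:
        yield token, kind

pairs = list(tokenize("(cd /tmp);ls"))
for token, token_type in pairs:
    print(token_type, token)
```

Here `)` and `;` come back together as `);` with type 'c', so a caller that sees a 'c' token knows it may have to split the run into individual operators itself.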

New attributes and the changes to wordchars have been documented, and a test added for token_type return values.
History
Date User Action Args
2012-02-21 17:27:36  vinay.sajip  set     recipients: + vinay.sajip, niemeyer, eric.smith, robodan, eric.araujo
2012-02-21 17:27:36  vinay.sajip  set     messageid: <1329845256.46.0.643149543593.issue1521950@psf.upfronthosting.co.za>
2012-02-21 17:27:35  vinay.sajip  link    issue1521950 messages
2012-02-21 17:27:35  vinay.sajip  create