Message 363659 - Python tracker


Author	brian.gallagher
Recipients	brian.gallagher, lemburg, tim.peters
Date	2020-03-08.12:57:23
Message-id	<1583672243.36.0.884165031963.issue39891@roundup.psfhosted.org>
In-reply-to

Content

I agree that there is an appeal to leaving any normalization to the application and that trying to guess what people want is a rabbit hole -- I hadn't even considered what casing would mean in a general sense for Unicode.

I'm not entirely convinced that this should be pursued either, but I'll refine my proposal, provide a little context in which I thought it could be a problem and see what you guys think.

1. Some code is written that assumes get_close_matches() will match on a case-insensitive basis. Only a small bit of testing is done because the functionality is provided by the standard library, not the application code, so we throw a few examples like 'apple' and 'ape' at it and decide it is okay. We later discover we have a bug because we actually need to match against 'AppLE' too.
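To make scenario 1 concrete, here is a minimal sketch of the surprise. The candidate list is made up for illustration; only difflib's documented functions are used.

```python
from difflib import SequenceMatcher, get_close_matches

# Illustrative candidates, chosen to mirror the example in point 1.
candidates = ['ape', 'AppLE']

# Matching is case-sensitive: only 'ape' clears the default cutoff of 0.6.
matches = get_close_matches('apple', candidates)
print(matches)  # ['ape']

# The similarity ratio shows why: 'AppLE' and 'apple' share only "pp".
mixed_ratio = SequenceMatcher(None, 'AppLE', 'apple').ratio()
print(mixed_ratio)  # 0.4
```

So a quick test with all-lowercase inputs passes, while the mixed-case spelling of the very same word is silently dropped.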

2. The extension I had in mind was to match on a case-insensitive basis for only the alphabetic characters. I don't know much about Unicode, but there are definitely gotchas lurking in my previous statement (titlecase vs. uppercase), so copying the behaviour of str.upper()/str.lower() would seem reasonable to me. The functionality would only be extended to match the same strings it would anyway, but now ignoring casing. We wouldn't be eliminating any existing matches. I guess this still has the potential to be a breaking change, since someone might indirectly be depending on the current behaviour.
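A rough sketch of what point 2 could look like if done outside difflib. The helper name get_close_matches_icase is hypothetical (it is not part of the standard library), and using str.lower() as the normalization is an assumption taken from the paragraph above.

```python
from difflib import get_close_matches

def get_close_matches_icase(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper: case-insensitive get_close_matches().

    Compares lowercased strings but returns the original spellings.
    """
    # Map each lowercased form back to every original spelling that produced it.
    originals = {}
    for candidate in possibilities:
        originals.setdefault(candidate.lower(), []).append(candidate)

    lowered = get_close_matches(word.lower(), list(originals), n=n, cutoff=cutoff)

    result = []
    for key in lowered:
        result.extend(originals[key])
    return result[:n]

candidates = ['ape', 'AppLE']
print(get_close_matches_icase('apple', candidates))       # ['AppLE', 'ape']

# With n=1 the two approaches return different results.
print(get_close_matches('apple', candidates, n=1))        # ['ape']
print(get_close_matches_icase('apple', candidates, n=1))  # ['AppLE']
```

The n=1 case is one way the breaking-change concern could show up: a newly matched string can outrank, and therefore displace, a result that callers get today.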

For 1., not testing that your code handles mixed-case comparisons the way you assume it will is probably your own fault. On the other hand, I think it is reasonable to assume that get_close_matches() will match an uppercase/lowercase counterpart, since the function's intent is to provide intuitive matches that "look right" to a human.

Maybe this is more of a documentation issue than something that needs to be addressed in the code. If a caveat about the case sensitivity of the function is added to the documentation, then a developer can be aware of the limitation in order to provide any normalization they want in the application code.
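For illustration, this is roughly what normalizing in the application might look like, and why the choice of normalization matters for non-ASCII text. The normalize() function and the sample strings are assumptions for the sketch; str.casefold() is just one possible policy.

```python
from difflib import get_close_matches

# str.lower() keeps the German sharp s, so these two spellings stay different.
assert 'straße'.lower() != 'STRASSE'.lower()
# str.casefold() maps 'ß' to 'ss', so they compare equal.
assert 'straße'.casefold() == 'STRASSE'.casefold()

def normalize(text):
    # Hypothetical application-level policy; an application could pick
    # str.lower(), str.casefold(), or something stricter.
    return text.casefold()

candidates = ['STRASSE', 'Apple']
# Note: candidates that normalize to the same key would overwrite each other here.
by_key = {normalize(c): c for c in candidates}

found = [by_key[m] for m in get_close_matches(normalize('straße'), list(by_key))]
print(found)  # ['STRASSE']
```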

Let me know what you guys think.

History
Date	User	Action	Args
2020-03-08 12:57:23	brian.gallagher	set	recipients: + brian.gallagher, lemburg, tim.peters
2020-03-08 12:57:23	brian.gallagher	set	messageid: <1583672243.36.0.884165031963.issue39891@roundup.psfhosted.org>
2020-03-08 12:57:23	brian.gallagher	link	issue39891 messages
2020-03-08 12:57:23	brian.gallagher	create