classification
Title: IDLE menu option to convert non-ascii quotes & other?
Type: enhancement Stage: test needed
Components: IDLE Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: cheryl.sabella, rhettinger, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2019-03-07 00:10 by rhettinger, last changed 2019-03-09 01:27 by rhettinger.

Messages (5)
msg337350 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-07 00:10
Some of my students routinely have to copy code samples from PDF documents where the regular ASCII quotation marks that Python accepts have been replaced by smart quotes.  Let's add an Edit menu option to fix smart quotes.
msg337358 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-07 03:53
Also dashes and hyphens to minuses and non-breaking spaces to normal spaces.
msg337429 - Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-03-07 19:07
Would it be worthwhile to automatically convert the text when it's being pasted or would there be a scenario where it would be desirable to keep these characters in the text?  It seems the point here is that the user wouldn't even realize that the quotes (or dashes) being copied aren't the right ones and they would have to learn to take the extra step of formatting the text.  That seems annoying, so maybe automatic conversion would eliminate that?

For the menu option route, in the editor there is an additional 'Format' menu which has some text manipulation options, but the Shell doesn't have this menu available.  There aren't any formatting options on the 'Edit' menu currently.  Would it be better to add a 'Format' menu to the Shell or have this on the 'Edit' menu (which is already getting long)?

For the actual text conversion, I pasted some smart quotes on Windows and they pasted as \u2018\u2018 (two single left quotation marks) and \u2019\u2019 (two single right quotation marks) instead of \u201C (double left) and \u201D (double right). \u0060 (grave accent) and \u00B4 (acute accent) also seem to be possible values that are used for quotes, although converting them automatically may be more problematic.

I think for starters the idea would be:
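# str.replace returns a new string, so the result is reassigned each time.
# The doubled single quotes are handled first, before the single-quote passes.
text = text.replace('\u2018\u2018', '"')
text = text.replace('\u2019\u2019', '"')
text = text.replace('\u2018', "'")
text = text.replace('\u2019', "'")
text = text.replace('\u201C', '"')
text = text.replace('\u201D', '"')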
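
For illustration only, the replacements listed above could also be collected into one small function. This is a minimal sketch, not IDLE code: the names SMART_QUOTES and fix_smart_quotes are hypothetical, and it assumes that doubled single smart quotes should always become one ASCII double quote.

```python
# Hypothetical helper, not part of IDLE.
SMART_QUOTES = {
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201C': '"',  # left double quotation mark
    '\u201D': '"',  # right double quotation mark
}


def fix_smart_quotes(text):
    """Return text with smart quotes replaced by ASCII quotes."""
    # Doubled single quotes are collapsed first; otherwise the
    # single-character pass would turn each pair into two apostrophes.
    for pair in ('\u2018\u2018', '\u2019\u2019'):
        text = text.replace(pair, '"')
    # One pass over the text for the remaining single-character cases.
    return text.translate(str.maketrans(SMART_QUOTES))
```

Calling fix_smart_quotes on a pasted line such as print(\u201Chello\u201D) would give the line with plain ASCII double quotes.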

The dash may be more complicated since there are more of them, unless the Unicode category could be used.
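
A rough sketch of the category idea, using the standard unicodedata module. The function name is hypothetical, and mapping every 'Pd' (dash punctuation) character to '-' and every 'Zs' (space separator) character to a plain space is an assumption that is probably too broad for real use: 'Pd' also contains dashes from other scripts, while the mathematical minus sign U+2212 is in category 'Sm' and is not touched.

```python
import unicodedata


def normalize_dashes_and_spaces(text):
    """Map dash punctuation to '-' and space separators to ' ' (sketch)."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == 'Pd':      # hyphen, en dash, em dash, ...
            out.append('-')
        elif category == 'Zs':    # no-break space, en space, ...
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)
```

ASCII '-' and ' ' are themselves in these categories, so they pass through unchanged.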
msg337549 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-03-09 01:04
I support adding a new function, with these notes.

1. Let's limit the scope to actual reversible bugs introduced by 3rd party software we care about.  Let's not try to anticipate every possible issue.  Also, once we have a function to replace some unicode chars, I can imagine users requesting replacement of other unicode chars, such as math X-like multiplication symbol by '*'.  I am pretty sure that encouraging intentional unicode extensions would not pass core-dev review. ;-)

Raymond, do users encounter all of the characters and combinations Cheryl suggested?  Serhiy, do you know if real pdfs make the other changes you pointed at? Can you provide or suggest a specific test string?

2. I want to put the new feature on the Format menu.  A) The Edit menu is already overly long and B) the other items on Format already do various selection or whole-text fixups (inserts, replacements, and deletions). Possible menu entry: 'Replace non-ascii chars'.  This is 23 chars; the current longest entry is 25.  A 'hotkey' is not needed for something so rarely used.  (Some of the other items on Format don't need them either.)

I think including Format on the Shell menu, with a subset of entries active, should be a follow-up issue.  Another possible follow-up is to check pasted or opened text and offer to edit if appropriate.  I am wary of doing so automatically, especially to start.

3. We should not replace within strings and comments, but mangled strings may be hard to recognize as such.  Suppose '’' is mangled to ‘’’ (\u2018\u2019\u2019, open-close-close).  I am not sure how we would recognize that the middle character should be left as is, except to reject anything that results in a syntax error.  I would rather make too few edits than too many.  I will be happy if we can start with something useful, not wrong, tested, and documented.
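
One way the "reject anything that results in a syntax error" check might look, as a sketch under stated assumptions: the name convert_if_valid and the QUOTES table are hypothetical, the whole text is converted in one step, and the built-in compile() is used only as a validity test. It does not solve the harder problem of leaving intentional non-ASCII characters inside valid strings and comments alone.

```python
# Hypothetical translation table for the four common smart quotes.
QUOTES = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201C': '"', '\u201D': '"',
})


def convert_if_valid(source):
    """Return the converted source, or the original if conversion breaks it."""
    candidate = source.translate(QUOTES)
    try:
        # Only checks that the candidate parses; nothing is executed.
        compile(candidate, '<pasted text>', 'exec')
    except (SyntaxError, ValueError):
        return source       # too risky: keep the text untouched
    return candidate
```

With the open-close-close example above, converting all three characters yields an unterminated triple-quoted string, so the sketch leaves that text as it was.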
msg337550 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-03-09 01:27
> Raymond, do users encounter all of the characters and combinations Cheryl suggested?

The only recurring issue is with the smart quotes.

For anything else, perhaps there can be a box on the General configuration tab for additional source/dest replacement pairs.
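
A sketch of how user-supplied source/dest pairs might be merged with built-in defaults. Everything here is hypothetical (DEFAULT_PAIRS, build_pairs, apply_pairs); how IDLE would actually store or read such a setting is not addressed.

```python
# Hypothetical built-in defaults: only the smart quotes.
DEFAULT_PAIRS = {
    '\u2018': "'", '\u2019': "'",
    '\u201C': '"', '\u201D': '"',
}


def build_pairs(extra_pairs=()):
    """Merge user pairs over the defaults; user entries win on conflict."""
    pairs = dict(DEFAULT_PAIRS)
    pairs.update(extra_pairs)
    return pairs


def apply_pairs(text, pairs):
    """Apply each replacement, longest source string first."""
    ordered = sorted(pairs.items(), key=lambda kv: len(kv[0]), reverse=True)
    for source, dest in ordered:
        text = text.replace(source, dest)
    return text
```

For example, a user who wants the multiplication sign replaced could pass {'\u00d7': '*'} to build_pairs and apply the result to the selected text.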
History
Date User Action Args
2019-03-09 01:27:51 rhettinger set messages: + msg337550
2019-03-09 01:04:16 terry.reedy set type: enhancement
title: Add edit option in IDLE to convert smart quotes to ascii quotes -> IDLE menu option to convert non-ascii quotes & other?
messages: + msg337549
stage: test needed
2019-03-07 19:07:12 cheryl.sabella set nosy: + cheryl.sabella
messages: + msg337429
2019-03-07 03:53:57 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg337358
2019-03-07 00:10:03 rhettinger create