Issue37071
Created on 2019-05-28 01:23 by htran, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Files | ||
---|---|---|
File name | Uploaded | Description |
arguments.html | htran, 2019-05-28 01:22 | The example file taken from Blender's documentation |
Messages (3) | |||
---|---|---|---|
msg343724 - (view) | Author: Hoang Duy Tran (htran) | Date: 2019-05-28 01:22 | |
I have been working with some 'difficult' HTML files generated by Sphinx's RST. The following block of text is the RST original content: ---------------------------------------------------- Animation Playback Options ========================== ``-a`` ``<options>`` ``<file(s)>`` Playback ``<file(s)>``, only operates this way when not running in background. ``-p`` ``<sx>`` ``<sy>`` Open with lower left corner at ``<sx>``, ``<sy>``. ``-m`` Read from disk (Do not buffer). ``-f`` ``<fps>`` ``<fps-base>`` Specify FPS to start with. ``-j`` ``<frame>`` Set frame step to ``<frame>``. ``-s`` ``<frame>`` Play from ``<frame>``. ``-e`` ``<frame>`` Play until ``<frame>``. ---------------------------------------------------- This is the HTML block that is generated by Sphinx: ---------------------------------------------------- <section ids="animation-playback-options" names="animation\ playback\ options"><title>Animation Playback Options</title><definition_list><definition_list_item><term><literal>-a</literal> <literal><options></literal> <literal><file(s)></literal></term><definition><paragraph>Playback <literal><file(s)></literal>, only operates this way when not running in background.</paragraph><definition_list><definition_list_item><term><literal>-p</literal> <literal><sx></literal> <literal><sy></literal></term><definition><paragraph>Open with lower left corner at <literal><sx></literal>, <literal><sy></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-m</literal></term><definition><paragraph>Read from disk (Do not buffer).</paragraph></definition></definition_list_item><definition_list_item><term><literal>-f</literal> <literal><fps></literal> <literal><fps-base></literal></term><definition><paragraph>Specify FPS to start with.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-j</literal> <literal><frame></literal></term><definition><paragraph>Set frame step to 
<literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-s</literal> <literal><frame></literal></term><definition><paragraph>Play from <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-e</literal> <literal><frame></literal></term><definition><paragraph>Play until <literal><frame></literal>.</paragraph></definition></definition_list_item></definition_list></definition></definition_list_item></definition_list></section> ---------------------------------------------------- I then used BeautifulSoup, which relies on HTMLParser, to parse and prettify the HTML document. I noticed that every piece of data that starts with "<" and ends with ">", for example <options> or <file(s)>, is misinterpreted by HTMLParser as a tag, and a closing tag is then invented for it, i.e. <literal> <options> </options> </literal> and <literal> <file(s)> </file(s)> </literal>. When converting the HTML back to plain text, the original data is dropped, which results in truncation and loss of data. Here is the prettified output; the affected lines are marked with '#**************************' to make them easier to identify. ---------------------------------------------------- <section ids="animation-playback-options" names="animation\ playback\ options"> <title> Animation Playback Options </title> <definition_list> <definition_list_item> <term> <literal> -a </literal> <literal> <options> #************************** </options> #************************** </literal> <literal> <file(s)> #************************** </file(s)> #************************** </literal> </term> <definition> <paragraph> Playback <literal> <file(s)> #************************** </file(s)> #************************** </literal> , only operates this way when not running in background. 
</paragraph> <definition_list> <definition_list_item> <term> <literal> -p </literal> <literal> <sx> #************************** </sx> #************************** </literal> <literal> <sy> #************************** </sy> #************************** </literal> </term> <definition> <paragraph> Open with lower left corner at <literal> <sx> #************************** </sx> #************************** </literal> , <literal> <sy> #************************** </sy> #************************** </literal> . </paragraph> </definition> </definition_list_item> <definition_list_item> <term> <literal> -m </literal> </term> <definition> <paragraph> Read from disk (Do not buffer). </paragraph> </definition> </definition_list_item> <definition_list_item> <term> <literal> -f </literal> <literal> <fps> #************************** </fps> #************************** </literal> <literal> <fps-base> #************************** </fps-base> #************************** </literal> </term> <definition> <paragraph> Specify FPS to start with. </paragraph> </definition> </definition_list_item> <definition_list_item> <term> <literal> -j </literal> <literal> <frame/> #************************** </literal> </term> <definition> <paragraph> Set frame step to <literal> <frame/> #************************** </literal> . </paragraph> </definition> </definition_list_item> <definition_list_item> <term> <literal> -s </literal> <literal> <frame/> #************************** </literal> </term> <definition> <paragraph> Play from <literal> <frame/> #************************** </literal> . </paragraph> </definition> </definition_list_item> <definition_list_item> <term> <literal> -e </literal> <literal> <frame/> #************************** </literal> </term> <definition> <paragraph> Play until <literal> <frame/> #************************** </literal> . 
</paragraph> </definition> </definition_list_item> </definition_list> </definition> </definition_list_item> </definition_list> </section> ---------------------------------------------------- I have attached the HTML file generated by Sphinx so that you can test this issue with the actual data. Here is the URL of the HTML file: https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html Kind regards, Hoang Tran |
|||
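The behaviour reported above can be seen with the standard library alone. The following is a minimal sketch, not taken from the thread: the class name `DataCollector` and the helper `parse` are hypothetical, and it assumes you control the text before it is wrapped in markup (which may not be the case for Sphinx output). It shows that an unescaped `<options>` is reported as a start tag, while the same text passed through `html.escape` arrives as data.

```python
from html import escape
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records start-tag names and text data."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


def parse(markup):
    """Feed markup to a fresh parser; return (start tags, joined text)."""
    collector = DataCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.data)


# Unescaped placeholder: the parser sees <options> as a start tag.
raw_tags, raw_text = parse("<literal><options></literal>")

# Escaped placeholder: the parser sees &lt;options&gt; as character data.
escaped_markup = f"<literal>{escape('<options>')}</literal>"
escaped_tags, escaped_text = parse(escaped_markup)
```

In the first call the placeholder never reaches `handle_data`, which matches the data loss described in the report; in the second call the text survives because the angle brackets are character references rather than markup.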
msg344125 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2019-05-31 21:21 | |
Please verify with 3.7.3+ and the latest version of Sphinx. Even if there is a problem, Sphinx is not a stdlib package. The problem would only be relevant to this tracker, rather than the Sphinx tracker, if it were due to our customizations or use of Sphinx. |
|||
msg346210 - (view) | Author: Cheryl Sabella (cheryl.sabella) * | Date: 2019-06-21 13:14 | |
Thank you for the report. Looking at the BeautifulSoup source, there is a comment about this scenario:

# Unlike other parsers, html.parser doesn't send separate end tag
# events for empty-element tags. (It's handled in
# handle_startendtag, but only if the original markup looked like
# <tag/>.)
#
# So we need to call handle_endtag() ourselves. Since we
# know the start event is identical to the end event, we
# don't want handle_endtag() to cross off any previous end
# events for tags of this name.

HTMLParser itself produces output such as:

>>> from html.parser import HTMLParser
>>> class MyParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print(f'start: {tag}')
...     def handle_endtag(self, tag):
...         print(f'end: {tag}')
...     def handle_data(self, data):
...         print(f'data: {data}')
...
>>> parser = MyParser()
>>> parser.feed('<p><test></p>')
start: p
start: test
end: p

My suggestion would be to try a different parser in BeautifulSoup [1] to handle this. Even if we wanted to modify HTMLParser, any such change would probably be backwards incompatible.

[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser |
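As a complement to switching parsers, one possible workaround that stays within the standard library is to subclass HTMLParser and treat any tag outside a known set as literal text. This is only a sketch under stated assumptions: `KNOWN_TAGS`, `PlaceholderPreservingParser`, and `extract_text` are hypothetical names, and the tag set is an assumption based on the element names that appear in the reporter's sample. It relies on `HTMLParser.get_starttag_text()`, which returns the source text of the most recently opened start tag.

```python
from html.parser import HTMLParser

# Assumption: element names that count as real markup in the sample document.
KNOWN_TAGS = frozenset({
    "section", "title", "definition_list", "definition_list_item",
    "term", "definition", "paragraph", "literal",
})


class PlaceholderPreservingParser(HTMLParser):
    """Hypothetical parser that keeps unknown tags such as <options> as text."""

    def __init__(self, known_tags=KNOWN_TAGS):
        super().__init__()
        self.known_tags = known_tags
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.known_tags:
            # Not real markup: keep the tag's original source text as data.
            self.text_parts.append(self.get_starttag_text())

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_text(markup):
    """Return the text content of markup, preserving unknown tags verbatim."""
    parser = PlaceholderPreservingParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text_parts)


sample = "<paragraph>Playback <literal><options></literal> now</paragraph>"
recovered = extract_text(sample)
```

The trade-off is that the list of known tags has to be maintained by hand, so this only suits input whose vocabulary is small and fixed; for general HTML, a different BeautifulSoup parser as suggested above is likely the simpler route.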
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:15 | admin | set | github: 81252 |
2019-06-21 13:14:58 | cheryl.sabella | set | status: open -> closed nosy: + cheryl.sabella messages: + msg346210 resolution: third party stage: resolved |
2019-05-31 21:21:09 | terry.reedy | set | nosy: + terry.reedy messages: + msg344125 versions: + Python 3.7, - Python 3.6 |
2019-05-28 05:43:55 | SilentGhost | set | nosy: + ezio.melotti |
2019-05-28 01:23:00 | htran | create |