
classification
Title: HTMLParser mistakenly inventing new tags while parsing
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: cheryl.sabella, ezio.melotti, htran, terry.reedy
Priority: normal Keywords:

Created on 2019-05-28 01:23 by htran, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
arguments.html htran, 2019-05-28 01:22 The example file taken from Blender's documentation
Messages (3)
msg343724 - (view) Author: Hoang Duy Tran (htran) Date: 2019-05-28 01:22
I have been working with some 'difficult' HTML files generated by Sphinx from RST sources. The following block of text is the original RST content:

----------------------------------------------------
Animation Playback Options
==========================

``-a`` ``<options>`` ``<file(s)>``
   Playback ``<file(s)>``, only operates this way when not running in background.

   ``-p`` ``<sx>`` ``<sy>``
      Open with lower left corner at ``<sx>``, ``<sy>``.
   ``-m``
      Read from disk (Do not buffer).
   ``-f`` ``<fps>`` ``<fps-base>``
      Specify FPS to start with.
   ``-j`` ``<frame>``
      Set frame step to ``<frame>``.
   ``-s`` ``<frame>``
      Play from ``<frame>``.
   ``-e`` ``<frame>``
      Play until ``<frame>``.
----------------------------------------------------

This is the HTML block that is generated by Sphinx:

----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options"><title>Animation Playback Options</title><definition_list><definition_list_item><term><literal>-a</literal> <literal><options></literal> <literal><file(s)></literal></term><definition><paragraph>Playback <literal><file(s)></literal>, only operates this way when not running in background.</paragraph><definition_list><definition_list_item><term><literal>-p</literal> <literal><sx></literal> <literal><sy></literal></term><definition><paragraph>Open with lower left corner at <literal><sx></literal>, <literal><sy></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-m</literal></term><definition><paragraph>Read from disk (Do not buffer).</paragraph></definition></definition_list_item><definition_list_item><term><literal>-f</literal> <literal><fps></literal> <literal><fps-base></literal></term><definition><paragraph>Specify FPS to start with.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-j</literal> <literal><frame></literal></term><definition><paragraph>Set frame step to <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-s</literal> <literal><frame></literal></term><definition><paragraph>Play from <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-e</literal> <literal><frame></literal></term><definition><paragraph>Play until <literal><frame></literal>.</paragraph></definition></definition_list_item></definition_list></definition></definition_list_item></definition_list></section>
----------------------------------------------------

I then used BeautifulSoup, which uses HTMLParser, to parse and prettify the HTML document, and I noticed that every piece of data that starts with "<" and ends with ">", for example:

<options>
<file(s)>
....

is misinterpreted by the HTMLParser library as a TAG, and a CLOSING TAG is then INVENTED for it,

i.e.

      <literal>
       <options>
       </options>
      </literal>

and

       <literal>
        <file(s)>
        </file(s)>
       </literal>

When reversing the process, i.e. turning the HTML back into plain text, the original data is dropped, leading to TRUNCATION/LOSS of DATA.
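
The loss described above can be reproduced with the standard library alone. The following is a minimal sketch, not the reporter's actual script: the `TextCollector` class and `extract_text` helper are hypothetical names introduced for illustration, and the escaped variant is included only for contrast and is not taken from the attached file.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect only character data, as a naive HTML-to-text step would."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only text between tags arrives here; tag names never do.
        self.chunks.append(data)


def extract_text(markup):
    """Return the concatenated character data found in markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# Unescaped angle brackets: "<options>" is parsed as a start tag.
raw = "<literal>-a</literal> <literal><options></literal>"
# Same content with the brackets written as character references.
escaped = "<literal>-a</literal> <literal>&lt;options&gt;</literal>"

print(repr(extract_text(raw)))      # the "<options>" text is missing
print(repr(extract_text(escaped)))  # the "<options>" text survives
```

In the first call the parser reports `options` as a tag, so it never reaches `handle_data`; in the second call the character references are converted back to `<options>` and delivered as data.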

Here is the prettified output; the problem lines are marked with '#**************************' to make them easier to identify.

----------------------------------------------------
  <section ids="animation-playback-options" names="animation\ playback\ options">
   <title>
    Animation Playback Options
   </title>
   <definition_list>
    <definition_list_item>
     <term>
      <literal>
       -a
      </literal>
      <literal>
       <options> #**************************
       </options> #**************************
      </literal>
      <literal>
       <file(s)> #**************************
       </file(s)> #**************************
      </literal>
     </term>
     <definition>
      <paragraph>
       Playback
       <literal>
        <file(s)> #**************************
        </file(s)> #**************************
       </literal>
       , only operates this way when not running in background.
      </paragraph>
      <definition_list>
       <definition_list_item>
        <term>
         <literal>
          -p
         </literal>
         <literal>
          <sx> #**************************
          </sx> #**************************
         </literal>
         <literal>
          <sy> #**************************
          </sy> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Open with lower left corner at
          <literal>
           <sx> #**************************
           </sx> #**************************
          </literal>
          ,
          <literal>
           <sy> #**************************
           </sy> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -m
         </literal>
        </term>
        <definition>
         <paragraph>
          Read from disk (Do not buffer).
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -f
         </literal>
         <literal>
          <fps> #**************************
          </fps> #**************************
         </literal>
         <literal>
          <fps-base> #**************************
          </fps-base> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Specify FPS to start with.
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -j
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Set frame step to
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -s
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Play from
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -e
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Play until
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
      </definition_list>
     </definition>
    </definition_list_item>
   </definition_list>
  </section>
----------------------------------------------------
I have attached the HTML file generated by Sphinx so that you can test this issue with the actual data.

Here is the URL of the HTML file:

https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html


Kind Regards,
Hoang Tran
msg344125 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-05-31 21:21
Please verify with 3.7.3+ and the latest version of Sphinx. Even if there is a problem, Sphinx is not a stdlib package. The problem would only be relevant to this tracker, rather than the Sphinx tracker, if it were due to our customizations or use of Sphinx.
msg346210 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-06-21 13:14
Thank you for the report.

Looking at the BeautifulSoup source, there is a comment about this scenario:
            # Unlike other parsers, html.parser doesn't send separate end tag
            # events for empty-element tags. (It's handled in
            # handle_startendtag, but only if the original markup looked like
            # <tag/>.)
            #
            # So we need to call handle_endtag() ourselves. Since we
            # know the start event is identical to the end event, we
            # don't want handle_endtag() to cross off any previous end
            # events for tags of this name.


HTMLParser itself produces output such as:
>>> from html.parser import HTMLParser
>>> class MyParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print(f'start: {tag}')
...     def handle_endtag(self, tag):
...         print(f'end: {tag}')
...     def handle_data(self, data):
...         print(f'data: {data}')
...
>>> parser = MyParser()
>>> parser.feed('<p><test></p>')
start: p
start: test
end: p

My suggestion would be to try a different parser in BeautifulSoup [1] to handle this.  Even if we wanted to modify HTMLParser, any such change would probably be backwards incompatible.

[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
History
Date                 User            Action  Args
2022-04-11 14:59:15  admin           set     github: 81252
2019-06-21 13:14:58  cheryl.sabella  set     status: open -> closed; nosy: + cheryl.sabella; messages: + msg346210; resolution: third party; stage: resolved
2019-05-31 21:21:09  terry.reedy     set     nosy: + terry.reedy; messages: + msg344125; versions: + Python 3.7, - Python 3.6
2019-05-28 05:43:55  SilentGhost     set     nosy: + ezio.melotti
2019-05-28 01:23:00  htran           create