
classification
Title: urllib.robotparser does not respect the longest match for the rule
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gallicrooster, matelesecretaire67
Priority: normal Keywords: patch

Created on 2020-01-02 04:14 by gallicrooster, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description
test_robot.py gallicrooster, 2020-01-02 04:14 Simple test with a few test cases.
Pull Requests
URL Status Linked
PR 17794 open gallicrooster, 2020-01-02 05:27
Messages (5)
msg359181 - (view) Author: Andre Burgaud (gallicrooster) * Date: 2020-01-02 04:14
As per the current Robots Exclusion Protocol internet draft, https://tools.ietf.org/html/draft-koster-rep-00#section-3.2, a robot should apply the rules by respecting the longest match.

urllib.robotparser instead relies on the order of the rules in the robots.txt file. Here is the relevant section of the spec:

===================
3.2.  Longest Match

   The following example shows that in the case of a two rules, the
   longest one MUST be used for matching.  In the following case,
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallow.gif .

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS> 
===================

I'm attaching a simple test file, "test_robot.py".
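
For reference, a minimal reproduction along the lines of the attached test (the exact contents of test_robot.py may differ slightly) could look like this. Per the longest-match rule, can_fetch() should return False for the disallowed GIF, but the current parser applies the rules in file order and returns True:

from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The longest matching rule is the Disallow line, so the draft requires False.
# The current implementation stops at the first matching rule
# (Allow: /example/page/) and returns True instead.
print(rp.can_fetch("foobot", "https://example.com/example/page/disallowed.gif"))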
msg359184 - (view) Author: Andre Burgaud (gallicrooster) * Date: 2020-01-02 05:46
During testing, I identified a related issue that is fixed by the same sort function implemented to address the longest-match rule.

This related problem, also addressed by this change, covers the situation where two equivalent rules (the same path in both an Allow and a Disallow rule) are found in the robots.txt. In that situation, Allow should be used: https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.2
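
For illustration only (I am not reproducing the actual code from PR 17794 here), a sort along these lines covers both cases: order the rules by path length, longest first, and put Allow before Disallow when the paths are identical. The names below are hypothetical:

# Hypothetical sketch, not the code from the pull request.
# Each rule is represented as a (path, allowance) pair,
# where allowance is True for Allow and False for Disallow.
def sort_rules(rules):
    # Longest path first, so the most specific rule is matched first;
    # for equal paths, True (Allow) sorts before False (Disallow).
    return sorted(rules, key=lambda rule: (len(rule[0]), rule[1]), reverse=True)

rules = [("/example/page/", True), ("/example/page/disallowed.gif", False)]
print(sort_rules(rules))
# [('/example/page/disallowed.gif', False), ('/example/page/', True)]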
msg416871 - (view) Author: matele secretaire (matelesecretaire67) Date: 2022-04-06 14:23
I can't find any documentation about it, but all of the robots.txt checkers I found behave like this. You can test against this site: https://www.st-info.fr/robots.txt. I believe this is how it's implemented in most parsers now?
msg416936 - (view) Author: Andre Burgaud (gallicrooster) * Date: 2022-04-07 17:56
Hi Matele,

Thanks for looking into this issue.

I have indeed seen some implementations that were based on the Python implementation and had the same problem, the Crystal implementation in particular (as far as I remember; it was a while ago). As a reference, I used the Google implementation, https://github.com/google/robotstxt, which respects the internet draft https://datatracker.ietf.org/doc/html/draft-koster-rep-00.

The two main points are described in https://datatracker.ietf.org/doc/html/draft-koster-rep-00#section-2.2.2, especially in the following paragraph:

   "To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most octets.
   If an allow and disallow rule is equivalent, the allow SHOULD be
   used."

1) The most specific match found MUST be used.  The most specific match is the match that has the most octets.
2) If an allow and disallow rule is equivalent, the allow SHOULD be used.

In the robots.txt example you provided, the longest matching rule for admin-ajax.php is Allow: /wp-admin/admin-ajax.php. It therefore takes precedence over the shorter Disallow rule, so that sub-path should be allowed even though its parent directory is disallowed. To achieve that, the sort of the rules should list the Allow rule first.
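
As a rough sketch of how the draft's precedence applies there, assuming the usual WordPress pair of rules (Disallow: /wp-admin/ together with Allow: /wp-admin/admin-ajax.php), simplifying matching to plain path prefixes (no wildcards or '$' anchors), and without claiming this is how the library implements it:

# Simplified illustration of the draft's precedence rules, not library code.
RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

def allowed(path, rules=RULES):
    # Keep every rule whose path is a prefix of the requested path.
    matches = [(len(p), verb == "Allow") for verb, p in rules if path.startswith(p)]
    if not matches:
        return True  # no rule applies, so access is allowed
    # Most octets wins; on a tie, (length, True) > (length, False), so Allow wins.
    return max(matches)[1]

print(allowed("/wp-admin/admin-ajax.php"))  # True: the longer Allow rule wins
print(allowed("/wp-admin/options.php"))     # False: only the Disallow rule matches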

I'm currently traveling, so I'm sorry if my explanations sound a bit limited. If it helps, I can pick up this discussion when I'm back home after mid-April. In particular, I can run new tests with Python 3.10, since I raised this potential problem a bit more than two years ago and I may need to refresh my memory :-)

In the meantime, let me know if there is anything I could provide to give clearer background. For example, are you referring to the two issues I highlighted above, or is it something else you are thinking about? Also, could you point me to the other robots.txt checkers that you looked at?

Thanks!

Andre
msg416987 - (view) Author: matele secretaire (matelesecretaire67) Date: 2022-04-08 14:38
Thank you
History
Date User Action Args
2022-04-11 14:59:24  admin               set     github: 83368
2022-04-08 14:38:14  matelesecretaire67  set     messages: + msg416987
2022-04-07 17:56:33  gallicrooster       set     messages: + msg416936
2022-04-06 14:23:27  matelesecretaire67  set     nosy: + matelesecretaire67; messages: + msg416871
2020-01-02 05:46:34  gallicrooster       set     messages: + msg359184
2020-01-02 05:27:39  gallicrooster       set     keywords: + patch; stage: patch review; pull_requests: + pull_request17227
2020-01-02 04:14:05  gallicrooster       create