
Author gallicrooster
Recipients gallicrooster
Date 2020-01-02.04:14:05
Message-id <1577938445.91.0.743054392693.issue39187@roundup.psfhosted.org>
Content
As per the current Robots Exclusion Protocol internet draft (https://tools.ietf.org/html/draft-koster-rep-00#section-3.2), a robot should apply the rule with the longest matching path.

urllib.robotparser instead applies the first rule that matches, i.e. it relies on the order of the rules in the robots.txt file. Here is the relevant section of the spec:

===================
3.2.  Longest Match

   The following example shows that in the case of a two rules, the
   longest one MUST be used for matching.  In the following case,
   /example/page/disallowed.gif MUST be used for the URI
   example.com/example/page/disallow.gif .

   <CODE BEGINS>
   User-Agent : foobot
   Allow : /example/page/
   Disallow : /example/page/disallowed.gif
   <CODE ENDS> 
===================
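
In other words, rule selection should go by the longest matching path, not by file order. A minimal sketch of that selection logic (my own illustration, ignoring wildcards and percent-encoding; longest_match_allowed is a hypothetical helper, not part of urllib). Note that the draft text's URI "disallow.gif" appears to be a typo for "disallowed.gif"; the sketch uses the latter so that both rules actually match:

def longest_match_allowed(rules, path):
    # rules: list of (allow, rule_path) pairs taken from robots.txt.
    # A rule applies when its path is a prefix of the requested path.
    matches = [(allow, rule) for allow, rule in rules if path.startswith(rule)]
    if not matches:
        return True  # no rule matches: fetching is allowed
    # Section 3.2: the longest matching rule MUST be used.
    allow, _ = max(matches, key=lambda m: len(m[1]))
    return allow

rules = [(True, "/example/page/"), (False, "/example/page/disallowed.gif")]
print(longest_match_allowed(rules, "/example/page/disallowed.gif"))  # False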

I'm attaching a simple test file, "test_robot.py".
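
For reference, the test boils down to something like this (a sketch; the attached file may differ):

import urllib.robotparser

# robots.txt taken from the draft's longest-match example.
ROBOTS_TXT = """\
User-agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Per section 3.2 the longer Disallow rule must win, so this should
# print False.  urllib.robotparser applies the Allow rule instead,
# because it comes first in the file, and prints True.
print(parser.can_fetch("foobot", "http://example.com/example/page/disallowed.gif"))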