bpo-39187: robotparser does not respect longest match #17794
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA. For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received.

Thanks again for the contribution; we look forward to reviewing it!
serhiy-storchaka left a comment
Thank you for your PR, @andreburgaud.
Unfortunately, this approach cannot be used when the patterns contain the special characters * or $. But since robotparser does not currently support them, at least we will get more correct behavior for robots.txt files that don't contain them.
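For readers following the discussion, the longest-match rule being reviewed can be sketched in a few lines of plain Python. This is a hedged illustration, not the PR's actual code: the function name `is_allowed` and the `(allow, path)` rule format are invented here, and it handles only literal path prefixes, matching the `*`/`$` limitation noted above.

```python
# Illustrative sketch of longest-match rule selection per the REP draft
# (now RFC 9309). Not the PR's code: is_allowed and the (allow, path)
# rule format are invented for this example. Only literal prefixes are
# handled -- no '*' or '$' wildcards.

def is_allowed(rules, url_path):
    """Return True if url_path is allowed under the given rules.

    rules is a list of (allow, path) pairs. The matching rule with the
    longest path wins; on a length tie between an Allow and a Disallow,
    the less restrictive rule (Allow) wins, because the comparison key
    (n, True) sorts above (n, False).
    """
    best = None
    for allow, path in rules:
        if url_path.startswith(path):
            key = (len(path), allow)
            if best is None or key > best:
                best = key
    return True if best is None else best[1]
```

For example, with `Disallow: /folder1/` and `Allow: /folder1/myfile.html`, the longer Allow wins for `/folder1/myfile.html`, while `/folder1/anotherfile.html` only matches the Disallow.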
class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
    # https://tools.ietf.org/html/draft-koster-rep-00#section-3.2
This document is now available as https://datatracker.ietf.org/doc/html/rfc9309. Please update all links.

But this test also passes if all Allow rules have higher priority than any Disallow rule. So please also add reversed rules here (short Allow, long Disallow).
Thank you for the suggestion, @serhiy-storchaka. Working on it.
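The reversed case the reviewer asks for can be sketched as follows. This is a standalone, hedged re-implementation of longest-match for illustration only — the helper name and the `/folder1/private.html` path are invented here, not taken from the test suite:

```python
# Why reversed rules matter: a naive "Allow always beats Disallow" policy
# and a correct longest-match policy agree on (long Allow, short Disallow)
# but disagree on (short Allow, long Disallow). Standalone sketch; the
# function name and paths are invented for this example.

def longest_match_allowed(rules, path):
    # rules: (allow, prefix) pairs; the longest matching prefix wins,
    # and Allow wins equal-length ties per RFC 9309.
    matches = [(len(p), a) for a, p in rules if path.startswith(p)]
    return max(matches)[1] if matches else True

reversed_rules = [
    (True, '/folder1/'),               # short Allow
    (False, '/folder1/private.html'),  # long Disallow (hypothetical path)
]
```

Under these rules, `/folder1/private.html` must be disallowed (the longer Disallow wins), which is exactly the case an "Allow always wins" implementation would get wrong.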
bad = ['/folder1/anotherfile.html']

class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
This test replaces GoogleURLOrderingTest. I think GoogleURLOrderingTest should now be removed, because it has lost its meaning.
Agreed. I'm removing it.
Disallow: /folder1/
Allow: /folder1/
"""
good = ['/folder1/myfile.html', '/folder1', '/folder1']
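Worth noting about the snippet above: `/folder1/myfile.html` matches both the Disallow line and the Allow line with equal prefix length, and RFC 9309 resolves such ties in favor of the least restrictive rule (Allow), while `/folder1` without a trailing slash matches neither rule. A small hedged check of that tie-break (standalone helper invented for this sketch, not a stdlib API):

```python
# Equal-length tie between Allow and Disallow: the least restrictive
# rule (Allow) wins under RFC 9309. Standalone illustration; tie_allowed
# is invented for this sketch.

def tie_allowed(rules, path):
    matches = [(len(p), a) for a, p in rules if path.startswith(p)]
    # On a length tie, (n, True) sorts above (n, False), so Allow wins.
    return max(matches)[1] if matches else True

tie_rules = [(False, '/folder1/'), (True, '/folder1/')]
```

So both entries in `good` are consistent with longest-match semantics: one via the Allow tie-break, the other because no rule matches it at all.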
Double '/folder1'.
Stupid mistake on my end. Thx. Removing the unnecessary double 'folder1'.
https://bugs.python.org/issue39187