Conversation

@andreburgaud andreburgaud commented Jan 2, 2020

  • Added a sort function that orders the rules by longest match
  • Equivalent allow and disallow rules (same path) now resolve to allow
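
As a rough illustration of the two changes listed above (a sketch, not the code in this PR), assume each rule is reduced to a hypothetical `(path, allowed)` pair. The names `Rule`, `sort_rules`, and `is_allowed` are invented for this example and are not part of `urllib.robotparser`.

```python
from typing import List, Tuple

# (path, allowed) pairs: a hypothetical stand-in for robotparser's rule lines.
Rule = Tuple[str, bool]


def sort_rules(rules: List[Rule]) -> List[Rule]:
    """Order rules so the longest path comes first; on equal length, allow first."""
    return sorted(rules, key=lambda rule: (-len(rule[0]), not rule[1]))


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return the verdict of the first (longest) rule whose path is a prefix of `path`."""
    for rule_path, allowed in sort_rules(rules):
        if path.startswith(rule_path):
            return allowed
    return True  # no rule matched: allowed by default


rules = [("/folder1/", False), ("/folder1/myfile.html", True)]
print(is_allowed(rules, "/folder1/myfile.html"))       # True
print(is_allowed(rules, "/folder1/anotherfile.html"))  # False
```

Sorting once and then taking the first prefix match keeps the lookup loop unchanged; only the rule order differs from a first-match-wins parser.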

https://bugs.python.org/issue39187

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@andreburgaud

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@andreburgaud andreburgaud changed the title from "bpo-39817: robotparser does not respect longest match" to "bpo-39187: robotparser does not respect longest match" Jan 2, 2020
Member

@serhiy-storchaka serhiy-storchaka left a comment

Thank you for your PR, @andreburgaud.

Unfortunately, this approach cannot be used when there are special characters * or $ in the patterns. But they are currently not supported in robotparser, so at least we will get more correct behavior for robots.txt files that don't contain them.
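
One way to read this remark, as a sketch only: with plain prefix matching, `*` and `$` are compared as ordinary characters, so a wildcard pattern never matches the URLs it is meant to cover. The helpers `prefix_match` and `wildcard_match` below are hypothetical and not part of `urllib.robotparser`; the wildcard semantics assumed are those of RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the path).

```python
import re


def prefix_match(rule_path: str, url_path: str) -> bool:
    """Plain prefix matching: '*' and '$' are treated as ordinary characters."""
    return url_path.startswith(rule_path)


def wildcard_match(pattern: str, url_path: str) -> bool:
    """Match with '*' as a wildcard and a trailing '$' as an end-of-path anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, url_path) is not None


print(prefix_match("/folder1/", "/folder1/myfile.html"))   # True
print(prefix_match("/*.html$", "/folder1/myfile.html"))    # False
print(wildcard_match("/*.html$", "/folder1/myfile.html"))  # True
```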



class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
# https://tools.ietf.org/html/draft-koster-rep-00#section-3.2
Member

This document is now available as https://datatracker.ietf.org/doc/html/rfc9309. Please update all links.

However, this test also passes if every allow rule has higher priority than any disallow rule. Please also add reversed rules (short allow, long disallow) here.
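
To illustrate why the reversed case matters, here is a minimal sketch comparing two hypothetical policies on `(path, allowed)` pairs. Neither function is taken from the PR or from `urllib.robotparser`; the names are invented for this example.

```python
from typing import List, Tuple

Rule = Tuple[str, bool]  # (path prefix, allowed); hypothetical representation


def longest_match(rules: List[Rule], path: str) -> bool:
    """The longest matching rule decides; on equal length, allow wins."""
    matches = [rule for rule in rules if path.startswith(rule[0])]
    if not matches:
        return True
    return max(matches, key=lambda rule: (len(rule[0]), rule[1]))[1]


def allow_always_wins(rules: List[Rule], path: str) -> bool:
    """A wrong policy: any matching allow rule beats every disallow rule."""
    verdicts = [allowed for rule_path, allowed in rules if path.startswith(rule_path)]
    return not verdicts or any(verdicts)


long_allow = [("/folder1/", False), ("/folder1/myfile.html", True)]
short_allow = [("/folder1/", True), ("/folder1/myfile.html", False)]
path = "/folder1/myfile.html"

# Long allow, short disallow: both policies agree, so a test cannot tell them apart.
print(longest_match(long_allow, path), allow_always_wins(long_allow, path))    # True True
# Short allow, long disallow: only the longest-match policy rejects the path.
print(longest_match(short_allow, path), allow_always_wins(short_allow, path))  # False True
```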

Author

Thank you for the suggestion, @serhiy-storchaka. Working on it.

bad = ['/folder1/anotherfile.html']


class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
Member

This test replaces GoogleURLOrderingTest. I think that GoogleURLOrderingTest should now be removed, because it has lost its meaning.

Author

Agreed. I'm removing it.

Disallow: /folder1/
Allow: /folder1/
"""
good = ['/folder1/myfile.html', '/folder1', '/folder1']
Member

Double '/folder1'.

Author

Stupid mistake on my end. Thanks. Removing the unnecessary duplicate '/folder1'.

@Sanel0101 Sanel0101 mentioned this pull request Nov 24, 2025

5 participants