Non optimal generated regexp #249

Open
Zabrane opened this issue Jun 12, 2024 · 0 comments

Hi @pemistahl and many thanks for this great piece of software.

I'd like to report a small issue that I expect is easy to fix.

$ grex --version
grex 1.4.5

$ cat bots.txt
baiduspider
bingbot
duckduckgo
googlebot
yandexbot

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:baiduspider|duckduckgo|(?:google|bing)bot|yandexbot)

This is what I was expecting to get:

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:baiduspider|duckduckgo|(?:google|bing|yandex)bot)

yandexbot shares the same suffix "bot" with googlebot and bingbot, so all three could be grouped under a single alternation.
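
To make the comparison concrete, here is a minimal Python sketch that checks both patterns against the five inputs. It only uses the standard `re` module; the names `actual`, `expected`, and `matches_all` are mine, picked for illustration, and nothing here reflects how grex works internally.

```python
import re

# The five test cases from bots.txt.
bots = ["baiduspider", "bingbot", "duckduckgo", "googlebot", "yandexbot"]

# Pattern currently produced by grex 1.4.5, and the pattern I expected.
actual = r"(?i)(?:baiduspider|duckduckgo|(?:google|bing)bot|yandexbot)"
expected = r"(?i)(?:baiduspider|duckduckgo|(?:google|bing|yandex)bot)"


def matches_all(pattern, words):
    """Return True if the pattern fully matches every word."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(word) is not None for word in words)


actual_ok = matches_all(actual, bots)
expected_ok = matches_all(expected, bots)

# Number of characters saved by grouping yandex with google and bing.
saved = len(actual) - len(expected)

print(actual_ok, expected_ok, saved)
```

Both patterns accept the same five words; the expected one is simply shorter because "bot" is written once instead of twice.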

Interestingly, when testing with a reduced list of bots that all share the same suffix, the suffix "bot" is factored out, but the returned regex is still not optimal:

$ cat bots.txt
bingbot
googlebot
yandexbot

$ grex --no-anchors -c -i -f bots.txt
(?i)(?:(?:google|bing)|yandex)bot

This is what I was expecting to get:

$ grex --no-anchors -c -i -f bots.txt
(?i)((?:google|bing|yandex)bot)
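
For illustration, the flat grouping I have in mind can be sketched in a few lines of Python. This is an assumption about one possible approach, not grex's actual algorithm: `factor_common_suffix` is a hypothetical helper that finds the longest suffix shared by all words and puts the remaining stems into a single alternation. It keeps the input order of the alternatives and emits a non-capturing group without an outer group.

```python
import os
import re


def factor_common_suffix(words):
    """Build one flat alternation followed by the suffix shared by all words."""
    # os.path.commonprefix compares character by character, so applying it
    # to the reversed words yields the (reversed) common suffix.
    reversed_words = [word[::-1] for word in words]
    suffix = os.path.commonprefix(reversed_words)[::-1]

    if not suffix:
        # Nothing to factor out: fall back to a plain alternation.
        return "(?:" + "|".join(re.escape(word) for word in words) + ")"

    # Strip the shared suffix from every word to get the stems.
    stems = [word[: len(word) - len(suffix)] for word in words]
    alternation = "|".join(re.escape(stem) for stem in stems)
    return "(?:" + alternation + ")" + re.escape(suffix)


bots = ["bingbot", "googlebot", "yandexbot"]
pattern = "(?i)" + factor_common_suffix(bots)
print(pattern)
```

The sketch handles only the case where every word shares the suffix; the full five-word list would additionally need the words partitioned by suffix first.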

Many thanks
