Bot types #7489

Open
Simbiat opened this issue Oct 21, 2023 · 7 comments

Comments

@Simbiat
Contributor

Simbiat commented Oct 21, 2023

The $categories variable for bots has some ambiguous types:

  • Feed Fetcher, Feed Parser, Feed Reader - what's the difference, really?
  • Read-it-later Service is used for only 2 items, both for the same thing: https://getpocket.com/pocketparser_ua. At the same time, the description on that page clearly says crawling, so should this not be Crawler?
  • Search tools is used for only 1 item: http://www.shopwiki.com/w/Help:Bot. Again, the description clearly states that this is a crawler, so should this not be Crawler?
  • How does Security search bot differ from Security Checker?
  • How does Service bot differ from Service Agent?
  • And probably the biggest one of all: what's the difference between Search bot and Crawler? I mean, crawling is done by search bots, so these seem to be the same thing.

I am fine with creating a PR to harmonize these things a bit, but I think this warrants a proper discussion first.

@liviuconcioiu
Collaborator

#5727

@Simbiat
Contributor Author

Simbiat commented Oct 22, 2023

Hm, in the end that one did not cover the questions above, although it did mention multiple feed bots and resulted in code for validating categories. Essentially, I am talking about cleaning up the types.

@sgiehl
Member

sgiehl commented Oct 30, 2023

I guess we don't have a "clean" definition of categories to use. Feel free to create a PR to clean them up a bit.

@Simbiat
Contributor Author

Simbiat commented Oct 30, 2023

I can add this to #7490. Or would a separate PR be better?

@sgiehl
Member

sgiehl commented Oct 30, 2023

@Simbiat It's better to have a separate PR, as that makes reviewing easier.

@liviuconcioiu
Collaborator

I've come across https://radar.cloudflare.com/traffic/verified-bots, which has a nice classification. Thoughts?

@Simbiat
Contributor Author

Simbiat commented Jul 17, 2024

What that page suggests:

  • Academic Research - used only for Internet Archive, and I am not sure it's the correct category. To me it would probably be a regular Crawler.
  • Accessibility - 3 entries, does make sense for those bots. Probably a valid category, which we can adopt.
  • Advertising & Marketing - based on my knowledge of how these bots work and what they do (which is limited to my short time at Smartly.io), I'd say these could be treated similarly to the Monitoring & Analytics category below.
  • Aggregator - again, looks like a regular Crawler to me; not sure it's worth having this as a separate category.
  • AI Crawler - probably a valid category nowadays, although there are only 3 entries there. On the other hand, "AI" only implies the technology used by the company, not necessarily the purpose of the bot, so regular Crawler could still be fine.
  • Feed Fetcher - same as what we have in 3 categories (Feed Fetcher, Feed Parser, Feed Reader)
  • Monitoring & Analytics - looks similar to our Site Monitor
  • Other - has 2 items which could be considered as Webhooks (category below)
  • Page Preview - essentially search bots, and some app-specific ones
  • Search Engine Crawler - same as our Search bot
  • Search Engine Optimization - same as our Search tools or maybe Site Monitor in some cases
  • Security - same as our Security Checker and Security search bot
  • Social Media Marketing - just Brandwatch in the list, which I would consider a regular crawler
  • Webhooks - this feels a bit generic. I would even say that some Page Preview items could be considered Webhooks as well.

Personally this is what I would do:

  • Add Assistant category, update the bots from CloudFlare's Accessibility bots
  • Benchmark -> move to Inspector
  • Crawler -> keep as is
  • Feed Fetcher -> rename to Aggregator
  • Feed Parser -> move to Aggregator
  • Feed Reader -> move to Aggregator
  • Network Monitor -> move to Inspector
  • Read-it-later Service -> move to Crawler
  • Search bot -> rename to Searcher
  • Search tools -> move to Crawler
  • Security Checker -> move to Inspector
  • Security search bot -> move to Inspector
  • Service Agent -> some can be moved to Inspector, some to Crawler, from a quick glance
  • Service bot -> I'd say Grammarly can probably be treated as Assistant, Vercel as Inspector, and probably ADmantX, too
  • Site Monitor -> move to Inspector
  • Social Media Agent -> mostly image fetchers, essentially, so either Searcher or Crawler
  • Validator -> move to Inspector
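
The list above can be read as a lookup table from old category to new one. Below is a minimal Python sketch of that reading, not code from the project: the names CATEGORY_MAP, NEEDS_REVIEW and migrate_category are hypothetical, and the three categories that the list splits bot by bot are deliberately left for manual review rather than guessed.

```python
from typing import Optional

# Old category -> proposed new category, copied from the list above.
# (Hypothetical helper for illustration; not part of the project.)
CATEGORY_MAP = {
    "Benchmark": "Inspector",
    "Crawler": "Crawler",
    "Feed Fetcher": "Aggregator",
    "Feed Parser": "Aggregator",
    "Feed Reader": "Aggregator",
    "Network Monitor": "Inspector",
    "Read-it-later Service": "Crawler",
    "Search bot": "Searcher",
    "Search tools": "Crawler",
    "Security Checker": "Inspector",
    "Security search bot": "Inspector",
    "Site Monitor": "Inspector",
    "Validator": "Inspector",
}

# Categories the proposal splits per bot, so no single target exists.
NEEDS_REVIEW = {"Service Agent", "Service bot", "Social Media Agent"}


def migrate_category(old: str) -> Optional[str]:
    """Return the proposed new category, or None if the bot needs manual review."""
    if old in NEEDS_REVIEW:
        return None
    if old not in CATEGORY_MAP:
        raise ValueError(f"Unknown category: {old!r}")
    return CATEGORY_MAP[old]
```

Returning None for the split categories keeps the sketch honest: those bots would still have to be looked at one by one.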

So this would leave these categories:

  • Supporter (the Assistant category from the list above) - bots used by various assistive technologies, including, but not limited to, text-to-voice, voice-to-text, image-to-text services, translators and editorial tools.
  • Aggregator - bots used by tools aimed at collection and potential summarization of information from pages, including but not limited to feed readers, link or page collectors and summarization tools.
  • Crawler - bots not falling under other categories or related to generic or multi-purpose services.
  • Inspector - bots used by various tools and services aimed at monitoring, inspecting, validating and/or analyzing content or behavior of websites and users' interactions with them, including for security and/or SEO purposes.
  • Searcher - bots used for services related to search, including, but not limited to search engines and social networks.
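
With only five categories left, checking that every bot uses one of them becomes a simple set lookup. The following Python sketch illustrates the idea under stated assumptions: it is not the project's existing validation code, the function name find_invalid_categories is hypothetical, and each bot is assumed to be a dict with "name" and "category" keys.

```python
# The five categories proposed above.
ALLOWED_CATEGORIES = frozenset(
    {"Supporter", "Aggregator", "Crawler", "Inspector", "Searcher"}
)


def find_invalid_categories(bots):
    """Return the names of bots whose category is missing or not allowed.

    Assumes each bot is a dict with "name" and "category" keys
    (a hypothetical shape chosen for this example).
    """
    return [
        bot.get("name", "<unnamed>")
        for bot in bots
        if bot.get("category") not in ALLOWED_CATEGORIES
    ]


if __name__ == "__main__":
    sample = [
        {"name": "ExampleBot", "category": "Crawler"},
        {"name": "OldFeedBot", "category": "Feed Reader"},
    ]
    print(find_invalid_categories(sample))
```

A check like this would flag any bot still carrying one of the old category names after the cleanup.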

I also tried thinking of an acronym, but the best GPT and I came up with was SCAIS, because it can be pronounced "skies". Not that we need an acronym or these specific names, of course. But I think they strike a good balance between precise and generic.

Any update would require a review of all the bots. I do hope that by the end of the year I will finish going through all the brands (and submit a PR to correct quite a few things there) and start working on bots; when I do, I can adjust their categories as well, of course.
