Add "bot" (flags) detection #46

Open
2 tasks
DanielSiepmann opened this issue Sep 28, 2020 · 9 comments
Assignees
Labels
enhancement New feature or request funding This issue can be funded but won't make it otherwise

Comments

@DanielSiepmann
Owner

DanielSiepmann commented Sep 28, 2020

Right now the extension is not aware of bots.
It would make sense to add a flag to all records that defines whether the tracking was triggered by a bot.

Those should then be allowed to be excluded from widgets. That way widgets can be created for humans as well as bots.
This way one can check which pages are called by humans and by bots, as well as which operating systems are used by bots and humans.

All bot logic should be removed from the existing rule.
Instead, a new option should be introduced that contains this logic.

Integrated into the existing update logic, existing records can be marked as bots afterwards. No existing data should be lost.

Open Todos:

  • Current inconsistency: I sometimes use "flag" and sometimes "tag"; this should be streamlined.
  • Fix typo Unkown
@DanielSiepmann DanielSiepmann added the enhancement New feature or request label Sep 28, 2020
@DanielSiepmann DanielSiepmann self-assigned this Sep 28, 2020
@DanielSiepmann
Owner Author

That feature can be abstracted and combined with the existing operating system feature.
Instead of adding detection upon detection, each with dedicated database fields, here is another approach:
Use tags. Integrators should be able to define arbitrary tags, e.g. in configuration as keys. Each tag has a rule as its value, which receives the same input as the "should track" rule.
Everyone can define the tags important for a project. Tags can be "bot:yes" or "bot:no" as well as "os:windows", "os:unix", etc.

We wouldn't need to add feature over feature, but implement a single flexible feature.
Widgets should be able to filter by those tags. E.g. show only page views with tag "bot:yes".
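
The tag idea above could be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: `Request`, `Pageview`, `TAG_RULES`, `track` and `filter_by_tag` are hypothetical names, and the user agent checks are deliberately naive placeholders for real rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Request:
    # Hypothetical, simplified stand-in for the request a rule receives.
    url: str
    user_agent: str


@dataclass
class Pageview:
    # Hypothetical tracked record carrying the tags that matched.
    url: str
    tags: Set[str] = field(default_factory=set)


# Hypothetical configuration: tag name as key, rule as value.
# The substring checks are placeholders, not real bot or OS detection.
TAG_RULES: Dict[str, Callable[[Request], bool]] = {
    "bot:yes": lambda r: "bot" in r.user_agent.lower(),
    "bot:no": lambda r: "bot" not in r.user_agent.lower(),
    "os:windows": lambda r: "windows" in r.user_agent.lower(),
}


def track(request: Request) -> Pageview:
    """Attach every tag whose rule matches the request."""
    tags = {tag for tag, rule in TAG_RULES.items() if rule(request)}
    return Pageview(url=request.url, tags=tags)


def filter_by_tag(pageviews: List[Pageview], tag: str) -> List[Pageview]:
    """What a widget might do: keep only page views carrying the tag."""
    return [view for view in pageviews if tag in view.tags]
```

A widget showing only bot traffic would then call something like `filter_by_tag(views, "bot:yes")`, and a project that cares about other distinctions only adds further entries to the rule mapping.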

@DanielSiepmann
Owner Author

Blocks proper tests for this: TYPO3/testing-framework#256

DanielSiepmann added a commit that referenced this issue Aug 16, 2021
Ensures pageview and recordview are tracked as expected.

Relates: #46
DanielSiepmann added a commit that referenced this issue Aug 16, 2021
Ensures pageview and recordview are tracked as expected.

Relates: #46
DanielSiepmann added a commit that referenced this issue Aug 16, 2021
Ensures pageview and recordview are tracked as expected.

Relates: #46
@DanielSiepmann DanielSiepmann added the funding This issue can be funded but won't make it otherwise label Nov 25, 2021
@DanielSiepmann
Owner Author

DanielSiepmann commented Nov 25, 2021

Already worked on this, see commits and branch feature/46-add-flags-feature (as well as a first attempt in branch feature/bot-support, which won't make it because the new branch takes a more flexible approach). Blocker right now: this will be a breaking change, and the data migration currently takes way too long.

Further work (funding) is needed to provide a smoother migration.

@jonaseberle

What do you think about requiring/suggesting https://github.com/JayBizzle/Crawler-Detect and making it available in the Expression Language as detectCrawler.isBot() or similar? They are pretty quick with adding new bots' user-agents.

Absolutely wonderful project by the way :) Thank you!

@DanielSiepmann
Owner Author

DanielSiepmann commented Nov 29, 2021

I've integrated https://packagist.org/packages/matomo/device-detector within the feature branch already.
There is currently no plan to add it to the expression language, as the concept will change.

All requests will be tracked but can have arbitrary tags. E.g. a feature flag "isBot:yes" or flag "botName:Google". See: https://github.com/DanielSiepmann/tracking/blob/feature/46-add-flags-feature/Documentation/Changelog/2.0.0.rst#features and https://github.com/DanielSiepmann/tracking/blob/feature/46-add-flags-feature/Documentation/Tags.rst

Widgets will be extended to allow filtering by tags. That way the extension is not limited to a single use case such as bots, but open for anything. Developers can add further extractors that extract tags from the request, and those tags will be attached as well. Integrators can then create fine-grained widgets, e.g. top bots, top pages by bots, etc.
Existing information such as operating systems is also moved to those tags via extractors.

Developers are also able to replace extractors, e.g. if you prefer another crawler library.
The current extractor allows adding further YAML files to the Matomo bot detection, e.g. if you have very specific bots from 3rd parties or your own.
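
To make the extractor concept a bit more concrete, here is a minimal sketch in Python. It is not the code from the feature branch and does not use the Matomo device detector: the `Extractor` protocol, `BotExtractor`, `OperatingSystemExtractor` and `collect_tags` are hypothetical names, and the matching is simple substring comparison for illustration only.

```python
from typing import Dict, Iterable, List, Protocol


class Extractor(Protocol):
    """Hypothetical interface: turn a user agent into tags (name -> value)."""

    def extract(self, user_agent: str) -> Dict[str, str]: ...


class BotExtractor:
    """Naive bot extractor; the list of bot names can be extended by the caller."""

    def __init__(self, bot_names: Iterable[str]) -> None:
        self.bot_names = list(bot_names)

    def extract(self, user_agent: str) -> Dict[str, str]:
        lowered = user_agent.lower()
        for name in self.bot_names:
            if name.lower() in lowered:
                return {"isBot": "yes", "botName": name}
        return {"isBot": "no"}


class OperatingSystemExtractor:
    """Naive operating system extractor based on substrings."""

    KNOWN = (("windows", "windows"), ("linux", "linux"), ("mac os", "macos"))

    def extract(self, user_agent: str) -> Dict[str, str]:
        lowered = user_agent.lower()
        for needle, os_name in self.KNOWN:
            if needle in lowered:
                return {"os": os_name}
        return {}


def collect_tags(user_agent: str, extractors: List[Extractor]) -> Dict[str, str]:
    """Run all extractors and merge their tags into one mapping."""
    tags: Dict[str, str] = {}
    for extractor in extractors:
        tags.update(extractor.extract(user_agent))
    return tags
```

Replacing an extractor then means passing a different object with the same `extract` method, and adding project-specific bots means passing extra names to the bot extractor, which mirrors adding further YAML files in spirit.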

The only issues left are a proper migration that doesn't take ages on large datasets, and proper documentation, especially on how to migrate the whole YAML setup.

@DanielSiepmann DanielSiepmann changed the title Add "bot" detection Add "bot" (flags) detection Jan 6, 2022
@DanielSiepmann
Owner Author

I worked on a command which migrates a configurable number of records on each run.
Still, that would leave the dashboard in a broken state until all data is migrated. Not sure if that is sufficient. Maybe there should be a transition phase where both ways are supported. That way each integrator is free to use the new feature or keep the old behaviour, and can already use the new one, turn on the migration and define a transition phase on their own.
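
A batched migration of that kind could look roughly like the following Python sketch using SQLite. The schema (`pageview`, `tag`, the `migrated` column), the function names and the naive bot check are all assumptions made for illustration; the real command and tables in the extension are different.

```python
import sqlite3


def create_schema(connection: sqlite3.Connection) -> None:
    """Create the hypothetical tables used by this sketch."""
    connection.execute(
        "CREATE TABLE pageview ("
        "id INTEGER PRIMARY KEY, user_agent TEXT, migrated INTEGER DEFAULT 0)"
    )
    connection.execute(
        "CREATE TABLE tag (pageview_id INTEGER, name TEXT, value TEXT)"
    )


def migrate_batch(connection: sqlite3.Connection, limit: int) -> int:
    """Migrate at most `limit` unmigrated records; return how many were handled."""
    rows = connection.execute(
        "SELECT id, user_agent FROM pageview "
        "WHERE migrated = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for record_id, user_agent in rows:
        # Placeholder rule; a real migration would run the configured extractors.
        is_bot = "yes" if "bot" in user_agent.lower() else "no"
        connection.execute(
            "INSERT INTO tag (pageview_id, name, value) VALUES (?, 'isBot', ?)",
            (record_id, is_bot),
        )
        # The original record is kept and only marked, so no data is lost.
        connection.execute(
            "UPDATE pageview SET migrated = 1 WHERE id = ?", (record_id,)
        )
    connection.commit()
    return len(rows)
```

Each run handles one batch, so a scheduler can call it repeatedly until it returns 0. Until then, records without tags remain in the table, which is exactly the intermediate dashboard state discussed here.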

On the other hand … it's up to everyone to stay on v1, and we could just release v2 with the migration path and new features. Maybe people give it a try and provide feedback if that approach doesn't work … we could then still provide a v2.x which is compatible with both and allows a smoother transition.

@jonaseberle

I am not sure I understand the problem.
Is the concern that "unprocessed" records would be visible in the Dashboard, just untagged, until the "extractors" have run? Then I would say this is absolutely no problem.

@DanielSiepmann
Owner Author

DanielSiepmann commented Jan 26, 2022

Yes, that's the "big" problem I see.

Furthermore, one has to adjust the Services.yml configuration. But that shouldn't be a big problem. The default shipped configuration will be adapted, and I'll add proper documentation for the migration.

Let's see when I find time to finish. I'll then use the new version on my own site for a while before I merge and release it.

@DanielSiepmann
Owner Author

We need to ensure that existing ignores are kept, e.g. #105
