Make telemetry opt-out #715

Open
astrojuanlu opened this issue Jun 5, 2024 · 8 comments


@astrojuanlu
Member

Why do we even want telemetry?

The Kedro team uses telemetry to understand product usage and make data-informed decisions that benefit our users. For example, we were able to determine that certain CLI subcommands had very little usage (kedro-org/kedro#1293, kedro-org/kedro#3750). The alternatives would have been:

  • Conduct expensive user interviews (that we need to do anyway for more complex issues, hence exhausting our user base), or
  • Keep those commands around, bloating the project.

Therefore having telemetry is a low-cost way for us to keep improving Kedro for everyone.

What is wrong with the current telemetry collection process?

At the moment the telemetry collection has two layers of opt-in:

  1. kedro-telemetry needs to be installed. At the moment this happens because we're introducing it in the requirements of our starters:

https://github.com/kedro-org/kedro/blob/bf536d4029d94bb318848b150be78d23fca44fb4/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/requirements.txt#L1-L9

However, this significantly skews the data we have. We have anecdotal evidence that many people don't even know what kedro new is, and teams often have their own templates, starters, and ways of working.

In addition, this blocks progress on relocating the optional dependencies of our starters kedro-org/kedro#2519.

  2. The moment kedro-telemetry is installed and certain (not all) Kedro commands are run for the first time, a blocking prompt asks the user whether they opt in to telemetry or not. This prompt causes lots of problems in different environments (see "Improve running kedro as part of an automated workflow (CI/CD)", kedro#1640). Current workarounds, like creating a .telemetry file ahead of time, are finicky because sometimes it's not obvious what the working directory of the commands is. In these cases we have just told users to pip uninstall kedro-telemetry and go on with their lives, hence losing that information from our side.
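For reference, the ".telemetry file ahead of time" workaround could be scripted roughly as follows. This is only a sketch: `write_telemetry_opt_out` is a hypothetical helper, not part of Kedro or kedro-telemetry, and the `consent: false` contents are taken from the opt-out message proposed later in this issue. Passing the project directory explicitly avoids guessing which working directory a later command will use.

```python
from pathlib import Path


def write_telemetry_opt_out(project_dir) -> Path:
    """Create a .telemetry file that records an opt-out (hypothetical helper).

    The directory is resolved to an absolute path so the result does not
    depend on the current working directory of whoever calls this.
    """
    target = Path(project_dir).resolve() / ".telemetry"
    target.write_text("consent: false\n", encoding="utf-8")
    return target
```

In a CI job this would run once before any Kedro command, pointed at the directory the commands will actually run in.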

Effectively, the presence of this telemetry collection mechanism is both giving us biased data and actively preventing our users from doing their work.

Something has to change.

What do we want to change?

After exploring adjacent libraries and projects as part of #510 (comment), we observed that all of them have an opt-out telemetry collection mechanism.

Therefore, we want to converge with the rest of the ecosystem and make Kedro telemetry opt-out as well.

[Waves arms angrily]

We get it. Some developers and users don't like the idea of opt-out telemetry. Defaults matter.

And yet, if we fail to collect such telemetry, we fail to fulfill our goal of continuing to improve Kedro in a cost-effective way, and all Kedro users are negatively impacted as a result.

As such, we have taken measures in the past few months to reduce the amount of data we collect:

This is reflected in our telemetry collection policy https://docs.kedro.org/en/0.19.6/configuration/telemetry.html and we still fully stand by it:

This data is collected with the sole purpose of improving Kedro by understanding feature usage. Importantly, we do not store personal information about you or sensitive data from your project, and this process is never utilized for marketing or promotional purposes. Participation in this program is optional, and Kedro will continue working as normal if you opt-out.

Therefore we're committed to storing the minimal amount of information possible, ensuring none of it is personal information (not even IP addresses), making Kedro work exactly the same without it, and offering even more ways to opt out.

We also felt we had to write all this down for full transparency with our users.

What's next?

We are looking into ways to make telemetry collection opt-out, meaning it will be enabled by default for all Kedro Framework projects.

This means that, ideally, anyone who does pip install kedro and performs a kedro run ought to see a message like this:

Kedro is sending anonymous usage data with the sole purpose of improving the product. No personal data or IP addresses are stored on our side. 
If you want to opt out, set the `KEDRO_DISABLE_TELEMETRY` or `DO_NOT_TRACK` environment variables, or create a  `.telemetry` file in the current working directory with the contents `consent: false`.
Read more at https://docs.kedro.org/en/latest/configuration/telemetry.html

Notice the addition of the KEDRO_DISABLE_TELEMETRY and DO_NOT_TRACK environment variables.
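To make the proposed opt-out rules concrete, here is a minimal sketch of how such a check might look. It is not the kedro-telemetry implementation: the function names are hypothetical, the rule that env values such as "0" or "false" do not count as an opt-out is an assumption, and so is the choice to let the environment variables take precedence over the `.telemetry` file.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Variable names come from the proposed message above.
OPT_OUT_VARS = ("KEDRO_DISABLE_TELEMETRY", "DO_NOT_TRACK")
# Assumption: these values are treated as "not set".
_FALSY = {"", "0", "false", "no"}


def _env_opts_out(env: Mapping[str, str]) -> bool:
    """True if any opt-out variable holds a truthy value."""
    return any(
        env.get(name, "").strip().lower() not in _FALSY for name in OPT_OUT_VARS
    )


def _file_consent(project_dir: Path) -> Optional[bool]:
    """Read `consent: <bool>` from .telemetry; None if absent or unreadable."""
    path = project_dir / ".telemetry"
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "consent":
            return value.strip().lower() == "true"
    return None


def telemetry_enabled(env=None, project_dir=None) -> bool:
    """Opt-out semantics: enabled unless the user has said otherwise."""
    env = os.environ if env is None else env
    project_dir = Path.cwd() if project_dir is None else Path(project_dir)
    if _env_opts_out(env):
        return False
    consent = _file_consent(project_dir)
    return True if consent is None else consent
```

With no variables set and no `.telemetry` file, `telemetry_enabled()` returns `True`, which is the behavioural change this issue proposes.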

@tynandebold
Member

tynandebold commented Jun 5, 2024

Great write up, and it's the decision I hoped for a very long time the team would come to! Excited to see this (hopefully) move forward.

@antonymilne
Contributor

Fully agree with @tynandebold - very happy to see this, and it's a great writeup 👍

Just to understand, the move from hashing username/project name to UUIDs is because it's even more anonymous? Since, in theory anyway, one could brute-force hashed data to reverse engineer the unhashed username?

@astrojuanlu
Member Author

Just to understand, the move from hashing username/project name to UUIDs is because it's even more anonymous? Since, in theory anyway, one could brute-force hashed data to reverse engineer the unhashed username?

Correct. Our hash function is public and fast, so potentially the data is vulnerable to dictionary attacks.
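A small illustration of why a public, fast hash is weak here. This is a sketch only: SHA-256 stands in for whatever hash function is actually used, and the usernames and candidate list are made up. Anyone holding a hashed identifier can hash guesses until one matches, whereas a random UUID is not derived from the username at all.

```python
import hashlib
import uuid
from typing import Iterable, Optional


def hash_username(username: str) -> str:
    """Stand-in for a public, fast, unsalted hash (assumption: SHA-256)."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


def dictionary_attack(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate whose hash matches, or None if none does."""
    for candidate in candidates:
        if hash_username(candidate) == target_hash:
            return candidate
    return None


# A hashed identifier can be reversed if the username is guessable.
leaked = hash_username("alice")
recovered = dictionary_attack(leaked, ["bob", "alice", "carol"])

# A random UUID carries no information about the username to recover.
random_id = uuid.uuid4().hex
```

Here `recovered` ends up as `"alice"`, while no list of guessed usernames helps against `random_id`.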

@DimedS
Contributor

DimedS commented Jun 6, 2024

Great summary and proposal, @astrojuanlu! I fully agree that telemetry should be integrated into Kedro, rather than being a standalone library. Additionally, I want to point out that telemetry is currently not working with the kedro new command. I believe the new consent system should address this issue, when commands are executed outside of a Kedro project folder.

@merelcht
Member

Notice the addition of the KEDRO_DISABLE_TELEMETRY and DO_NOT_TRACK environment variables

What's the purpose of each of these environment variables? Can't we just add one?

@astrojuanlu
Member Author

DO_NOT_TRACK is an attempt at standardising a single environment variable name that many tools could adopt, so that end users don't have to opt out of each and every one of them: https://consoledonottrack.com/. It's inspired by the W3C Tracking Preference Expression: https://www.w3.org/TR/tracking-dnt/

On the other hand, there might be people who care specifically about Kedro telemetry, or who prefer the explicit setting for whatever reason, hence KEDRO_DISABLE_TELEMETRY is also needed.

Inspiration:

This is totally optional, completely anonymous, and can be disabled via CLI (turbo telemetry disable), or multiple environment variables (DO_NOT_TRACK=1 or TURBO_TELEMETRY_DISABLED=1).

Either TILT_DISABLE_ANALYTICS or DO_NOT_TRACK.

Either DO_NOT_TRACK=1 or a custom --disable-telemetry flag.

@merelcht
Member

Is there anything left to do for this ticket? @astrojuanlu @DimedS

@DimedS
Contributor

DimedS commented Aug 23, 2024

I believe the viz telemetry update mentioned in the ticket above is currently in viz's inbox. Additionally, if time permits, we should consider adding more integration tests: #770. Also, this telemetry-related ticket is quite important: #794
