explore activity tracking via metrics #151
Comments
Would the implementation you're considering enable the proxy class to report activity? I'm asking because I'd like to reference an issue from jupyterhub-idle-culler's README that tracks traefik-proxy's current inability to report network activity on its routes.
Yes, that's exactly what I'm thinking, and specifically for the mybinder.org case where I looked into this a bit today, and was able to collect the
So, to me, that mostly means:
The first would make the whole feature unavailable in the one place we want it. The upside of 2 is that prometheus tends to already be running where we want this. The downside is that these metrics don't really want to be public: exposing the metrics we need exposes usernames and URLs of currently active servers. That leads one to choose an authenticated prometheus instance, which e.g. mybinder.org doesn't have, unless we change how we name routers to be opaque (the names still need to be deterministic, but not reversible, so a hash function would work, though it would lead to very ugly metric labels). And even with opaque names, it reveals some (very coarse) data about individual server behavior, even if the individuals aren't immediately known (they might be deduced via other means). The other upside is we wouldn't need to handle deltas, since we can use
which means we need to implement storing measurements. If we were talking to prometheus instead of scraping prometheus metrics, we wouldn't need any state, and could do something much simpler. We could elect to run a dedicated prometheus instance just for these metrics. Not the full prometheus operator, but a dedicated instance with:
That's starting to complicate deployment quite a bit, I imagine, since that prometheus instance would still need to be configured to discover traefik. I'm not sure what the best path is right now. Since this is such a specific case (essentially only anonymous binderhub), I am starting to be inclined to implement this against a prometheus server, and say it only really exists for anonymous binderhub, where active URLs aren't meaningful or identifying (and are probably already in prometheus somewhere else).
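For reference, the opaque router naming mentioned above (deterministic, but not reversible) could look roughly like the sketch below. This is only an illustration: `opaque_router_name`, the `router-` prefix, and the use of a keyed HMAC rather than a plain hash are my assumptions, not anything traefik-proxy does today. The key is there so that names can't be recovered by simply hashing guessed usernames.

```python
import hashlib
import hmac


def opaque_router_name(routespec: str, secret: bytes, length: int = 16) -> str:
    """Derive a deterministic, non-reversible router name from a routespec.

    Hypothetical helper: the same routespec and secret always give the same
    name, but the name does not contain the username or URL.
    """
    digest = hmac.new(secret, routespec.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate to keep metric labels short; 16 hex chars is an arbitrary choice.
    return f"router-{digest[:length]}"


if __name__ == "__main__":
    secret = b"example-secret"  # assumption: some per-deployment secret
    print(opaque_router_name("/user/alice/", secret))
```

The proxy can always recompute the name from the routespec it already knows, so nothing needs to be reversed. The labels are just as ugly as predicted.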
This is really a special feature for anonymous BinderHub, so another option for this would be to implement this directly in the binderhub chart:
It makes a certain amount of sense to do here because we already have the api calls to talk to traefik and the configuration required, but all the trade-offs of enabling this only make sense when you're not launching jupyterhub-singleuser, which is exclusively anonymous binderhub in practice.
Proposed change
The only feature CHP has that traefik does not is network-level activity tracking. This isn't generally critical, as activity tracking is now published from the single-user server (this was added mainly to enable traefik in the first place).
The one situation where this is required is unauthenticated BinderHub, where single-user servers do not report their activity because they are not actually jupyterhub servers. The result is that BinderHub without auth cannot enable idle-culling with a traefik proxy, because all servers are always considered idle if they don't report any activity.
We don't have the same hooks for traefik that we do for CHP, but traefik does have metrics, which may provide good enough information.
Alternative options
Who would use this feature?
JupyterHub deployments that wish to use traefik with a default BinderHub (or any other alternative single-user server implementation that may not implement internal activity tracking). For example: mybinder.org.
(Optional): Suggest a solution
If we scrape a traefik metrics endpoint, e.g. prometheus, I believe we can get a low-resolution 'did anything happen' metric, which ought to be good enough. I think we can infer that if any of the router metrics for a given server have changed, there has been activity since the last check.
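A minimal sketch of the "did anything change" check, assuming we already have the text of a prometheus-format scrape from traefik. The helper names are hypothetical, the parsing is deliberately naive (it assumes label values contain no `}`), and I'm assuming `traefik_router_requests_total` with a `router` label is the counter we'd look at. Note that this approach needs state: the totals from the previous scrape have to be kept around.

```python
import re

# One sample line of the prometheus text format: name{labels} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_ROUTER = re.compile(r'router="([^"]*)"')


def router_totals(metrics_text, metric="traefik_router_requests_total"):
    """Sum a counter per router across all other labels (code, method, ...)."""
    totals = {}
    for line in metrics_text.splitlines():
        m = _SAMPLE.match(line)  # comment lines (# HELP, # TYPE) don't match
        if not m or m.group("name") != metric:
            continue
        r = _ROUTER.search(m.group("labels"))
        if r is None:
            continue
        router = r.group(1)
        totals[router] = totals.get(router, 0.0) + float(m.group("value"))
    return totals


def update_last_activity(previous, current, last_activity, now):
    """Mark a router active at `now` if its total changed since the last scrape.

    Routers seen for the first time are also marked active, so a new server
    is not considered idle before we have two measurements for it.
    """
    for router, total in current.items():
        if router not in previous or total != previous[router]:
            last_activity[router] = now
    return last_activity
```

Each check would call `router_totals` on a fresh scrape, pass the previous totals to `update_last_activity`, and then keep the new totals for next time. The resolution is only as good as the check interval, which is the low-resolution signal described above.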
This should be off by default, because it is only really useful in the BinderHub case (or similar), and may be costly.