Collect simple, anonymous, constructive analytics #4

Open
niloc132 opened this issue Mar 4, 2023 · 2 comments

Comments

@niloc132
Member

niloc132 commented Mar 4, 2023

When configured as documented, logs are collected by nginx-proxy, with the standard NCSA log details plus the hostname of the vhost (real log line from production, with some specifics redacted):

nginx.1    | www.gwtproject.org 1.2.3.4 - - [04/Mar/2023:14:00:00 +0000] "GET /javadoc/latest/com/google/gwt/core/ext/Linker.html HTTP/1.1" 200 25683 "-" "User agent string logged here"

To better serve the GWT project itself and respect the privacy of users, we don't need all of this information, but we should still collect at least some parts of it. In the interest of transparency, we probably want to publish at least coarse details of the project, so that we know what resources are being used, the volume of requests, where we're seeing 404s or redirects, etc. Optimizing for traffic isn't the goal; these analytics should be used for spotting bugs, ensuring resources are not abused, curiosity, etc.

Breaking down the example log line from above, and considering what would be helpful and how:

  • nginx.1 | - this prefix is added by the forego process. For better or worse, it includes terminal control characters that add color, and these need to be accounted for. The nginx-proxy container also logs docker-gen details when changes take place that require the nginx configuration to be updated and nginx restarted. Should be filtered to only include nginx contents.
  • www.gwtproject.org - the Host/:authority headers in the request, indicating which host has a request coming to it. Should be filtered to only include gwtproject domain names, and eventual output should be grouped by the same domain name.
  • 1.2.3.4 - client's IP address. We should probably remove this entirely, I can't think of a purpose it would serve, except to link pages loaded within a session by a user, to see how they explore, or what page was linking to a 404 resource. I'm inclined to err on the side of skipping it entirely until we have a better reason to keep it and sufficiently anonymize it.
  • - - RFC 1413 user identity; can be ignored since we won't collect this
  • - - user id of the client; can be ignored since we won't collect this
  • [04/Mar/2023:14:00:00 +0000] timestamp of the request. My first thought is that we likely only need to bucket these into (for example) 1/5/60 minute intervals rather than publish/record exact timestamps.
  • "GET /javadoc/latest/com/google/gwt/core/ext/Linker.html HTTP/1.1" - the normalized request line, indicating what http method, what path was requested, and what http version. This is all probably important to keep.
  • 200 - the http status code of the response. Important to keep.
  • 25683 - size in bytes of the response body. In our case, with no dynamic resources, this probably is not very informative, as the path itself should make it clear what the size was, but it also doesn't seem to have a downside to provide (and lets any analysis avoid actually needing to join against the current size of all files)
  • "-" - the referrer to this resource. Browsers are increasingly strict about when this is sent, but we may still want to filter so that only other gwtproject domains are listed, so that we can work backwards to find out where 404s or the like are coming from. In theory there could be value in seeing what is linking to our docs, in practice many browsers already omit this, and it is easy to "game".
  • "User agent string logged here" - user agent string (omitted here, obviously). It might be constructive to either flag or filter out "bot" user agents, not because we imagine that this will make our measurements actually accurate, but at least to avoid confusing "poor bot behavior" with "actual issues that users are encountering". It might also be good to strip this out entirely from the final output, or to null out this value unless the user-agent string is (for example) in the top 5% of observed strings. Very unscientific analysis:
    • For reference, of the last 100000 hits (about 26 hrs of data) to gwtproject, there were 1201 unique user agent strings
    • 625 calls (27th most popular) had no user agent defined
    • Of the top 10, three were explicitly bots, two were different Java 17 versions (Eclipse plugin updates, DTD checks, and terms.html), three were on Windows, one on Mac, and one on Linux.
    • I put very little stock in this, as the 11th most popular was IE8 on Windows 7.
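
The field-by-field breakdown above could be sketched roughly as follows. This is only an illustration, not the deployed configuration: the regular expressions, the `reduce_line` name, and the exact form of the color escape sequences are assumptions, and the user agent is passed through untouched so it can be handled in a later step.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# ANSI color sequences such as "\x1b[32m" (assumed form of the forego coloring).
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")

# Assumed layout, taken from the sample line:
# <process> | <vhost> <ip> <ident> <user> [<time>] "<request>" <status> <bytes> "<referrer>" "<agent>"
LOG_LINE = re.compile(
    r'^(?P<process>\S+)\s*\|\s*'
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def is_gwtproject_host(host):
    host = host.lower()
    return host == "gwtproject.org" or host.endswith(".gwtproject.org")


def internal_referrer(referrer):
    """Keep the referrer only when it points at a gwtproject domain."""
    try:
        host = urlsplit(referrer).hostname
    except ValueError:
        return None
    return referrer if host and is_gwtproject_host(host) else None


def reduce_line(raw):
    """Return a reduced record for one log line, or None if it is filtered out."""
    line = ANSI_COLOR.sub("", raw).strip()
    match = LOG_LINE.match(line)
    # Drops docker-gen output and anything that is not an access-log line.
    if match is None or not match["process"].startswith("nginx"):
        return None
    if not is_gwtproject_host(match["host"]):
        return None
    parts = match["request"].split(" ")
    if len(parts) != 3:
        return None
    method, path, protocol = parts
    size = match["size"]
    # The client IP and the two identity fields are deliberately not copied.
    return {
        "host": match["host"].lower(),
        "time": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(match["status"]),
        "size": int(size) if size.isdigit() else None,
        "referrer": internal_referrer(match["referrer"]),
        "user_agent": match["user_agent"],
    }
```

Lines that do not look like nginx access-log entries, or that are for other hosts, come back as `None`, so the caller can simply skip them.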

The first step is probably to replace the NCSA log directive currently in use with something more specific (removing fields we don't want to use), then putting some filtering/batching/bucketing downstream, then publishing results.
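
As a rough sketch of the downstream filtering and publishing steps, assuming records have already been reduced to dictionaries with `host`, `bucket`, `path`, `status`, and `user_agent` keys (these names, the CSV output, and the reading of "top 5%" as the most frequent 5% of distinct strings are all illustrative assumptions, not decisions made in this issue):

```python
import csv
import io
from collections import Counter


def null_rare_user_agents(records, fraction=0.05):
    """Blank out user agents that are not among the most common strings.

    `records` must be a list, since it is walked twice.
    """
    ua_counts = Counter(r["user_agent"] for r in records)
    keep_n = max(1, int(len(ua_counts) * fraction))
    common = {ua for ua, _ in ua_counts.most_common(keep_n)}
    return [
        {**r, "user_agent": r["user_agent"] if r["user_agent"] in common else None}
        for r in records
    ]


def aggregate(records):
    """Count requests per (host, time bucket, path, status)."""
    counts = Counter()
    for r in records:
        counts[(r["host"], r["bucket"], r["path"], r["status"])] += 1
    return counts


def to_csv(counts):
    """Render the counts as CSV, sorted so rows for one host stay together."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["host", "bucket", "path", "status", "requests"])
    for key in sorted(counts):
        writer.writerow([*key, counts[key]])
    return out.getvalue()
```

Publishing only counts per bucket, rather than individual rows, means no single request can be picked out of the output.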

@tbroyer

tbroyer commented Mar 5, 2023

Wrt the user agent, maybe you could log the Sec-CH-UA and Sec-CH-UA-Platform headers if present, and fall back to User-Agent otherwise. (I have no idea what one can do with nginx; I see there are modules/plugins that let you extract values from the UA, so you can possibly log only parts of the UA to make it easier to process the logs later, e.g. with the nth column being the Sec-CH-UA or equivalent, and directly Googlebot, Yandex for bots, Java, etc.)
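
A minimal sketch of that fallback, assuming the three header values have already been captured somewhere (for example as extra log columns) and are passed in as plain strings. The function name, the bot list, and the heuristic for skipping the deliberately meaningless "Not A Brand" style entries are illustrative assumptions:

```python
import re

# Entries in Sec-CH-UA look like: "Google Chrome";v="110"
BRAND = re.compile(r'"([^"]+)";\s*v="(\d+)"')

# Hypothetical short list of bot markers; a real list would be longer.
BOT_MARKERS = ("Googlebot", "Yandex", "bingbot")


def coarse_user_agent(sec_ch_ua, sec_ch_ua_platform, user_agent):
    """Reduce the user agent headers to a short, coarse label."""
    if sec_ch_ua:
        # Skip placeholder brands such as "Not A(Brand" (heuristic).
        brands = [(n, v) for n, v in BRAND.findall(sec_ch_ua) if "Brand" not in n]
        # Prefer a specific browser brand over the generic "Chromium" entry.
        specific = [b for b in brands if b[0] != "Chromium"]
        chosen = (specific or brands or [None])[0]
        if chosen is not None:
            label = f"{chosen[0]} {chosen[1]}"
            platform = (sec_ch_ua_platform or "").strip('"')
            return f"{label} / {platform}" if platform else label
    # No usable client hints: fall back to the User-Agent string.
    ua = user_agent or ""
    if not ua or ua == "-":
        return "none"
    for marker in BOT_MARKERS:
        if marker.lower() in ua.lower():
            return marker
    if ua.startswith("Java/"):
        return "Java"
    return "other"
```

For example, a Chrome request carrying client hints collapses to something like `Google Chrome 110 / Windows`, while a crawler without them is reported only by its bot name.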

Wrt the timestamp, bucketing by hour should be enough. If you want to detect spikes in traffic (possible DDoS or whatever), sure, go do it (bucket by the minute or in 5-minute intervals), but public analytics don't need to be more precise than by the hour IMO.
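
Bucketing could look something like this, with the interval as a parameter so the same helper covers hourly public output and finer internal checks. The `bucket` name and the choice to normalize to UTC are assumptions made for the sketch:

```python
from datetime import datetime, timedelta, timezone


def bucket(ts, minutes=60):
    """Round a timezone-aware timestamp down to the start of its interval (UTC)."""
    if minutes <= 0 or 1440 % minutes != 0:
        raise ValueError("minutes must evenly divide a day")
    ts = ts.astimezone(timezone.utc)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds() // 60)
    return midnight + timedelta(minutes=elapsed - elapsed % minutes)


# Timestamp in the same format nginx writes to the access log.
ts = datetime.strptime("04/Mar/2023:14:37:12 +0000", "%d/%b/%Y:%H:%M:%S %z")
hourly = bucket(ts)        # 14:00
five_min = bucket(ts, 5)   # 14:35
```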

@niloc132
Member Author

niloc132 commented Mar 5, 2023

Thanks, I had momentarily forgotten about the user agent change. Using the new user agent headers seems like the kind of thing nginx itself will support Real Soon Now - and in the meantime, if I'm not mistaken, we merely lose the ability to see which newer version of Chrome is running, given the frozen string?

This explains why I am seeing

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36

as the single most popular user agent string, more popular than the next two entries combined.

Timestamps: I'm relatively unconcerned about using analytics for any real-time purpose; we already have other uptime monitoring, and will soon have backup options for hosting. Worst case, this repo makes it very easy to change to another deployment option, at least once DNS is moved.
