Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consumption_metrics: very old cached values can be sent because of migrations #9032

Open
koivunej opened this issue Sep 18, 2024 · 6 comments
Assignees
Labels
a/consumption_metrics c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@koivunej
Copy link
Member

koivunej commented Sep 18, 2024

Internal slack thread: https://neondb.slack.com/archives/C061CPK7UQL/p1726556769067039?thread_ts=1726493755.104459&cid=C061CPK7UQL

Symptom:

Sent/exported synthetic sizes:

  1. synthetic_size=500GB at t=0
  2. synthetic_size=1000GB at t=10min
  3. synthetic_size=500GB at t=20min

More broadly, what happened:

  1. synthetic_size=1000GB was reported from pageserver-12.us-east-2
  2. as part of deploy, we migrated tenant shard zero to pageserver-11.us-east-2, no cached value
  3. synthetic_size=999GB from pageserver-11.us-east-2
  4. (a week passes)
  5. synthetic_size=500GB from pageserver-11.us-east-2
  6. next deploy, we migrated tenant shard zero back to pageserver-12.us-east-2, cached value was found!
  7. synthetic_size=1000GB from pageserver-12.us-east-2 (assumed cached)
  8. synthetic_size=500GB from pageserver-12.us-east-2 freshly calculated

How to fix it:

  • only use the cached value if it is more fresh than 1h?
  • remove the cached value on attachedsingle -> secondary transition?
@koivunej koivunej added a/consumption_metrics c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Sep 18, 2024
@koivunej koivunej self-assigned this Sep 20, 2024
@koivunej
Copy link
Member Author

koivunej commented Sep 30, 2024

Discussed in meeting: Generation number increasing by at most 1 to reuse cached values is the best filter.

No idea right now: how to store generations in the cache? Cache might not be leaked outside, so perhaps that is easy to do.

@jcsp jcsp assigned erikgrinaker and skyzh and unassigned koivunej and erikgrinaker Oct 8, 2024
@jcsp
Copy link
Collaborator

jcsp commented Oct 8, 2024

@skyzh can you take a look at this one when you're back in the office?

@skyzh
Copy link
Member

skyzh commented Oct 18, 2024

I think storing generation would make sense, and I would like to have a new format for consumption_metrics.json (I assume nobody else except pageserver is using this json file).

  • If root["version"] = "v2", decode using the new format, otherwise, decode using v1
  • Use JSON to encode each of the metrics key, i.e.,
struct NewRawMetric {
  key: MetricsKey,
  event_type: EventType,
  generation: Generation,
  timestamp: u64,
  value: u64,
}

I'm refactoring the storing/restoring code right now...

@skyzh
Copy link
Member

skyzh commented Oct 18, 2024

... it seems that for upload_metrics_bucket and upload_metrics_http, we will need to keep the original message format

@skyzh
Copy link
Member

skyzh commented Oct 18, 2024

otherwise, I think I'll have a PR to migrate to the new format, and another patch to implement the ">= generation => do not report" logic

skyzh added a commit that referenced this issue Oct 30, 2024
… cache (#9470)

In #9032, I would like to
eventually add a `generation` field to the consumption metrics cache.
The current encoding is not backward compatible and it is hard to add
another field into the cache. Therefore, this patch refactors the format
to store "field -> value", and it's easier to maintain backward/forward
compatibility with this new format.

## Summary of changes

* Add `NewRawMetric` as the new format.
* Add upgrade path. When opening the disk cache, the codepath first
inspects the `version` field, and decide how to decode.
* Refactor metrics generation code and tests.
* Add tests on upgrade / compatibility with the old format.

---------

Signed-off-by: Alex Chi Z <[email protected]>
@skyzh
Copy link
Member

skyzh commented Nov 4, 2024

this week: add generation number to the disk cache and the logic to invalidate cache based on generation number

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/consumption_metrics c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants