-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sharedcache: introduce metrics #2760
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!!
Reviewable status: 0 of 7 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @joshimhoff)
metrics.go
line 32 at r1 (raw file):
type ThroughputMetric = base.ThroughputMetric // SharedCacheMetrics holds metrics for the persistent secondary cache
This code has been updated, we now use humanize.Count
and humanize.Bytes
instead of SI/IEC.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 44 at r1 (raw file):
Count int64 // The number of calls to ReadAt.
Since we only have the At version, we can just call them "reads" in the metrics. Ats
made me think of something else (atomic test-set).
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 55 at r1 (raw file):
ReadAtsWithPartialHit int64 // The number of calls to ReadAt where no data returned was read from the cache. ReadAtsWithFullMisses int64
[nit] maybe WithNoHit
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 64 at r1 (raw file):
// The latency of calls to get some data from the cache. GetLatency prometheus.Histogram // The latency of reads of a single cache block from disk.
I think we should keep it simple and just have DiskReadLatency and DiskWriteLatency. Not sure how much more useful it is to have Get and Put separately.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 68 at r1 (raw file):
// The latency of writing data to write back to the cache to a channel. // Generally should be low, but if the channel is full, could be high. QueuePutLatency prometheus.Histogram
I think this is hard to explain to a higher level user, and it's really an implementation artifact. I'd be in favor of removing it.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 82 at r1 (raw file):
totalReadAts atomic.Int64 multiShardReadAts atomic.Int64 multiBlockReadAts atomic.Int64
[nit] alignment
@joshimhoff do we need to merge this before we start running experiments? Let me know if you don't have time and you want me to take it over. |
Without the metrics, we will learn something, esp. since I only enabled the secondary cache on 1/3 nodes. And we will run this testing continuously anyway. I will try to get this merged this week, tho I am also secondary oncall. If you want, def feel free to take it over, but only if you actively want to -- I think it's okay if it takes a bit to get running on the demo cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL, @RaduBerinde!
Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
metrics.go
line 32 at r1 (raw file):
Previously, RaduBerinde wrote…
This code has been updated, we now use
humanize.Count
andhumanize.Bytes
instead of SI/IEC.
Updated.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 44 at r1 (raw file):
Previously, RaduBerinde wrote…
Since we only have the At version, we can just call them "reads" in the metrics.
Ats
made me think of something else (atomic test-set).
Done.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 55 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] maybe
WithNoHit
Done.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 64 at r1 (raw file):
Previously, RaduBerinde wrote…
I think we should keep it simple and just have DiskReadLatency and DiskWriteLatency. Not sure how much more useful it is to have Get and Put separately.
I vote we keep all this stuff for now, as I think comparing overall latency to the latency of a single disk read / write will provide a high level of confidence in the quality of the implementation. Eventually, I agree that we don't need tho. I'll add a TODO to reconsider the set of metrics before we release the secondary cache to users.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 68 at r1 (raw file):
Previously, RaduBerinde wrote…
I think this is hard to explain to a higher level user, and it's really an implementation artifact. I'd be in favor of removing it.
Same here. I vote we keep it for now, as if queueing to the channel takes time due to a full channel, we really want to know that. Again, I've added a TODO to reconsider the set of metrics before we release the secondary cache to users.
objstorage/objstorageprovider/sharedcache/shared_cache.go
line 82 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] alignment
Fixed.
5291490
to
675a691
Compare
@RaduBerinde, note for you as a reviewer: Few test failures to work thru still. I'll return to this in the afternoon. |
a15b29f
to
12f3d47
Compare
Tests are passing locally now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 11 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @joshimhoff)
metrics.go
line 35 at r3 (raw file):
// that caches commonly accessed blocks from blob storage on a local // file system. type SharedCacheMetrics = sharedcache.Metrics
Mind renaming to SecondaryCacheMetrics
? I have it on my TODO list to rename the sharedcache
package to secondarycache
(as part of the general renaming where "external" storage does not necessarily mean sharing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: 0 of 11 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @RaduBerinde)
metrics.go
line 35 at r3 (raw file):
Previously, RaduBerinde wrote…
Mind renaming to
SecondaryCacheMetrics
? I have it on my TODO list to rename thesharedcache
package tosecondarycache
(as part of the general renaming where "external" storage does not necessarily mean sharing).
Sure! Done.
This commit adds various metrics to the shared cache. In a follow up PR, I will export these metrics as time-series metrics from CRDB.
This commits exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow up PR is needed to export the bucketing scheme to CRDB. Release note: None.
This commits exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow up PR is needed to export the bucketing scheme to CRDB. Release note: None.
This commits exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow up PR is needed to export the bucketing scheme to CRDB. Release note: None.
110057: kvserver: unskip and deflake replicate queue dead non voters r=andrewbaptist a=kvoli TestReplicateQueueDeadNonVoters would flake periodically due to racing between the snapshot and replicate queue. Add a filter to queue processing in order to only process the scratch range. This speeds up the test and prevents timing conditions between the two queues under stress. The test remains skipped under stress-race due to its size (5 nodes) causing timeouts. Resolves: #76996 Epic: none Release note: None 110818: kvserver: export secondary cache counter metrics r=joshimhoff a=joshimhoff Fixes #110899. **kvserver: export secondary cache counter metrics** This commits exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow up PR is needed to export the bucketing scheme to CRDB. Release note: None. 111039: sql: fix flaky TestHotRangesStats r=j82w a=j82w The test failed with timeouts under stress. The test is switched to use the SucceedsSoon which allows for more time when under stress. Fixes: #111036 Release note: None Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Josh Imhoff <[email protected]> Co-authored-by: j82w <[email protected]>
Fixes #2687.
sharedcache: introduce metrics
This commit adds various metrics to the shared cache. In a follow up PR, I will export these metrics as time-series metrics from CRDB.