Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sharedcache: introduce metrics #2760

Merged
merged 1 commit into from
Aug 23, 2023
Merged

Conversation

joshimhoff
Copy link
Contributor

Fixes #2687.

sharedcache: introduce metrics

This commit adds various metrics to the shared cache. In a follow up PR, I will export these metrics as time-series metrics from CRDB.

@joshimhoff joshimhoff requested review from a team and RaduBerinde July 20, 2023 13:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@joshimhoff joshimhoff requested a review from itsbilal July 20, 2023 13:25
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!!

Reviewable status: 0 of 7 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @joshimhoff)


metrics.go line 32 at r1 (raw file):

type ThroughputMetric = base.ThroughputMetric

// SharedCacheMetrics holds metrics for the persistent secondary cache

This code has been updated, we now use humanize.Count and humanize.Bytes instead of SI/IEC.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 44 at r1 (raw file):

	Count int64

	// The number of calls to ReadAt.

Since we only have the At version, we can just call them "reads" in the metrics. Ats made me think of something else (atomic test-set).


objstorage/objstorageprovider/sharedcache/shared_cache.go line 55 at r1 (raw file):

	ReadAtsWithPartialHit int64
	// The number of calls to ReadAt where no data returned was read from the cache.
	ReadAtsWithFullMisses int64

[nit] maybe WithNoHit


objstorage/objstorageprovider/sharedcache/shared_cache.go line 64 at r1 (raw file):

	// The latency of calls to get some data from the cache.
	GetLatency       prometheus.Histogram
	// The latency of reads of a single cache block from disk.

I think we should keep it simple and just have DiskReadLatency and DiskWriteLatency. Not sure how much more useful it is to have Get and Put separately.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 68 at r1 (raw file):

	// The latency of writing data to write back to the cache to a channel.
	// Generally should be low, but if the channel is full, could be high.
	QueuePutLatency  prometheus.Histogram

I think this is hard to explain to a higher level user, and it's really an implementation artifact. I'd be in favor of removing it.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 82 at r1 (raw file):

	totalReadAts      atomic.Int64
	multiShardReadAts atomic.Int64
  multiBlockReadAts atomic.Int64

[nit] alignment

@RaduBerinde
Copy link
Member

@joshimhoff do we need to merge this before we start running experiments? Let me know if you don't have time and you want me to take it over.

@joshimhoff
Copy link
Contributor Author

Without the metrics, we will learn something, esp. since I only enabled the secondary cache on 1/3 nodes. And we will run this testing continuously anyway.

I will try to get this merged this week, tho I am also secondary oncall. If you want, def feel free to take it over, but only if you actively want to -- I think it's okay if it takes a bit to get running on the demo cluster.

Copy link
Contributor Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL, @RaduBerinde!

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


metrics.go line 32 at r1 (raw file):

Previously, RaduBerinde wrote…

This code has been updated, we now use humanize.Count and humanize.Bytes instead of SI/IEC.

Updated.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 44 at r1 (raw file):

Previously, RaduBerinde wrote…

Since we only have the At version, we can just call them "reads" in the metrics. Ats made me think of something else (atomic test-set).

Done.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 55 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] maybe WithNoHit

Done.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 64 at r1 (raw file):

Previously, RaduBerinde wrote…

I think we should keep it simple and just have DiskReadLatency and DiskWriteLatency. Not sure how much more useful it is to have Get and Put separately.

I vote we keep all this stuff for now, as I think comparing overall latency to the latency of a single disk read / write will provide a high level of confidence in the quality of the implementation. Eventually, I agree that we don't need tho. I'll add a TODO to reconsider the set of metrics before we release the secondary cache to users.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 68 at r1 (raw file):

Previously, RaduBerinde wrote…

I think this is hard to explain to a higher level user, and it's really an implementation artifact. I'd be in favor of removing it.

Same here. I vote we keep it for now, as if queueing to the channel takes time due to a full channel, we really want to know that. Again, I've added a TODO to reconsider the set of metrics before we release the secondary cache to users.


objstorage/objstorageprovider/sharedcache/shared_cache.go line 82 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] alignment

Fixed.

@joshimhoff joshimhoff force-pushed the metrics branch 2 times, most recently from 5291490 to 675a691 Compare August 23, 2023 13:53
@joshimhoff
Copy link
Contributor Author

@RaduBerinde, note for you as a reviewer: Few test failures to work thru still. I'll return to this in the afternoon.

@joshimhoff joshimhoff force-pushed the metrics branch 3 times, most recently from a15b29f to 12f3d47 Compare August 23, 2023 17:50
@joshimhoff
Copy link
Contributor Author

Tests are passing locally now.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 11 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @joshimhoff)


metrics.go line 35 at r3 (raw file):

// that caches commonly accessed blocks from blob storage on a local
// file system.
type SharedCacheMetrics = sharedcache.Metrics

Mind renaming to SecondaryCacheMetrics? I have it on my TODO list to rename the sharedcache package to secondarycache (as part of the general renaming where "external" storage does not necessarily mean sharing).

Copy link
Contributor Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 11 files reviewed, 2 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


metrics.go line 35 at r3 (raw file):

Previously, RaduBerinde wrote…

Mind renaming to SecondaryCacheMetrics? I have it on my TODO list to rename the sharedcache package to secondarycache (as part of the general renaming where "external" storage does not necessarily mean sharing).

Sure! Done.

This commit adds various metrics to the shared cache. In a follow up PR, I will
export these metrics as time-series metrics from CRDB.
@joshimhoff joshimhoff merged commit 8638640 into cockroachdb:master Aug 23, 2023
10 checks passed
joshimhoff added a commit to joshimhoff/cockroach that referenced this pull request Sep 18, 2023
This commits exports counter metrics regarding the secondary cache added at
cockroachdb/pebble#2760. This commit doesn't
export the histogram metrics added in that PR. While working on this one, I
have realized a follow up PR is needed to export the bucketing scheme to CRDB.

Release note: None.
joshimhoff added a commit to joshimhoff/cockroach that referenced this pull request Sep 19, 2023
This commits exports counter metrics regarding the secondary cache added at
cockroachdb/pebble#2760. This commit doesn't
export the histogram metrics added in that PR. While working on this one, I
have realized a follow up PR is needed to export the bucketing scheme to CRDB.

Release note: None.
joshimhoff added a commit to joshimhoff/cockroach that referenced this pull request Sep 21, 2023
This commits exports counter metrics regarding the secondary cache added at
cockroachdb/pebble#2760. This commit doesn't
export the histogram metrics added in that PR. While working on this one, I
have realized a follow up PR is needed to export the bucketing scheme to CRDB.

Release note: None.
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Sep 21, 2023
110057: kvserver: unskip and deflake replicate queue dead non voters r=andrewbaptist a=kvoli

TestReplicateQueueDeadNonVoters would flake periodically due to racing
between the snapshot and replicate queue.

Add a filter to queue processing in order to only process the scratch
range. This speeds up the test and prevents timing conditions between
the two queues under stress. The test remains skipped under stress-race
due to its size (5 nodes) causing timeouts.

Resolves: #76996
Epic: none
Release note: None

110818: kvserver: export secondary cache counter metrics r=joshimhoff a=joshimhoff

Fixes #110899.

**kvserver: export secondary cache counter metrics**

This commits exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow up PR is needed to export the bucketing scheme to CRDB.

Release note: None.

111039: sql: fix flaky TestHotRangesStats r=j82w a=j82w

The test failed with timeouts under stress. The test is switched to use the SucceedsSoon which allows for more time when under stress.

Fixes: #111036

Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: j82w <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

objstorage: shared cache metrics
3 participants