OOM while running LDR under heavy load with stats disabled #134445
Labels
A-disaster-recovery
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
P-3
Issues/test failures with no fix SLA
T-disaster-recovery
We observed OOM events twice on SHA v24.3.0-alpha.3-dev-b908a2f2bc3af2b1529b58e8242a37cfa6f6c1ca. These OOMs were not observed on the 24.3 release branch, only on master.
cockroach-health.glenn-ldr-east-0003.ubuntu.2024-10-31T23_59_12Z.009606.log.zip
On the first event, we have a 5-second time-limited heap profile. The net heap growth over the window is small, but we can see ~40GB allocated and freed within that window (roughly 8GB/s of allocation churn). That's a lot.
pprof_1730431585.out.zip
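To make the distinction in the first event concrete (small net heap growth, large allocation volume), here is a minimal Go sketch that measures both over a window using runtime.ReadMemStats. It is not how the attached profile was collected; measureChurn, the churn type, and the synthetic workload are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// churn summarizes heap activity over a measurement window.
// Hypothetical helper type, not part of CockroachDB.
type churn struct {
	AllocatedBytes uint64 // cumulative bytes allocated in the window (TotalAlloc delta)
	NetHeapBytes   int64  // change in allocated heap bytes (HeapAlloc delta)
}

// measureChurn runs work, forces a GC so that garbage is collected,
// and reports how much was allocated versus how much heap growth remains.
func measureChurn(work func()) churn {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	work()
	runtime.GC()
	runtime.ReadMemStats(&after)
	return churn{
		AllocatedBytes: after.TotalAlloc - before.TotalAlloc,
		NetHeapBytes:   int64(after.HeapAlloc) - int64(before.HeapAlloc),
	}
}

// sink forces the allocations below onto the heap.
var sink []byte

func main() {
	c := measureChurn(func() {
		for i := 0; i < 1000; i++ {
			// Each 1 MiB buffer becomes garbage on the next iteration.
			sink = make([]byte, 1<<20)
		}
		sink = nil
	})
	fmt.Printf("allocated=%d MiB, net heap change=%d MiB\n",
		c.AllocatedBytes>>20, c.NetHeapBytes>>20)
}
```

A large AllocatedBytes with a small NetHeapBytes is the same shape as the first event: high allocation throughput rather than retained memory.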
cockroach-health.glenn-ldr-east-0003.ubuntu.2024-10-30T22_05_07Z.009606.log.zip
On the second event, we have a full heap profile. 10GB are accounted for, but we can tell from runtime stats that roughly 20GB more are not. This suggests either objects that are unreachable but have not yet been garbage collected, or extreme sampling effects in the heap profile.
cockroach-health.log.zip
profile-full.out.zip
The various profiles don't appear to show memory leaks or other unusual usage. Rather, we suspect that general memory management for multiple concurrent full-table scans regressed between the 24.3 branch cut and the above SHA.
Jira issue: CRDB-44081