Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-47948: Preload dataset type cache for Butler server #1125

Merged
merged 10 commits into from
Dec 11, 2024
Merged

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Dec 5, 2024

We now preload the dataset type cache the first time a repository is accessed in Butler server. This is an optimization to avoid needing to go to the database every time we need the definition of a dataset type.

In preparation for this change, DatasetTypeCache was tweaked to make it non-generic, and to make all inner cached values immutable.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 96.52778% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.49%. Comparing base (f14086a) to head (8dfcee8).
Report is 11 commits behind head on main.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...af/butler/registry/datasets/byDimensions/tables.py 88.23% 2 Missing and 2 partials ⚠️
.../butler/registry/datasets/byDimensions/_manager.py 98.36% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
+ Coverage   89.46%   89.49%   +0.02%     
==========================================
  Files         366      366              
  Lines       48781    48801      +20     
  Branches     5908     5900       -8     
==========================================
+ Hits        43644    43676      +32     
+ Misses       3721     3718       -3     
+ Partials     1416     1407       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

DatasetTypeCache is only used in a single file with a specific set of types, so it no longer needs to be generic.

Upcoming changes to make it thread-safe and cloneable will be easier to reason about with concrete types.
In preparation for sharing DatasetTypeCache between threads, make its inner DynamicTables values immutable.  The mutable portion moved to a separate cache inside DatasetTypeCache.

As a side effect, this reduces the number of times we go to the DB to check for the existence of tag and calib tables.
This cache is used only by a single manager, and has never participated in the caching enable/disable logic associated with CachingContext.

Getting it out of CachingContext encapsulates its creation and use within a single manager class, simplifying upcoming changes. This also removes some unused branches from the code.
Pre-fetch dataset types the first time a repository is accessed in Butler server, to avoid the need to re-fetch them in most later operations.
Trigger dataset type preload the first time a connection is made to the Butler server in each unit test, to better match the conditions that will exist in the actual server.
Fix a bug where loading a dataset type registered externally after the Butler had loaded a "full" dataset type cache would cause an assertion failure "Dataset type cache population is incomplete" due to only filling in one of the two caches.
Copy some notes from the comments on DM-42317, and update them for the changes to dataset type caching.
@dhirving dhirving marked this pull request as ready for review December 10, 2024 21:28
Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, a couple of minor suggestions.

@@ -449,10 +451,17 @@ def makeCalibTableSpec(
return tableSpec


DynamicTablesCache: TypeAlias = ThreadSafeCache[str, sqlalchemy.Table]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the name sounds like it is a cache for DynamicTables instances, maybe just call it TableCache or similar?

@@ -438,22 +427,15 @@ def getDatasetRef(self, id: DatasetId) -> DatasetRef | None:
run = row[self._run_key_column]
record = self._record_from_row(row)
dynamic_tables: DynamicTables | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This declaration is not needed any more?

@dhirving dhirving merged commit 36ed777 into main Dec 11, 2024
19 checks passed
@dhirving dhirving deleted the tickets/DM-47948 branch December 11, 2024 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants