DM-41878: Implement RemoteButler.get() backed by a single FileDatastore #912
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #912      +/-   ##
==========================================
+ Coverage   87.50%   87.55%   +0.04%
==========================================
  Files         292      295       +3
  Lines       38124    38241     +117
  Branches     8081     8088       +7
==========================================
+ Hits        33362    33482     +120
+ Misses       3554     3553       -1
+ Partials     1208     1206       -2
```
Force-pushed from 9097407 to 9406699.
Looks great. There are some numpydoc issues, and some single- vs. double-backtick issues for reST, to fix.
```diff
@@ -808,7 +808,7 @@ def get_dataset_type(self, name: str) -> DatasetType:
     def get_dataset(
         self,
         id: DatasetId,
-        storage_class: str | StorageClass | None,
+        storage_class: str | StorageClass | None = None,
```
Maybe we should mark this method as requiring keyword arguments after `id`?
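For illustration, a minimal sketch of that suggestion, mirroring the diff above; the bare `*` makes everything after `id` keyword-only (the return annotation is an assumption for the sketch, not taken from the diff):

```python
def get_dataset(
    self,
    id: DatasetId,
    *,
    storage_class: str | StorageClass | None = None,
) -> DatasetRef | None:  # return type assumed for this sketch
    ...
```

Callers would then have to write `butler.get_dataset(some_id, storage_class=...)` rather than passing the storage class positionally.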
```diff
@@ -320,3 +332,26 @@ def update(self, **kwargs: Any) -> StoredFileInfo:

     def __reduce__(self) -> str | tuple[Any, ...]:
         return (self.from_record, (self.to_record(),))


 class SerializedStoredFileInfo(_BaseModelCompat):
```
I still think it's a bit odd that we now have a dataclass defining the datastore record content, a pydantic model defining the datastore record content, and a method in `FileDatastore` for creating the database table that must match the other two but uses neither to define it. If `makeTableSpec` built up its definition from `StoredFileInfo` (if we are saying that `StoredFileInfo` and `makeTableSpec` are required to agree, but the serialized form is allowed to differ), that would at least limit the duplication.
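A rough illustration of that idea, assuming the columns can be declared once and consumed by both sides; `ColumnDef`, `make_table_spec`, and the column list below are hypothetical stand-ins, not the existing `ddl` or `StoredFileInfo` definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnDef:
    """Hypothetical single source of truth for one datastore-record column."""

    name: str
    dtype: type
    nullable: bool = False

# Declared once; both the Python attributes and the table spec derive from it.
STORED_FILE_INFO_COLUMNS = [
    ColumnDef("path", str),
    ColumnDef("formatter", str),
    ColumnDef("storage_class", str),
    ColumnDef("component", str, nullable=True),
    ColumnDef("checksum", str, nullable=True),
    ColumnDef("file_size", int),
]

def make_table_spec(columns: list[ColumnDef]) -> dict[str, dict]:
    # Stand-in for building the real ddl.TableSpec from the shared column list.
    return {c.name: {"type": c.dtype.__name__, "nullable": c.nullable} for c in columns}
```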
`StoredFileInfo` isn't actually a dataclass; it's a subclass of the abstract base class `StoredDatastoreItemInfo`, with a property that can't be instantiated without injecting configuration, and several business-logic methods. And the representation of some of the fields varies between the various cases (e.g. `component` is `str | None` on the public API, but in the database it's `str | Literal['__NULL_STRING__']`).
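For illustration, the kind of representation mapping being described; the sentinel value comes from the comment above, the helper names are hypothetical:

```python
NULL_STRING = "__NULL_STRING__"

def component_to_record(component: str | None) -> str:
    # The public API uses None; the database column stores a sentinel string.
    return NULL_STRING if component is None else component

def component_from_record(value: str) -> str | None:
    # Invert the mapping when reading a record back out of the database.
    return None if value == NULL_STRING else value
```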
I had considered making `StoredFileInfo` have-a `SerializedStoredFileInfo` and forwarding the properties to it, but it didn't seem to buy much, because you still have to have the duplicate properties to do the forwarding. I also think some of the representations might need to vary here (e.g. I kinda think `SerializedStoredFileInfo` shouldn't have a `path` at all, or if it does it should always be relative, whereas the one in `StoredFileInfo` is sometimes absolute).
I also considered making `StoredFileInfo` inherit-from `SerializedStoredFileInfo` or an additional shared pydantic model for the duplicated fields, but inheritance seems harder to understand than duplicating five string properties. And if we add more properties, they will likely be non-nullable on `StoredFileInfo` but will have to be nullable on the serialized one to handle backwards compatibility.
I do think that the methods on `StoredFileInfo` could be moved to `FileDatastore`, and `StoredFileInfo` could then become an actual dataclass. Or, if we were using the SQLAlchemy ORM layer, `StoredFileInfo` could just be the ORM object. But I know you had plans to expose these in more places, which is why you added the abstract base class, so to my mind that kind of change is way outside the scope of this ticket.
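For comparison, a sketch of the rejected "have-a" option mentioned above; it shows how the forwarding properties reintroduce the duplication (the field names are taken from the discussion, the shape of `SerializedStoredFileInfo` is assumed):

```python
class StoredFileInfo:
    # Hypothetical composition variant: wrap the pydantic model, forward to it.
    def __init__(self, serialized: SerializedStoredFileInfo):
        self._serialized = serialized

    @property
    def path(self) -> str:
        # Every serialized field still needs a forwarding property like this,
        # so the five string properties end up written out twice anyway.
        return self._serialized.path

    @property
    def checksum(self) -> str | None:
        return self._serialized.checksum
```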
```python
def prepare_get_for_external_client(self, ref: DatasetRef) -> FileDatastoreGetPayload:
    # Docstring inherited

    def to_file_info_payload(info: DatasetLocationInformation) -> FileDatastoreGetPayloadFileInfo:
```
Why is the helper function needed here? Isn't it clearer to do this instead?

```python
file_records = []
for location, file_info in self._get_dataset_locations_info(ref):
    file_records.append(FileDatastoreGetPayloadFileInfo(...))
```
🤷 To me it's equally clear to have a small function explicitly declaring what its inputs and outputs are, but that might be the TypeScript programmer in me talking.
A timing test in IPython hints that calling the helper function takes a third longer than doing the loop directly. That seems like a lot of overhead for something that is simpler without it (and there are still type annotations you can add to the list above to make it clear).
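For reference, a hedged sketch of the kind of timing comparison described; the stand-in constructor below is a placeholder, since the real payload types need more setup:

```python
import timeit

def build(x: int) -> tuple[int, int]:
    # Stand-in for a small helper like to_file_info_payload.
    return (x, x + 1)

# Compare calling the helper per element against inlining the construction.
with_helper = timeit.timeit(lambda: [build(i) for i in range(1000)], number=1000)
inline = timeit.timeit(lambda: [(i, i + 1) for i in range(1000)], number=1000)
print(f"helper: {with_helper:.3f}s  inline: {inline:.3f}s")
```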
```python
            dataId=ref.dataId,
        )
        return get_dataset_as_python_object_from_get_info(
            datastore_file_info, ref=ref, parameters=parameters, cache_manager=DatastoreDisabledCacheManager()
        )
```
Why is the cache being disabled here? Isn't the cache the thing we want for remote data?
We do eventually want it to be enabled, but I'll need to make a place for it to live, and I still need to audit it for thread-safety. I didn't see any locks in there, so I assumed by default that it's probably not thread-safe.

It should probably also be configurable on the client side: I'm assuming the services backend use case does not want caching on a local disk, but it should probably be on by default for the notebook use case.

I should have made a ticket for this, though; I'll do that.
It's not using globals and is meant to be safe for use with multiple processes sharing a cache. I would not expect threads to be a problem: it mostly scans the file system for its source of truth, and when a file is being retrieved it puts it in a separate location to make sure another process doesn't delete the file whilst it's being read.
Yeah, so it seems that having a separate instance of `DatastoreCacheManager` per thread is likely thread-safe, but sharing an instance across threads is not, because of `self._cache_entries`.
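One conventional way to get that per-thread behavior, sketched under the assumption that `DatastoreCacheManager` is in scope here and can be constructed without arguments (both assumptions):

```python
import threading

_local = threading.local()

def get_cache_manager() -> "DatastoreCacheManager":
    # A separate DatastoreCacheManager per thread avoids sharing the mutable
    # self._cache_entries state; constructor arguments are omitted/assumed.
    if not hasattr(_local, "cache_manager"):
        _local.cache_manager = DatastoreCacheManager()
    return _local.cache_manager
```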
Yeah, it's a mutable object so that's always going to be a problem.
Opened DM-42066
Force-pushed from cf30c39 to 9ef02ce.
Commits in this push:

- In preparation for extracting portions of the `FileDatastore` `get()` method to a free function to be used for client/server, remove references to `self` and split the function into a few pieces.
- Convert portions of `FileDatastore.get()` to free functions, to allow the get operation to be called on the client side without instantiating a `FileDatastore` object.
- Added re-usable functions for sending a POST using a Pydantic model, and for parsing a response using the model (see the sketch after this list).
- Fixed an issue where `UnboundLocalError` would be thrown when passing a dict as a DataId to a `RemoteButler` function.
- Added a partial implementation of `get()` for `RemoteButler`, which so far only handles the `DatasetRef` form of `get()`. This modifies the `DataStore` interface to include a function that serializes the information necessary to do the `get()` on the client side.
- Remove a small amount of duplicated code by simplifying the function parameters.

Co-authored-by: Tim Jenness <[email protected]>
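A hedged sketch of what such POST-and-parse helpers might look like; the function name, the use of `httpx`, and the pydantic v2 method names are all assumptions, not the code from this PR:

```python
from typing import TypeVar

import httpx
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)

def post_model(client: httpx.Client, path: str, request: BaseModel, response_type: type[T]) -> T:
    # Serialize the request model to JSON, POST it, and validate the response
    # body against the expected response model.
    response = client.post(
        path,
        content=request.model_dump_json(),
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return response_type.model_validate_json(response.content)
```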
Force-pushed from 9ef02ce to d718a09.
Checklist

- doc/changes