-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ds.summary simplification #257
Comments
What you describe is a common pattern when building extensible frameworks. Would it be acceptable to scheduled this change to v6.2? |
Certainly v6.2 would be fine - I don't think it is urgent but I feel it becomes more relevant the more that we simply 'wrap' existing package functionality with privacy protection. It feels like there could be wider applications beyond just the summary function - perhaps with plotting, too? |
I was wondering if complicated function, for example 'predict', could also benefit from a similar approach. |
yes - I see what you mean, as it seems to be the same situation there |
I don't know the status of this in v6.2, but the current way in which ds.summary works is also causing some challenges in the ds-helper package. At the moment it seems to lack any parallelisation, in that a separate call is made to each server. So when you have a lot of studies, distributed globally, it takes a long time to run. A side point is when I have understood correctly that DataSHIELD calls to the servers are indeed made asynchronously (ie all at the same time) not synchronously (wait for each server to return before going to the next). Anyway, this effect is amplified in the ds-helper getStats function which produces a tabular set of stats useful for papers. So we would like to address if possible |
ds.summary was originally written to deal with simple R object types like dataframes and vectors. It tests the object type on the server side, and depending on the result, calls a serverside method that generates a privacy protected summary and returns it to the client.
There are 2 things I would like to raise about this approach:
I have been looking how the native summary() function works. It is written as a generic function, which just provides a placeholder. When new objects are defined, they can provide their own method for summarising the data. For example, an object of class blob could have a method called summary.blob(). When summary() is called with a blob object, the code in summary.blob() is executed. There is also a method summary.default() which looks at the underlying structure of an object - most objects are matrices, lists etc. with some extra attributes - and then summarises it. For example, if the blob object was just a matrix with some extra attributes, and we did not define a method for summarising it, when summary() is called, it falls back to summary.default(). This sees that the blob object is really just a matrix and summarises each column.
I wonder if a similar approach might be useful in DataSHIELD, to save having to explicitly define summary functions for every object type.
This would mean having a generic summaryDS() function which is analogous to the summary() function, but with privacy protection. Objects created with DataSHIELD on the serverside could also define their own version of summaryDS(), for example summaryDS.glm() for GLMs.
The text was updated successfully, but these errors were encountered: