Use Cases for Search and Indexing #10
-
My Solid Calendar app needs to find all calendar events in a pod that occur within an arbitrary, end-user-defined time interval. The calendar exists in one pod and is owned by one agent, but that agent will want to share individual events with other agents.

The app manages a list of "events" in a user's pod, based on the schema.org Event schema. Every event happens over an interval of time, described by its startDate and endDate. The set of events displayed can be represented by the following SQL query:

SELECT * FROM EVENT
WHERE (EVENT.startDate >= @IntervalStart AND EVENT.startDate <= @IntervalEnd)
   OR (EVENT.endDate >= @IntervalStart AND EVENT.endDate <= @IntervalEnd)
   OR (EVENT.startDate <= @IntervalStart AND EVENT.endDate >= @IntervalEnd);

Solid currently provides only limited ways to organize and retrieve such an event set. Most of the ways I know about result in either:
The ways I know about are:
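To make the selection logic of the SQL query concrete, here is a minimal Python sketch. It filters events that are already loaded into memory, so it illustrates only which events should be selected, not how they would be retrieved from a pod. The `Event` class, the function names, and the sample data are all made up for illustration; only startDate and endDate come from the query.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical in-memory stand-in for a schema.org Event stored in a pod."""
    name: str
    start_date: datetime  # corresponds to startDate in the query
    end_date: datetime    # corresponds to endDate in the query


def matches_query(event, interval_start, interval_end):
    """Literal translation of the three OR-ed clauses of the SQL query."""
    starts_inside = interval_start <= event.start_date <= interval_end
    ends_inside = interval_start <= event.end_date <= interval_end
    spans_interval = (event.start_date <= interval_start
                      and event.end_date >= interval_end)
    return starts_inside or ends_inside or spans_interval


def overlaps(event, interval_start, interval_end):
    """Shorter equivalent test, valid only when start_date <= end_date
    and interval_start <= interval_end."""
    return event.start_date <= interval_end and event.end_date >= interval_start


def events_in_interval(events, interval_start, interval_end):
    """Return the events the calendar view should display for the interval."""
    return [e for e in events if matches_query(e, interval_start, interval_end)]


# Made-up sample data covering each clause of the query.
events = [
    Event("before", datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 10)),
    Event("starts inside", datetime(2024, 5, 14), datetime(2024, 5, 20)),
    Event("ends inside", datetime(2024, 5, 1), datetime(2024, 5, 12)),
    Event("spans interval", datetime(2024, 5, 1), datetime(2024, 5, 31)),
    Event("after", datetime(2024, 5, 25), datetime(2024, 5, 26)),
]
interval_start = datetime(2024, 5, 10)
interval_end = datetime(2024, 5, 15)

selected = [e.name for e in events_in_interval(events, interval_start, interval_end)]
print(selected)
```

The two-comparison form in `overlaps` selects the same events as the three-clause form whenever every event starts no later than it ends, which may matter if an index can only serve simple range comparisons.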
-
This use case, like everything else about the ESPRESSO project, is a work in progress. We welcome your comments, questions, suggestions and feedback.

You should join us at the DESERE Workshop in Singapore on 13 May 2024! More info: https://arxiv.org/abs/2403.07732 Register: https://www2024.thewebconf.org/attending/registration/

You should also come to our workshop in London on 13-14 June 2024! Watch this space for details.

ESPRESSO is a joint project between the University of Southampton and Birkbeck, University of London, funded by EPSRC. The NUS Extreme Search Centre (NExT++) and Dataswyft are project partners.

ESPRESSO is researching large-scale data search across distributed Solid servers by developing and evaluating decentralised algorithms, meta-information data structures, and indexing techniques. The aim is to support both keyword-based search and SPARQL querying. To achieve this, we are factoring information about owners’ data access and caching restrictions into our algorithm, data structure, and index design.

We use generic and domain-specific scenarios to drive development and evaluation. The generic scenarios include (i) exploratory keyword-based search (both top-k and exhaustive) across a large number of pods to discover information or to build communities; (ii) a large number of users posing distributed SPARQL queries over a large number of pods; and (iii) community-based keyword search and querying, where access to query endpoints and data is specified at the level of community membership. Domain-specific scenarios from the health and wellbeing domain are also used.

The architecture that we use for experimentation includes the following components:
We are experimenting with different deployment settings, where each Solid server hosts anywhere from one pod to thousands, and we will trial different distributions of pods among Solid servers. This allows us to explore the challenges of search at two extremes: on the one hand, settings where a single Solid server manages a large number of fine-grained access control and caching policies; on the other, settings where queries must be propagated on a large scale, taking into account the access control and caching restrictions of a large number of Solid peers. We also consider a range of intermediate settings between these two extremes.

The following is an example of a health and well-being scenario that we consider:

Activity Level Study

The Task: Eve, a researcher working at the Medical Research Institute in the UK, wants to do a pre-clinical study on the relationship between activity levels and post-operative recovery time. Eve wants to select participants whose medical records show that they have undergone surgery, and to integrate this data with the step count information from each individual’s wearable fitness monitor.

The Problem: Confidential patient data is stored in silos that even authorized parties, such as Eve, often have difficulty accessing. Combining it with wellness data from commercial fitness monitors would add real-time insight into the study volunteers’ actual everyday behaviour, but that wellness data is typically stored on the device vendors’ servers and is not searchable or discoverable by classic centralized search engines, nor should it be, since it is special category data. Users should be able to obtain a copy of their data under GDPR, but to make the best use of it they need to be able to store it securely and share it without losing control of it.

The Solution: Each patient in this scenario has downloaded a partial copy of their patient records via an NHS app.
Responding to an initiative that enables patients to make their medical records available to researchers and to be contacted about volunteering opportunities, the NHS (and/or a consortium of medical research institutions) hosts dedicated Solid servers where each patient volunteer can store the copy of their records in their own patient pod. Some patients have also downloaded their fitness tracker data and are keeping that on a Solid pod as well. This could be the patient pod, or a personal pod hosted by the provider of their choice. Both pods would be identified by the same WebID.

Eve’s institutional MRIWebID lets her use ESPRESSO to search the patient pods for potential volunteers. These patient pods are all held on institutional servers, so Eve doesn’t have to do a global search. Once she has her list of WebIDs of patients who meet the study criteria, she has a finite list of personal pods to search for fitness tracker data to which the MRIWebID has read access.

In some cases, patients will not want their medical records to be made directly searchable, but will be willing to make them discoverable in summary form. This may be all Eve needs to know at the recruitment stage.

Ideally, the fitness tracker app would have an option to share the data on Solid pods for medical research purposes, so the user does not have to acquire and store it manually. Once Solid pods have reached this level of mainstream adoption, fitness data from various vendors should eventually converge on a format that is easier to query.

Now all the data is in place, ready for Eve to do her ESPRESSO search. First, she does a keyword search over the patient pods, looking for patients who have had surgery.
Once she has narrowed down a set of WebIDs of patients matching the study criteria, she searches for step counts in two places: first in the patient pods (since patients can keep their fitness data there), and then in the personal pods (where the pod owners on Eve’s list may be keeping their fitness data if it isn’t in their patient pods).

For more use cases, see our GitHub: https://github.com/espressogroup/ESPRESSO/blob/main/ESPRESSO%20Use%20Cases.pdf
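Eve's two-stage search can be illustrated with a small Python sketch. This is a toy in-memory model under stated assumptions, not ESPRESSO's algorithm or API: the `Pod` class, both functions, the example WebIDs, and the sample records are hypothetical, access control is reduced to a set of reader WebIDs per pod, and keyword search is a plain substring match. Summary-form discoverability is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical in-memory stand-in for a Solid pod."""
    owner: str                                     # WebID of the pod owner
    readers: set = field(default_factory=set)      # WebIDs granted read access
    documents: list = field(default_factory=list)  # plain-text resources


def keyword_search(pods, searcher, keyword):
    """Stage 1: owner WebIDs of pods the searcher may read that mention the keyword."""
    keyword = keyword.lower()
    return {
        pod.owner
        for pod in pods
        if searcher in pod.readers
        and any(keyword in doc.lower() for doc in pod.documents)
    }


def find_step_counts(patient_pods, personal_pods, searcher, candidates):
    """Stage 2: look for step counts in patient pods first, then personal pods."""
    found = {}
    for pods in (patient_pods, personal_pods):
        for pod in pods:
            if pod.owner not in candidates or pod.owner in found:
                continue  # not a study candidate, or data already located
            if searcher not in pod.readers:
                continue  # respect the owner's access restrictions
            docs = [d for d in pod.documents if "step count" in d.lower()]
            if docs:
                found[pod.owner] = docs
    return found


# Made-up WebIDs and records, for illustration only.
MRI = "https://mri.example/eve#me"
ALICE, BOB, CAROL, DAVE = (f"https://id.example/{n}#me" for n in ("alice", "bob", "carol", "dave"))

patient_pods = [
    Pod(ALICE, {MRI}, ["Surgery: knee replacement", "Step count 2024-01-01: 8000"]),
    Pod(BOB, {MRI}, ["Surgery: appendectomy"]),
    Pod(CAROL, set(), ["Surgery: hip replacement"]),  # not shared with Eve
    Pod(DAVE, {MRI}, ["Vaccination record"]),
]
personal_pods = [
    Pod(BOB, {MRI}, ["Step count 2024-01-01: 5200"]),
    Pod(DAVE, {MRI}, ["Step count 2024-01-01: 9100"]),
]

candidates = keyword_search(patient_pods, MRI, "surgery")
step_data = find_step_counts(patient_pods, personal_pods, MRI, candidates)
print(sorted(step_data))
```

In this toy data, the second stage only visits pods whose owners were found in the first stage, which mirrors the point that Eve ends up with a finite list of personal pods rather than a global search.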
-
Please add below a brief description of your project's use case(s) for search and indexing. Describe the problem, not the solution. In other words, don't say "we need SPARQL or whatever"; describe where the data is stored (one pod, multiple pods) and the ownership and access rights needed.