Use Cases for Search and Indexing #10

jeff-zucker (Maintainer) · 2024-04-03T15:38:13Z

Please add below a brief description of your project's use case(s) for search and indexing. Describe the problem, not the solution. In other words, don't say "we need SPARQL or whatever"; describe where the data is stored (one pod, multiple pods) and the ownership and access rights needed.

timbot1789 · 2024-04-04T21:35:05Z

My Solid Calendar app requires the ability to find all calendar events in a pod that occur over an arbitrary, end user-defined time interval. The calendar exists in one pod, and is owned by one agent, but that agent will want to share individual events with other agents.

The Solid calendar app manages a list of "events" in a user's pod that are based on the schema.org event schema. All events happen over an interval of time, described by their startDate and endDate properties (these properties can contain both date and time information, and may span any amount of time, even years). The Solid calendar app then displays all events that occur over a certain time interval, like a month, a week, or a day. The user may choose to change this time on demand, and the calendar will display a new set of events. The user may create new events on their calendar, or edit existing events. The user may also share events with other people, which grants the recipient read privileges and limited edit privileges to the event.

The set of events displayed can be represented by the following SQL query:

SELECT * from EVENT WHERE (EVENT.startDate >= @IntervalStart AND EVENT.startDate <= @IntervalEnd) OR (EVENT.endDate >= @IntervalStart AND EVENT.endDate <= @IntervalEnd) OR (EVENT.startDate <= @IntervalStart AND EVENT.endDate >= @IntervalEnd);
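For readers who prefer code to SQL, here is a minimal sketch of the same condition as a single overlap test. It is not taken from the calendar app. The function name `overlaps` and the sample dates are illustrative assumptions, and the equivalence with the three OR-ed clauses only holds if every event has startDate <= endDate and the interval start is not after the interval end.

```python
from datetime import datetime


def overlaps(start: datetime, end: datetime,
             interval_start: datetime, interval_end: datetime) -> bool:
    """Return True if [start, end] intersects [interval_start, interval_end].

    Assumes start <= end and interval_start <= interval_end; under that
    assumption this matches the three OR-ed clauses of the SQL query.
    """
    return start <= interval_end and end >= interval_start


# Hypothetical example: an event running from January into March,
# checked against a February view.
event_start = datetime(2024, 1, 20)
event_end = datetime(2024, 3, 5)
feb_start = datetime(2024, 2, 1)
feb_end = datetime(2024, 2, 29, 23, 59, 59)

in_february = overlaps(event_start, event_end, feb_start, feb_end)
```

The difficulty described in this post is not the predicate itself but evaluating it without first fetching every event resource.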

Solid's current system only provides limited ways to organize and retrieve such an event set. Most of the ways I know about result in either:

fetching too many events and ignoring many of them (overfetching), which can cause performance problems, or
fetching too few and not displaying all relevant events that occur in a timeframe (underfetching), which can cause behavior problems.

The ways I know about are:

1. Store all events in one single "calendar.ttl" file. This is an efficient fetch, but it means that individual events cannot be shared. The user can only share their entire calendar. It also results in overfetching, as a person's calendar may span several years, but they're usually only interested in seeing a week or a month.
2. Create a "calendar" container and then create an "event-{uuid}.ttl" file for every event. This allows individual events to be shared with other users, but as there is no way to determine when events occur without fetching them, it requires the app to fetch all events in the container, which is an overfetch. It is also rather slow. This is the approach the calendar app currently implements, and there are noticeable load times for calendars with more than 100 events, even when the app is on the same machine as the pod server.
3. Create a "calendar" container, and create a series of containers underneath it for each month, which then contain the event.ttl files that occur in that month, based on their start time. Then fetch all events in the container for the month that the user is viewing. This results in an underfetch if an event spans multiple months. If an event starts in January and ends in March, then the event should be displayed when viewing the January, February, or March calendars. However, it would only be retrieved when fetching January. In order to guarantee a proper calendar display, the app must fetch all events, which then means this case devolves into approach 2. Copies of the event could be stored in the February and March containers, but that makes data management more complicated. It also results in an overfetch if the user wants to view only a week or day, instead of an entire month.
4. Create a "calendar" container and then create an "event-{startDate}-{endDate}-{uuid}.ttl" file for every event. Same as 2, but include the start and end time in the file name. This allows the app to fetch only the calendar container, and then decide based on the file names which files it needs to fetch. This avoids both underfetching and overfetching. However, it makes editing the start or end time of an event more complicated, as it now functionally requires deleting the old event, replacing it with a new event, and modifying any resources that point to the old event to point to the new event. It also makes data privacy weaker, as a person with access to the calendar container, but not the individual event, can still see the event's start and end times.
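To make the last approach concrete, here is a rough sketch of selecting files from a container listing by name alone. The post does not say how the dates are encoded in the file name, so the compact UTC form used here (for example 20240120T090000Z), the regular expression, and the function names are all assumptions for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed file-name layout: event-{start}-{end}-{uuid}.ttl, with both
# timestamps in compact UTC form (e.g. 20240120T090000Z). The original
# post does not specify a date encoding; this one is hypothetical.
NAME_PATTERN = re.compile(
    r"^event-(\d{8}T\d{6}Z)-(\d{8}T\d{6}Z)-([0-9a-fA-F-]+)\.ttl$"
)
STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_stamp(text: str) -> datetime:
    """Parse a compact UTC timestamp into an aware datetime."""
    return datetime.strptime(text, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def select_names(names, interval_start, interval_end):
    """Return the file names whose encoded span overlaps the interval."""
    selected = []
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # not an event file in the assumed layout
        start = parse_stamp(match.group(1))
        end = parse_stamp(match.group(2))
        if start <= interval_end and end >= interval_start:
            selected.append(name)
    return selected


# Hypothetical container listing.
names = [
    "event-20240120T090000Z-20240305T170000Z-1b4e28ba-2fa1-11d2-883f-0016d3cca427.ttl",
    "event-20240410T090000Z-20240410T100000Z-6fa459ea-ee8a-3ca4-894e-db77e160355e.ttl",
    "notes.ttl",
]
february = select_names(
    names,
    datetime(2024, 2, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 29, 23, 59, 59, tzinfo=timezone.utc),
)
```

Only the names in `february` would then need to be fetched. The sketch also shows the privacy concern: anyone who can list the container can run the same parse.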

4 replies

elf-pavlik Apr 4, 2024

Thank you, @timbot1789; this is great! I appreciate that you clearly separated the problem description from the solutions you have already considered. I would like to clarify one detail: besides all the calendar(s) that you own, how do you imagine working with all the events that others have shared with you? Let's say 50 individuals and 15 organizations shared specific events with you. This would mean hundreds of events across dozens of pods. I would also like to clarify the first problem and what I missed if I understood it correctly. Later, we could dive into various possible solutions, including replication, but it might be better to do it outside of this discussion, preferably as a Solid CG work item.

timbot1789 Apr 5, 2024

How do you imagine working with all the events that others have shared with you?

That's a good question, and one that I don't have a ready answer for. The goal is to emulate the "add guest" feature present in Google Calendar. With that feature, a guest is able to view an event and set whether they are attending, and the organizer can grant them authority to invite others, or modify the original event.

I haven't tried to implement this feature yet, but one (possibly naive) solution could be to store a copy of the event in the guest's pod that also links back to the source event in the organizer's pod.

The organizer could send an invite to a guest's inbox, which the guest may accept or refuse. If the guest accepts the event, the system creates a copy of the event in the guest's calendar container, with an additional field linking back to the original event on the organizer's pod. I also believe (correct me if I'm wrong) that the webhooks protocol would allow the guest to subscribe to updates from the organizer's pod. If that's not possible, the guest's calendar app could poll the original event in the background, according to some kind of "staleness" marker. This would reduce the need to hit the network for each of the hundreds of shared events (which IME have been a large source of performance issues).
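As a rough illustration of the polling fallback, the sketch below decides which guest-side copies are old enough to re-fetch. Nothing here is implemented in the calendar app. The class, its field names, and the example URLs are hypothetical, and the "staleness" marker is modelled simply as a last-synced time plus a maximum age.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class GuestEventCopy:
    # All field names are hypothetical; the post only says the copy links
    # back to the source event and carries some kind of staleness marker.
    source_url: str          # original event in the organizer's pod
    last_synced: datetime    # when the copy was last refreshed
    max_age: timedelta       # how long the copy counts as fresh


def is_stale(copy: GuestEventCopy, now: datetime) -> bool:
    """Return True if the copy should be re-fetched from its source."""
    return now - copy.last_synced > copy.max_age


def urls_to_refresh(copies, now):
    """List the source URLs of stale copies; fresh ones need no request."""
    return [c.source_url for c in copies if is_stale(c, now)]


now = datetime(2024, 4, 5, 12, 0, tzinfo=timezone.utc)
copies = [
    GuestEventCopy("https://organizer.example/calendar/event-a.ttl",
                   now - timedelta(minutes=5), timedelta(hours=1)),
    GuestEventCopy("https://organizer.example/calendar/event-b.ttl",
                   now - timedelta(hours=3), timedelta(hours=1)),
]
stale = urls_to_refresh(copies, now)
```

With this shape, the app renders from local copies and only touches the network for the entries in `stale`.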

timbot1789 Apr 5, 2024

I would also like to clarify the first problem and what I missed if I understood it correctly

Do you mean the "store all events in a single document"? What would you like to clarify about it?

elf-pavlik Apr 5, 2024

Thanks, @timbot1789; this use case is pretty straightforward. Yes, a webhook notification channel can be used, but something would have to act as a notification receiver, for example an indexing bot, as I also suggested in solid/contacts#7 (comment). However, let's not dive too deep into various solutions in this discussion. Also, in SAI, we have a mechanism of reciprocal agent registrations that doesn't require public appendable and possibly spammable inboxes. You can see them in action in this short video.

HOliver275 · 2024-04-12T20:11:11Z

This use case, like everything else about the ESPRESSO project, is a work in progress. We welcome your comments, questions, suggestions and feedback.

You should join us at the DESERE Workshop in Singapore on 13 May 2024!

More info: https://arxiv.org/abs/2403.07732

Register: https://www2024.thewebconf.org/attending/registration/

You should also come to our workshop in London on 13-14 June 2024! Watch this space for details.

ESPRESSO is a joint project between the University of Southampton and Birkbeck, University of London, funded by EPSRC. The NUS Extreme Search Centre (NExT++) and Dataswyft are project partners.

ESPRESSO is researching large-scale data search across distributed Solid servers by developing and evaluating decentralised algorithms, meta-information data structures, and indexing techniques. The aim is to support both keyword-based search and SPARQL querying. To achieve this, we are factoring information about owners’ data access and caching restrictions into our algorithm, data structure, and index design. We use generic and domain-specific scenarios to drive development and evaluation.

The generic scenarios include (i) exploratory keyword-based search (both top-k and exhaustive) across a large scale of pods to discover information or to build communities; (ii) a large number of users posing distributed SPARQL queries over a large scale of pods; and (iii) community-based keyword-based search and querying, where access to query endpoints and data is specified at the level of community membership. Domain-specific scenarios from the health and wellbeing domain are also used.

The architecture that we use for experimentation includes the following components:

The ESPRESSO Indexing app, which indexes pod data, requires read access to every resource which the pod owner wishes to make available for search. The indexer requires write access to the pod, as the privacy principle requires the index to stay on the pod. Minimal, privacy-preserving information about the indexed data is stored by ESPRESSO on ESPRESSO's own pod.
The ESPRESSO Search app searches all the indexed resources to which the search party's WebID has read access. The search app needs read access to each pod's index file.
The ESPRESSO overlay network (currently based on GaianDB) routes the queries efficiently over Solid servers, and will show each search party a different view, including different ranking, for the search results depending on the access control granted to the search party's WebID.
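The interplay between the per-pod index and access control can be illustrated with a toy model. This is not ESPRESSO code and does not reflect its actual index format. The `PodIndex` class, its fields, and the example URLs and WebIDs are hypothetical, and a real deployment would rely on the Solid server's access control rather than an in-memory set.

```python
from dataclasses import dataclass, field


@dataclass
class PodIndex:
    # Hypothetical in-memory stand-in for a pod's index file.
    # keyword -> resource URLs containing that keyword
    postings: dict = field(default_factory=dict)
    # resource URL -> WebIDs with read access to it
    readers: dict = field(default_factory=dict)

    def add(self, resource, keywords, readers):
        """Index a resource under its keywords and record who may read it."""
        self.readers[resource] = set(readers)
        for keyword in keywords:
            self.postings.setdefault(keyword.lower(), set()).add(resource)

    def search(self, keyword, webid):
        """Return matching resources that `webid` may read, sorted by URL."""
        hits = self.postings.get(keyword.lower(), set())
        return sorted(r for r in hits if webid in self.readers.get(r, set()))


index = PodIndex()
index.add("https://alice.example/notes/surgery.ttl",
          ["surgery", "recovery"], ["https://eve.example/profile#me"])
index.add("https://alice.example/notes/diary.ttl", ["surgery"], [])

results = index.search("Surgery", "https://eve.example/profile#me")
```

Two search parties issuing the same keyword query get different result sets, which is the behaviour the component list describes.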

We are experimenting with different deployment settings, where each Solid server can host numbers of pods ranging from one pod to thousands, and different distributions of pods among Solid servers will be trialled. This allows us to explore the challenges of search in settings where a large number of fine-grained access control and caching policies are managed in a single Solid server on the one hand, and where query propagation on a large scale needs to be conducted, taking access control and caching restrictions of a large number of Solid peers into account, on the other. We also consider a range of intermediate settings in between these two extremes.

The following is an example of a health and well-being scenario that we consider:

Activity Level Study

The Task: Eve, a researcher working at the Medical Research Institute in the UK, wants to do a pre-clinical study on the relationship between activity levels and post-operative recovery time. Eve wants to select participants whose medical records show that they have undergone surgery. Eve wants to integrate this data with the step count information from an individual’s wearable fitness monitor.

The Problem: Confidential patient data is stored in silos that even authorized parties, such as Eve, often have difficulty accessing. Combining it with wellness data from commercial fitness monitors would add real-time insight about the study volunteers’ actual everyday behaviour, but it is typically stored on the device vendors’ servers and is not searchable or discoverable by classic centralized search engines – nor should it be, as special category data. Users should be able to obtain a copy of their data under GDPR, but to make the best use of it they need to be able to store it securely and share it without losing control of it.

The Solution: Each patient in this scenario has downloaded a partial copy of their patient records via an NHS app. Responding to an initiative that enables patients to make their medical records available to researchers, and be contacted for volunteering opportunities, the NHS (and/or a consortium of medical research institutions) hosts dedicated Solid servers where each patient volunteer can store the copy of their records in their own patient pod.

Some patients have also downloaded their fitness tracker data and are keeping that on a Solid pod as well. This could be the patient pod, or their personal pod hosted by the provider of their choice. Both pods would be identified by the same WebID.

Eve’s institutional MRIWebID lets her use ESPRESSO to search the patient pods for potential volunteers. These patient pods are all held on institutional servers, so Eve doesn’t have to do a global search. After she has her list of WebIDs of patients who meet the study criteria, she then has a finite list of personal pods to search for fitness tracker data to which the MRIWebID has read access.

In some cases, patients will not want their medical records to be made directly searchable, but will be willing to make them discoverable in summary form. This may be all Eve needs to know at the recruitment stage.

Ideally, the fitness tracker app would have an option to share the data on Solid pods for medical research purposes, so the user does not have to acquire and store it manually. Once Solid pods have reached this level of mainstream adoption, fitness data from various vendors should eventually converge on a format that is easier to query.

Now all the data is in place, ready for Eve to do her ESPRESSO search. First, she does a keyword search over the patient pods, looking for patients who have had surgery. Once she has narrowed down a set of WebIDs of patients matching the study criteria, she first searches the patient pods for step counts (since patients can keep their fitness data in their patient pods) and then searches the personal pods for step counts (where the pod owners in Eve’s list may be keeping their fitness data, if they aren’t keeping it in their patient pods).
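The two-stage search can be sketched with plain dictionaries standing in for pods. This is an illustration only, not the ESPRESSO search app. The function names, the data layout, and the WebIDs are hypothetical, and access checks are left out for brevity.

```python
def find_candidates(patient_pods, keyword):
    """Stage 1: WebIDs whose patient pod records mention the keyword."""
    return sorted(
        webid for webid, pod in patient_pods.items()
        if any(keyword in record.lower() for record in pod["records"])
    )


def find_step_counts(webid, patient_pods, personal_pods):
    """Stage 2: look in the patient pod first, then the personal pod."""
    for pods in (patient_pods, personal_pods):
        steps = pods.get(webid, {}).get("steps")
        if steps is not None:
            return steps
    return None


# Hypothetical data: WebID -> pod contents.
ALICE = "https://alice.example/profile#me"
BOB = "https://bob.example/profile#me"
CAROL = "https://carol.example/profile#me"

patient_pods = {
    ALICE: {"records": ["Knee surgery, 2023"], "steps": [4200, 5100]},
    BOB: {"records": ["Hip surgery, 2022"]},
    CAROL: {"records": ["Annual check-up"]},
}
personal_pods = {BOB: {"steps": [8000]}}

candidates = find_candidates(patient_pods, "surgery")
step_data = {w: find_step_counts(w, patient_pods, personal_pods)
             for w in candidates}
```

Because stage 1 yields a finite list of WebIDs, stage 2 only visits the pods of those candidates instead of searching globally.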

For more use cases, see our GitHub: https://github.com/espressogroup/ESPRESSO/blob/main/ESPRESSO%20Use%20Cases.pdf

0 replies
