-
Notifications
You must be signed in to change notification settings - Fork 15
Document Sets
Searching variables/datasets/studies/networks is the first step in the data exploration. The second step is the exploitation of the found documents: saving sets of variables, exporting data dictionaries, composing variable sets, getting some taxonomy coverage statistics, binding a variable set to a data access request, searching entities matching variables from a variable set etc.
Adding documents to a cart and saving them in sets are features available to anyone, i.e. authentication is not required.
See also GDC custom Sets User Guide.
Make search in the web data portal more useful.
Describe simply who is doing what and how to obtain the result.
# | Who | What | How | Result |
---|---|---|---|---|
1 | ||||
2 | ||||
... |
Server and (js) client.
A document (variables etc.) set is a set of documents that is:
- explicitly described by an enumerated list of variable identifiers OR described by a composition of several sets
- associated to a user
- uniquely identified
- described by a human readable name
Despite it would be very convenient to store the document set on client side, due to the limit of the browser database the documents set must always be persisted on server side (even for anonymous users) and the client will only handle the sets meta information (name, number of documents etc.). Document set operations are also performed on server side: union, intersection, complement, export etc.
Document set persistance is done in two parts:
- document set (id, name, creation date etc) are stored in MongoDB
- document set identifiers are stored in the targetted document: for instance a dedicated field of class DatasetVariable: sets that is an array of variable set identifiers to which the variable belongs.
Storing the association between a document and one or more sets in the search engine allows to apply document search criteria combined with the belonging to one or more sets, in order to:
- display documents from a set in the search page,
- count documents in the subsets when preparing variable set composition.
As the search criteria are expressed using a taxonomy (the document properties one), a vocabulary that represent the sets to which a document belongs is to be added: exact match queries will be performed on this field to extract documents belonging to one or more sets.
When a document is indexed (after a dataset has been updated for instance), the indexing process must enrich the documents with the sets they belong to. This requires for each document a query in MongoDB to find the sets that contains the document identifier; these sets identifiers are then added to the sets field for indexing. This way the document index is still usable for filtering documents by the sets they belong to, even after a re-publication.
In case some documents have been removed (after a document update), the count of documents in a set must be extracted from the document index (not from the MongoDB object).
A set operation is a list of set compositions. These compositions are expressed by a RQL query string.
A composed set is automatically created (and persisted in MongoDB within a Set Operation) when user is making operations on sets. The composed sets does not list explicitly the identifiers of the associated documents; instead of that it provides:
- the list of the set ids that are involved in the set operation,
- the query (RQL) that is to be used to develop the user query from the search page.
For instance, the user query on the composed set:
in(Mica_variable.sets,inter_s1_s2)
is developped before being submitted to the search engine as:
and(in(sets,S1),in(sets,S2))
The cart can contain sets of documents with different types.
Documents can be added to the cart when browsing the repository (network/study/dataset/variable pages) or when searching documents. From the server point of view, a document cart is a document set without a name. This document set content can be updated (addition/deletion of documents). The action of saving the document cart simply gives a name to this set (and apply the current user name if the user has logged-in in the meantime).
A document set can be created by:
- getting the list of documents from the cart,
- saving a search query results,
- composing several document sets,
- importing a list of document identifiers.
Several document sets can be composed. Result of this operation can be used to:
- create a new document set,
- download the documents.
Operations that can be performed on document sets are (see Basic operations on Sets):
- U : union
- ∩ : intersection
-
- : difference (relative complement)
The set documents statements can be described in RQL:
- union(S1,S2,S3)
- inter(S1,S2,S3)
- diff(inter(S1,S2),S3)
- diff(S1,union(S2,S3))
- etc.
The document set can be downloaded in a CSV/TSV file.
The list of the document identifiers of the set can be downloaded.
A file containing the document identifiers can be uploaded to build a new enumerated document set.
A document set can be deleted.
Document sets of a user can be listed. When several sets are selected in the list, the possible actions are: operation, download and deletion. The JS client will only the sets that are in the browser local store.
Some REST resources to manage variable sets.
REST | Description |
---|---|
GET /variables/sets | Get the variable sets associated to the current user |
GET /variables/sets?id=xxx&id=xxx | List the variable sets matching the provided identifiers |
POST /variables/sets/operations?s1=xxx&s2=xxx&s3=xxx | Create a set operation from a list of sets (maximum of three) |
GET /variables/sets/operation/xxx | Get a set operation with count of documents for each of the compositions |
DELETE /variables/sets/operation/xxx | Delete a set operation |
POST /variables/sets?name=xxx | Create a variable set from: a variable RQL query, or a set RQL query (to compose variable sets), or a cart identifier. Name can be empty (this makes a variable cart). Current user is automatically associated to the set. |
POST /variables/sets/_import?name=xxx | Create a variable set by uploading a CSV/TSV file containing variable identifiers (in the first column) |
GET /variables/set/id | Get a variable set meta-data |
GET /variables/set/id/_list?offset=0&limit=20 | Page on variables of a set |
GET /variables/set/id/_export | Download the variable identifiers list |
GET /variables/set/id/_download | Download the variable list of a set as a CSV/TSV file |
POST /variables/set/id/variables | Add variables from: a variable RQL query, or a set RQL query (to compose variable sets), or a list of variable identifiers |
DELETE /variables/set/id/variables | Delete all variables of the set. |
POST /variables/set/id/variables/_delete | Delete a specified list of variables. |
PUT /variables/set/id?name=xxx | Update/set the name of the variable set. If the variable set has no name, it is a cart and the current user name is also applied. |
PUT /variables/set/id/_delete | Mark the variable set for removal. |
DELETE /variables/set/id/_delete | Unmark the variable set for removal. |
DELETE /variables/set/id | Delete a variable set (forced). All set operations in which the set is involved will be deleted as well. |
A document set can be created from the cart:
- add documents to the cart (result of a search, single addition when navigating etc.),
- save cart content as a named document set,
- manage/compose document sets.
Or it can be directly saved from the search result:
- save results as a document set,
- manage/compose document sets.
All the functionalities (using cart, saving sets and making operations on them) are available to anyone, i.e. no need to be authenticated to access to variable sets resources of the server.
Additional portal menus are needed to give access to the cart (with dynamic item count) and to the saved variable sets.
When searching for variables using various variable / dataset / study /network criteria, the resulting list of variables can currently be downloaded. It should also be possible to add the variables to the cart.
Variables can be selected individually (across pages). If none is selected all the variables are added to the cart or to a set.
A cart is a set without a name. It should be possible for an anonymous user to create and update its content (affects server side). Then in order to prevent conflicts beteen carts from various anonymous user, the identifier of the cart is to be persisted in the browser. The local storage of the browser will be used for that purpose.
The lifecycle of the variable cart is then the following:
- when managing documents from the cart (add, list, delete),
- if the cart identifier is not known a corresponding variable set is created and identifier is stored in the local storage for further reference,
- else if a cart identifier exists but the corresponding set cannot be accessed (because it was removed from backend), a new variable set is created and cart identifier is updated accordingly.
- when user explicitly logs out, if there is any cart registered in the local storage, the corresponding set is destroyed and reference is cleared.
- a cron task on the server will clear the anonymous user related sets older than one day.
- a strategy needs to be specified when one dataset is indexed (should all the variable sets be removed or marked has being deprecated (question))
When variables are selected in the list, the actions (save, download, clear) apply to the selection, otherwise they apply to the all the variables.
Saving variables in a set offers to either create a new set or to select an existing (if any) set into which the variables will be added.
The dialog fro creating a new set requires the user to provide a unique name:
- when saving from the search page, the default name should remind the criteria of selection
- when saving from the cart or the operation page or the import dialog, the default name is "Set n" where n is the count of user sets in the brower local storage.
If there are any existing set (in the browser local storage), the set can be selected by its name and variables will be added to it (either by providing the corresponding RQL query or a list of variables identifiers).
A user can manage a list of variable sets. This page also leads to the variable set operation functionality.
Action buttons are:
- Import: opens a dialog to select a file to upload and a name,
- Export: export the list of variables identifiers that are part of the set. When several variable sets are selected, export applies to the selection. When no variable set is selected, export applies to all sets.
- Download: download a full description of the variables in the set (name, label, type, dataset, study). When several variable sets are selected, download applies to the selection. When no variable set is selected, download applies to all sets.
- Operation: variable set composition is enabled when at least two sets and no more than three sets are selected
- Delete: when several variable sets are selected, deletion applies to the selection. When no variable set is selected, delete applies to all sets. No user confirmation is required when deletion is pressed. Instead of that, the system mark the variable set(s) as being deleted (with a timestamp). Deletion will be effective after a clean-up task has identified and removed the sets that are older than 5 minutes (for instance). On deletion action, an alert message shows up allowing to undo the action.
The variable set name can be edited in-place.
When clicking on the count of variables, the search page is displayed with the list of corresponding variables: the query issued is based on the variable set field that matches exactly the variable set identifier.
The file contains a list of variable identifiers. This is the same type of file that can be obtained by exporting a variable set.
The challenge is to allow user to compose variable sets using a graphical representation.
Given two sets S1 and S2, the following expressions describe all the disjoint sub-sets (and the union of all):
Operation | RQL | Set RQL |
---|---|---|
S1 U S2 | in(sets,(S1,S2)) | union(S1,S2) |
S1 ∩ S2 | and(in(sets,S1),in(sets,S2)) | inter(S1,S2) |
S1 - S2 | and(in(sets,S1),out(sets,S2)) | diff(S1,S2) |
S2 - S1 | and(in(sets,S2),out(sets,S1)) | diff(S2,S1) |
Given three sets S1, S2 and S3 the following expressions describe all the disjoint sub-sets (and the union of all):
Operation | RQL | Set RQL |
---|---|---|
S1 U S2 U S3 | in(sets,(S1,S2,S3)) | union(S1,S2,S3) |
S1 ∩ S2 ∩ S3 | and(in(sets,S1),in(sets,S2),in(sets,S3)) | inter(S1,S2,S3) |
( S1 ∩ S2 ) - S3 | and(in(sets,S1),in(sets,S2),out(sets,S3)) | diff(inter(S1,S2),S3) |
( S2 ∩ S3 ) - S1 | and(out(sets,S1),in(sets,S2),in(sets,S3)) | diff(inter(S2,S3),S1) |
( S1 ∩ S3 ) - S2 | and(in(sets,S1),out(sets,S2),in(sets,S3)) | diff(inter(S1,S3),S2) |
S1 - ( S2 U S3 ) | and(in(sets,S1),out(sets,(S2,S3))) | diff(S1,union(S2,S3)) |
S2 - ( S1 U S3 ) | and(in(sets,S2),out(sets,(S1,S3))) | diff(S2,union(S1,S3)) |
S3 - ( S1 U S2 ) | and(in(sets,S3),out(sets,(S1,S2))) | diff(S3,union(S1,S2)) |
All these subset expressions can be transated a ES boolean queries (using must and must not statements).
These sub-sets can be represented using a Venn Diagram where each disjoint sub-set has a count and can be selected:
The result of selecting several sub-set expressions makes the union of these sub-sets.
For more than three sets, the list of all the expressions describing the sub-sets is too long and it can hardly be represented with a venn diagram. Then the graphical representation of the variable sets composition will be limited to a maximum of three sets. In order to allow more than three sets management, the result of the composition can be saved as a set (which can be composed with other sets and so on).
Saving a subset (or a union of subsets) requires a name to be provided by the user.
How can the feature be tested or demonstrated. It is important to describe this in fairly great details so anyone can perform the demo or test.
This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.