Document Sets
Searching variables/datasets/studies/networks is the first step of data exploration. The second step is making use of the documents found: saving sets of variables, exporting data dictionaries, composing variable sets, getting taxonomy coverage statistics, binding a variable set to a data access request, searching for entities matching the variables of a variable set, etc.
Adding documents to a cart and saving them in sets are features available to anyone, i.e. authentication is not required.
See also GDC Save Sets Specification As An Example.
Make search in the web data portal more useful.
Describe simply who is doing what and how to obtain the result.
# | Who | What | How | Result |
---|---|---|---|---|
1 | ||||
2 | ||||
... |
Server and (js) client.
A document (variables etc.) set is a set of documents that is:
- explicitly described by an enumerated list of variable identifiers OR described by a composition of several sets
- associated to a user
- uniquely identified
- described by a human readable name
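The definition above can be pictured as a small record type. This is only a sketch: the class name, the field names and the use of a random hexadecimal id are assumptions made for illustration, not the actual server model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass
class DocumentSet:
    """Hypothetical shape of a document set (all names are illustrative)."""

    document_type: str  # e.g. "variable", "dataset", "study", "network"
    identifiers: list[str] = field(default_factory=list)  # enumerated set
    composition: Optional[str] = None  # RQL composition, e.g. "inter(S1,S2)"
    name: Optional[str] = None  # human-readable name
    username: Optional[str] = None  # None for an anonymous user
    id: str = field(default_factory=lambda: uuid.uuid4().hex)  # unique id
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # A set is either enumerated or composed, never both.
        if self.identifiers and self.composition:
            raise ValueError("a set is either enumerated or composed, not both")

    @property
    def is_composed(self) -> bool:
        return self.composition is not None


enumerated = DocumentSet("variable", identifiers=["v1", "v2"], name="My variables")
composed = DocumentSet("variable", composition="inter(S1,S2)")
```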
Although storing the document set on the client side would be very convenient, the size limit of the browser database means that document sets must always be persisted on the server side (even for anonymous users); the client only handles the set meta-information (name, number of documents, etc.). Document set operations are also performed on the server side: union, intersection, complement, export, etc.
Document set persistence is done in two parts:
- the document set description (id, name, creation date, etc.) is stored in MongoDB,
- the document set identifiers are stored in the targeted document: for instance, the class DatasetVariable has a dedicated field, sets, that is an array of the identifiers of the variable sets to which the variable belongs.
Storing the association between a document and one or more sets in the search engine makes it possible to combine document search criteria with membership in one or more sets, in order to:
- display documents from a set in the search page,
- count documents in the subsets when preparing variable set composition.
As the search criteria are expressed using a taxonomy (that of the document properties), a vocabulary representing the sets to which a document belongs must be added: exact-match queries are performed on this field to extract the documents belonging to one or more sets.
When a document is indexed (after a dataset has been updated, for instance), the indexing process must enrich the document with the sets it belongs to. This requires, for each document, a query in MongoDB to find the sets that contain the document identifier; these set identifiers are then added to the sets field for indexing. This way the document index remains usable for filtering documents by the sets they belong to, even after a re-publication.
In case some documents have been removed (after a document update), the count of documents in a set must be extracted from the document index (not from the MongoDB object).
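The enrichment and counting rules above can be illustrated with in-memory stand-ins. In this sketch a plain dictionary plays the role of the MongoDB collection and a list of dictionaries plays the role of the search index; the function names are hypothetical.

```python
def enrich_with_sets(documents, stored_sets):
    """Return index-ready copies of the documents, each with a 'sets' field."""
    enriched = []
    for doc in documents:
        # Stand-in for the per-document MongoDB query: which stored sets
        # list this document identifier?
        member_of = sorted(
            set_id for set_id, ids in stored_sets.items() if doc["id"] in ids
        )
        enriched.append({**doc, "sets": member_of})
    return enriched


def count_in_index(index, set_id):
    """Count the indexed documents that belong to the given set."""
    return sum(1 for doc in index if set_id in doc["sets"])


# "S1" still lists "v3", but "v3" was removed by a dataset update.
stored_sets = {"S1": {"v1", "v2", "v3"}, "S2": {"v2"}}
published = [{"id": "v1"}, {"id": "v2"}]
index = enrich_with_sets(published, stored_sets)
```

Here `count_in_index(index, "S1")` counts two documents while the stored set still lists three identifiers, which is why the count has to come from the index.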
#### Set Operation
A set operation is a list of set compositions. Each composition is expressed as an RQL query string.
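As an illustration, the compositions of a set operation could be the regions of a Venn diagram over the selected sets. Which compositions the server actually generates is not stated here, so the enumeration below is an assumption; it only uses the union/inter/diff statements listed in the Operation section, and the function name is hypothetical.

```python
from itertools import combinations


def venn_compositions(set_ids):
    """Return one RQL composition string per Venn region of the given sets."""
    if not 1 <= len(set_ids) <= 3:
        raise ValueError("between one and three sets are expected")
    compositions = []
    for size in range(1, len(set_ids) + 1):
        for included in combinations(set_ids, size):
            excluded = [s for s in set_ids if s not in included]
            # Documents that belong to every included set...
            inner = included[0] if size == 1 else f"inter({','.join(included)})"
            # ...and to none of the excluded ones.
            if not excluded:
                compositions.append(inner)
            elif len(excluded) == 1:
                compositions.append(f"diff({inner},{excluded[0]})")
            else:
                compositions.append(f"diff({inner},union({','.join(excluded)}))")
    return compositions


two_sets = venn_compositions(["S1", "S2"])
three_sets = venn_compositions(["S1", "S2", "S3"])
```

With two sets this yields three compositions, and with three sets it yields seven, including `diff(S1,union(S2,S3))` and `diff(inter(S1,S2),S3)`.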
#### Composed Set
A composed set is automatically created (and persisted in MongoDB within a Set Operation) when the user performs operations on sets. A composed set does not explicitly list the identifiers of the associated documents; instead it provides:
- the list of the ids of the sets involved in the set operation,
- the query (RQL) used to expand the user query from the search page.
For instance, the user query on the composed set:
in(Mica_variable.sets,inter_s1_s2)
is expanded before being submitted to the search engine as:
and(in(sets,S1),in(sets,S2))
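The rewriting step can be sketched as a textual substitution. The sketch assumes that each composed set stores its expanded RQL query and that set criteria always have the form `in(Mica_variable.sets,<id>)`; the function name and the lookup dictionary are hypothetical.

```python
import re

# Matches a criterion on the sets field with a single set identifier.
_SET_CRITERION = re.compile(r"in\(Mica_variable\.sets,([^(),]+)\)")


def expand_composed_sets(query, expansions):
    """Replace criteria on composed sets with their stored RQL query.

    Criteria on identifiers that are not composed sets are left untouched.
    """

    def replace(match):
        return expansions.get(match.group(1), match.group(0))

    return _SET_CRITERION.sub(replace, query)


# Query stored with the composed set (taken from the example above).
expansions = {"inter_s1_s2": "and(in(sets,S1),in(sets,S2))"}
expanded = expand_composed_sets("in(Mica_variable.sets,inter_s1_s2)", expansions)
```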
#### Cart
The cart can contain sets of documents of different types.
Documents can be added to the cart when browsing the repository (network/study/dataset/variable pages) or when searching documents. From the server's point of view, a document cart is a document set without a name. The content of this document set can be updated (addition/deletion of documents). Saving the document cart simply gives this set a name (and applies the current user name if the user has logged in in the meantime).
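A minimal sketch of the cart-as-unnamed-set idea, restricted to a single document type for brevity. The class, its fields and `save_cart` are hypothetical names, not the server API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocumentSet:
    identifiers: set[str] = field(default_factory=set)
    name: Optional[str] = None  # no name means the set is a cart
    username: Optional[str] = None  # None while the user is anonymous

    @property
    def is_cart(self) -> bool:
        return not self.name

    def add(self, *ids: str) -> None:
        self.identifiers.update(ids)

    def remove(self, *ids: str) -> None:
        self.identifiers.difference_update(ids)


def save_cart(cart, name, current_user=None):
    """Saving a cart only names it (and records the user, if logged in)."""
    if not name:
        raise ValueError("a saved set needs a name")
    cart.name = name
    if current_user is not None:
        cart.username = current_user
    return cart


cart = DocumentSet()
cart.add("v1", "v2", "v3")
cart.remove("v2")
was_cart = cart.is_cart
saved = save_cart(cart, "Blood pressure variables", current_user="jdoe")
```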
#### Creation
A document set can be created by:
- getting the list of documents from the cart,
- saving a search query results,
- composing several document sets,
- importing a list of document identifiers.
#### Operation
Several document sets can be composed. The result of this operation can be used to:
- create a new document set,
- download the documents.
Operations that can be performed on document sets are (see Basic operations on Sets):
- ∪ : union
- ∩ : intersection
- ∖ : difference (relative complement)
The document set statements can be expressed in RQL:
- union(S1,S2,S3)
- inter(S1,S2,S3)
- diff(inter(S1,S2),S3)
- diff(S1,union(S2,S3))
- etc.
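To make the semantics of these statements concrete, the sketch below evaluates them against in-memory Python sets. It is a toy evaluator for the union/inter/diff forms only, not the RQL parser used by the server; `evaluate` and `OPERATORS` are hypothetical names.

```python
import re

OPERATORS = {
    "union": lambda args: set().union(*args),
    "inter": lambda args: set.intersection(*args),
    "diff": lambda args: args[0].difference(*args[1:]),
}

_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[(),]")


def evaluate(expression, sets):
    """Evaluate a statement such as 'diff(inter(S1,S2),S3)' over named sets."""
    tokens = _TOKEN.findall(expression)
    position = 0

    def parse():
        nonlocal position
        name = tokens[position]
        position += 1
        if position < len(tokens) and tokens[position] == "(":
            position += 1  # consume "("
            args = [parse()]
            while tokens[position] == ",":
                position += 1  # consume ","
                args.append(parse())
            if tokens[position] != ")":
                raise ValueError(f"expected ')' in {expression!r}")
            position += 1  # consume ")"
            return OPERATORS[name](args)
        return set(sets[name])  # a bare name refers to a stored set

    result = parse()
    if position != len(tokens):
        raise ValueError(f"unexpected trailing input in {expression!r}")
    return result


sets = {"S1": {"a", "b", "c"}, "S2": {"b", "c", "d"}, "S3": {"c", "e"}}
```

With these sample sets, `diff(inter(S1,S2),S3)` keeps only `b`, and `diff(S1,union(S2,S3))` keeps only `a`.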
#### Download
The document set can be downloaded in a CSV/TSV file.
#### Export
The list of the document identifiers of the set can be downloaded.
#### Import
A file containing the document identifiers can be uploaded to build a new enumerated document set.
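Export and import can be sketched as a round trip over a list of identifiers. The sketch assumes the identifiers are in the first column of the uploaded file, that the file has no header row, and that duplicates and blank lines are ignored; none of these details is specified here, and both function names are hypothetical.

```python
import csv
import io


def write_identifiers(identifiers):
    """Export: one document identifier per line."""
    return "".join(f"{identifier}\n" for identifier in identifiers)


def read_identifiers(text, delimiter=","):
    """Import: collect the first column, keeping first-seen order."""
    seen = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if row and row[0].strip():
            seen.setdefault(row[0].strip(), None)
    return list(seen)


exported = write_identifiers(["v1", "v2"])
from_tsv = read_identifiers("v1\tAge\nv2\tSex\n\nv1\tAge\n", delimiter="\t")
```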
#### Deletion
A document set can be deleted.
Document sets of a user can be listed. When several sets are selected in the list, the possible actions are: operation, download and deletion. The JS client will only list the sets that are in the browser local store.
Some REST resources to manage variable sets.
REST | Description |
---|---|
GET /variables/sets | Get the variable sets associated to the current user |
GET /variables/sets?id=xxx&id=xxx | List the variable sets matching the provided identifiers |
POST /variables/sets/operations?s1=xxx&s2=xxx&s3=xxx | Create a set operation from a list of sets (maximum of three) |
GET /variables/sets/operation/xxx | Get a set operation with count of documents for each of the compositions |
DELETE /variables/sets/operation/xxx | Delete a set operation |
POST /variables/sets?name=xxx | Create a variable set from: a variable RQL query, or a set RQL query (to compose variable sets), or a cart identifier. Name can be empty (this makes a variable cart). Current user is automatically associated to the set. |
POST /variables/sets/_import?name=xxx | Create a variable set by uploading a CSV/TSV file containing variable identifiers (in the first column) |
GET /variables/set/{id} | Get a variable set meta-data |
GET /variables/set/{id}/_list?offset=0&limit=20 | Page on variables of a set |
GET /variables/set/{id}/_export | Download the variable identifiers list |
GET /variables/set/{id}/_download | Download the variable list of a set as a CSV/TSV file |
POST /variables/set/{id}/variables | Add variables from: a variable RQL query, or a set RQL query (to compose variable sets), or a list of variable identifiers |
DELETE /variables/set/{id}/variables | Delete all variables of the set. |
POST /variables/set/{id}/variables/_delete | Delete a specified list of variables. |
PUT /variables/set/{id}?name=xxx | Update/set the name of the variable set. If the variable set has no name, it is a cart and the current user name is also applied. |
PUT /variables/set/{id}/_delete | Mark the variable set for removal. |
DELETE /variables/set/{id}/_delete | Unmark the variable set for removal. |
DELETE /variables/set/{id} | Delete a variable set (forced). All set operations in which the set is involved will be deleted as well. |
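
As a rough illustration of how a client might address a few of these resources, the sketch below only builds the method and URL of each request; it does not send anything. The base URL is hypothetical, the set identifier is assumed to occupy the path segment after `/variables/set/`, and request bodies are left out because their format is not specified here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://mica.example.org/ws"  # hypothetical base URL


def create_set(name=""):
    """POST /variables/sets?name=xxx (an empty name makes a cart)."""
    return "POST", f"{BASE_URL}/variables/sets?{urlencode({'name': name})}"


def create_operation(set_ids):
    """POST /variables/sets/operations?s1=xxx&s2=xxx&s3=xxx."""
    if not 1 <= len(set_ids) <= 3:
        raise ValueError("a set operation takes at most three sets")
    params = {f"s{i}": set_id for i, set_id in enumerate(set_ids, start=1)}
    return "POST", f"{BASE_URL}/variables/sets/operations?{urlencode(params)}"


def list_variables(set_id, offset=0, limit=20):
    """Page through the variables of a set (set id assumed in the path)."""
    query = urlencode({"offset": offset, "limit": limit})
    return "GET", f"{BASE_URL}/variables/set/{quote(set_id, safe='')}/_list?{query}"


method, url = list_variables("abc123", offset=20)
```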
How can the feature be tested or demonstrated? It is important to describe this in enough detail that anyone can perform the demo or test.
This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.