MAESTRO: An Approach for Composing and Analyzing Script-Based Workflows through Ontologies.
- Python3
- Conda
- Java
- MonetDB
- DfAnalyzer
- dfa-lib-python
The application was developed on Ubuntu:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
To list the Ubuntu release:
lsb_release -a
Before running the project, you must install the Python dependencies listed in the requirements.txt file inside a virtual environment. To create one, run the following command in the terminal (pinning Python ensures the environment gets its own interpreter and pip):
conda create --name maestro python=3
To list all environments:
conda env list
To activate the environment:
conda activate maestro
Now, to install the requirements:
pip3 install -r requirements.txt
dfa-lib-python is a Python package used to send provenance data to the DfAnalyzer tool. To install dfa-lib-python:
cd library
cd dfa-lib-python
python3 setup.py install
The first step of the approach is the specification of the experiment according to the concepts of experiment lines. The specification is done in the main.py file. The example provided in the repository is based on the application NMFSt.P: A Notebook for Parallel Identification of Frequent Subtrees in Phylogenetic Tree Ensembles.
After specifying the experiment and running the main.py file, a meta script will be generated. The purpose of the meta script is to invoke activity scripts adapted for collecting provenance data using DfAnalyzer. In this example, three scripts are created: prospectiveProvenance.py, act_tree_gen.py, and act_sub_find.py.
- prospectiveProvenance.py: creates the data model in the provenance capture tool;
- act_tree_gen.py: invokes construct.py (a script which constructs phylogenomic trees using the sequences in FASTA files);
- act_sub_find.py: invokes sub_find.py (a script which creates subtrees from the trees generated by construct.py).
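The generated meta script simply chains these activity scripts. A minimal illustrative sketch, assuming the scripts are invoked as subprocesses (the real wf.py is generated by main.py and may differ):
# Illustrative sketch of the generated meta script (wf.py); the real file is
# produced by main.py and may differ
import subprocess

# register the prospective provenance (data model) in DfAnalyzer
subprocess.run(["python3", "prospectiveProvenance.py"], check=True)
# run each activity script, which wraps the domain script with provenance calls
subprocess.run(["python3", "act_tree_gen.py"], check=True)
subprocess.run(["python3", "act_sub_find.py"], check=True)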
The instantiated workflow is run using:
python3 wf.py
Provenance calls are automatically inserted into the activity files. In the context of the NMFSt application, the calls were moved into the invoked files construct.py and sub_find.py to enable the collection of provenance for each FASTA file processed by the application, since each script originally receives only the directory containing the input data as a parameter.
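For reference, a retrospective provenance call made with dfa-lib-python follows the pattern below. This is a hedged sketch: the dataflow tag, transformation tag, and data values are illustrative, and the exact calls generated by MAESTRO may differ:
# Sketch of a retrospective provenance call with dfa-lib-python; the tags and
# values below are illustrative only
from dfa_lib_python.task import Task
from dfa_lib_python.dataset import DataSet
from dfa_lib_python.element import Element

t1 = Task(1, "nmfst", "act_tree_gen")  # task id, dataflow tag, transformation tag
t1.add_dataset(DataSet("iact_tree_gen", [Element(["sequences.fasta"])]))
t1.begin()   # tells DfAnalyzer the task has started
# ... build the phylogenomic tree for this FASTA file ...
t1.add_dataset(DataSet("oact_tree_gen", [Element(["tree.nwk"])]))
t1.end()     # sends the produced data and marks the task as finished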
The MAESTRO API comprises two types of functions, specification and manipulation functions, that aim to specify the experiment on the OntoExpLine structure. The first type (specification) defines the experiment according to the Experiment Line approach (conceptual level), ProvONE (concrete level), and the domain operations executed in the experiment (EDAM), and adds metadata (DC). The second type (manipulation) enables running queries that make it possible to (a query sketch follows the list):
- Get experiment line, EDAM, ProvONE, and metadata specification elements.
- Verify instance types and compatibilities according to OntoExpLine classes and properties (e.g., what program implements an experiment line activity? What data is generated by a program or abstract activity? What domain operations can a program run?).
- Convert an abstract specification into a concrete workflow.
- Verify variation points in the workflow structure.
- Add or remove optional points in the flow.
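As an illustration, the first verification above can be sketched over the loaded ontology with owlready2 (which the examples below already use). This is a hedged sketch: the class name Program and the property name implements are hypothetical stand-ins for the actual OntoExpLine vocabulary, which MAESTRO's manipulation functions wrap:
# Hedged sketch: listing which programs implement which activities with
# owlready2; 'Program' and 'implements' are hypothetical names, not MAESTRO's API
from owlready2 import get_ontology

ontoexpline = get_ontology("ontologies/ontoexpline.owl").load()
for prog in ontoexpline.Program.instances():
    for act in getattr(prog, "implements", []):
        print(prog.name, "implements", act.name)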
The functions are described below.
domainOperation(onto, "Operation Name")
In MAESTRO, domain operations are concepts integrated from the EDAM ontology that represent a bioinformatics operation. domainOperation() is a specification function used to specify a domain operation defined in EDAM that will be used to associate the abstract and concrete components defined on the experiment line and ProvONE branches. This function takes two parameters, the previously loaded ontology and the domain operation label as represented in EDAM, and returns an ontology object.
A usage example of the domainOperation() function:
#to load the ontology
ontoexpline = get_ontology("ontologies/ontoexpline.owl").load()
#to define a domain operation present on EDAM ontology
op_validation = domainOperation(ontoexpline, "Sequencing_quality_control")
createAttribute(ontoexpline, "Attribute name")
An attribute is an Experiment Line concept representing an abstraction of concrete values that may (or may not) have the same structure. To explain the concept, consider two programs that aim to generate phylogenomic trees, program A and program B. Although the two programs generate the same result (a phylogenomic tree), the output dataset generated by each one can be different: given the same input, the two outputs can be composed of distinct types of data, more or less information, or even different formats.
This way, an attribute represents an abstraction of values that can be structurally different (e.g., by information or data types) but represent equivalent results. createAttribute() is a specification function that takes two parameters, the previously loaded ontology and the attribute label, and returns an ontology object.
A usage example of the createAttribute() function:
#Creating an attribute on Experiment Line branch
sequence_input_att = createAttribute(ontoexpline, "Input_Validation")
createRelation(ontoexpline, "Relation name")
Like an attribute, a relation is an Experiment Line concept; it represents a collection of attributes. The relation concept is the abstract version of a dataset. createRelation() is a specification function that takes two parameters, the previously loaded ontology and the relation label, and returns an ontology object. A usage example of the createRelation() function:
# Creating an input relation
input_validation = createRelation(ontoexpline, "Rel_Validation_In")
associateRelationAtt(relationObject, [attributes list])
After creating attribute and relation instances, they need to be associated with each other to define an abstract dataset. associateRelationAtt() is a specification function that takes a relation object and a list of attributes as parameters.
A usage example of associateRelationAtt():
# Associating input_validation (Relation) with a list of attributes
associateRelationAtt(input_validation, [sequence_input_att])
createPort(ontoexpline, "Port name")
Port instances represent concrete data consumed and produced by programs and their execution parameters. createPort() is a specification function and uses the ontology and a string as parameters.
A usage example of createPort():
# Creating a port
input_sequence_port = createPort(ontoexpline, "ORTHOMCL1000")
associatePortAtt(Port, Attribute)
Although attribute and port instances both represent data elements, they operate at different levels: while an attribute represents data at the abstract level, a port represents the data object consumed/generated by programs/scripts. In MAESTRO's context, different programs can consume/generate the same attribute in different structures; because of this, attributes need to be explicitly associated with ports to enable equivalences between attributes and ports. associatePortAtt() is a specification function that takes a port and an attribute to create the relation.
A usage example of associatePortAtt():
associatePortAtt(input_sequence_port, sequence_input_att)
createProgram(ontoexpline, "Program name", domain_operation, "directory")
ProvONE defines a program as a computational task that consumes and produces data. In MAESTRO's context, the function used to instantiate a program needs the following parameters: the ontology, a string representing the program's name, a domain operation, and the program's directory. createProgram() is a specification function.
A usage example of createProgram():
remove_pipe = createProgram(ontoexpline, "Remove_Pipe", op_validation, "sources/remove_pipe.py")
createActivity(ontoexpline, "Activity name", domain_operation, [input_relations], [output_relations], Boolean, [programs], Boolean)
All the instances defined so far are dependencies for creating abstract activity instances. MAESTRO uses the Experiment Line as the primary approach to derive experiments from a specification based on abstract activities.
To create an abstract activity in MAESTRO, the function createActivity() takes the ontology, a string representing the name of the activity, a domain operation, a list of input relations, a list of output relations, a boolean value that indicates whether the activity is optional in the flow, a list of implementers (programs), and a boolean value that indicates whether the activity is the first in the flow. createActivity() is a specification function.
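A usage example of createActivity(), following the parameter order above. This sketch reuses the objects created in the previous examples; the activity name and the output relation output_validation (assumed to have been created with createRelation(), analogous to input_validation) are illustrative:
# Hedged usage example; "Act_Validation" and output_validation are illustrative
act_validation = createActivity(
    ontoexpline, "Act_Validation", op_validation,
    [input_validation], [output_validation],
    False,          # the activity is not optional in the flow
    [remove_pipe],  # programs that implement the activity
    True            # the activity is the first in the flow
)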
After generating the experiment derivation, it is necessary to run the DfAnalyzer tool to capture the provenance data before running wf.py.
To run DfAnalyzer-1.0-50000.jar in the terminal:
java -jar DfAnalyzer-1.0-50000.jar
The jar file starts DfAnalyzer, which will be ready to receive the data generated by the provenance calls inserted in construct.py and sub_find.py as they are executed. DfAnalyzer-1.0.jar is a DfAnalyzer version that stores strings shorter than 200 characters; DfAnalyzer-1.0-50000.jar is a version that stores strings shorter than 50,000 characters.
To access the provenance database, it is necessary to use the MonetDB client with the password monetdb:
mclient -u monetdb -d dataflow_analyzer
To list all tables:
\d
Example queries to run after wf.py finishes:
To list all consumed data from act_tree_gen:
select * from iact_tree_gen;
To list all generated data from act_tree_gen:
select * from oact_tree_gen;
To clean the DfAnalyzer database, run in the terminal:
./restore-database.sh
This project is licensed under the MIT License.