- Getting Started
- The Use Case
- How the Data Pipeline Works
- Knowledge Graph Queries
- Outstanding Major Issues
- Authors
  - Adam Broniewski -> GitHub | LinkedIn | Website
  - Vlada Kylynnyk -> GitHub
- License
This project has a single branch: main
The project has the structure below:
IdleCompute-Data-Management-Architecture
├── LICENSE.md
├── Pipfile
├── Pipfile.lock
├── README.md
├── data
│   ├── admin
│   └── raw
├── docs
│   ├── BDM-Project1-Report.pdf
│   ├── BDM-Project2-Report.pdf
│   ├── DataPipeline.png
│   └── IdleComputeSchedule.png
└── src
    ├── dataset_analytics.py
    ├── idle_host_list.py
    ├── knowledge_graph_ABOX.py
    ├── knowledge_graph_TBOX.py
    ├── knowledge_graph_queries.txt
    ├── knowledge_graph_run.py
    ├── landing_zone.py
    ├── main.py
    ├── partition_dataset.py
    ├── read_partitioned_dataset.py
    └── schedule_generation.py
- Clone the project
- Install dependencies using the Pipfile
- Run main.py
This will use the sample data provided in data/raw and run through the full data pipeline.
This project is a proof of concept for a startup business called "IdleCompute", which aims to leverage the idle time of individual computers to complete distributed processing. The idea is similar to SETI@home or Folding@Home, but is agnostic to industry or academic setting.
The proof of concept is a data pipeline that takes a user-uploaded dataset in CSV or JSON format, lands it in persistent storage in Parquet format, completes a parallelized exploratory data analysis, builds a linear regression model, and tests the model's accuracy.
- The pipeline is dataset agnostic: model building and exploratory analysis are completed "blind" to the number of attributes in the dataset.
- The pipeline takes advantage of the HDFS file system and the Parquet hybrid file format for storage and distribution efficiency (see the landing sketch below).
- Knowledge graphs are implemented to take advantage of relationship-level queries for quality assurance measures.
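As a rough illustration of the landing step, the sketch below reads a CSV or JSON upload with PySpark and persists it as Parquet. It is a minimal sketch, not the actual landing_zone.py implementation: the paths and the filename convention (user, dataset name and upload date separated by underscores) are assumptions made for the example.

```python
# Minimal landing-zone sketch (assumes PySpark is available).
# The paths and filename convention are hypothetical examples,
# not the exact logic used in src/landing_zone.py.
from pathlib import Path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("landing-zone-sketch").getOrCreate()


def land_dataset(raw_path: str, processed_root: str = "data/processed") -> str:
    """Read a raw CSV/JSON upload and persist it as Parquet."""
    path = Path(raw_path)
    # Hypothetical filename convention: <user>_<dataset>_<YYYYMMDD>.csv
    user, dataset, upload_date = path.stem.split("_")

    if path.suffix == ".csv":
        df = spark.read.csv(raw_path, header=True, inferSchema=True)
    else:
        df = spark.read.json(raw_path)

    # The directory structure preserves the metadata recovered from the filename.
    target = f"{processed_root}/{user}/{dataset}/{upload_date}"
    df.write.mode("overwrite").parquet(target)
    return target
```

Storing the landed data as Parquet is what gives the later steps the flexibility to read either row-wise or column-wise.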
- A file structure is generated from the dataset filename to maintain metadata about each dataset and when it was uploaded. The datasets are landed in data/processed.
- The file is stored in an HDFS file structure in Parquet format. This allows for flexibility in reading the data horizontally (rows) or vertically (columns).
- A dummy schedule is read to determine when a particular dataset should be processed and which users on the distributed system should be used to complete the processing. Admin files like the schedule are stored in data/admin.
- The dataset is partitioned depending on the number of nodes that exist in the schedule, using RDDs and Map/Reduce-style operations (see the sketch after this list).
- Partitioned datasets are stored in data/partitioned.
- The partitioned dataset has descriptive analytics completed. The dataset description is stored in data/analyzed, PNG plots are stored in data/analyzed/plots, and serialized (pickle format) plots are stored in data/analyzed/plots_serialized.
- A predictive model is created and stored for future use in data/analyzed/model. The model accuracy is stored in data/analyzed.
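The sketch below ties these steps together: repartitioning by the number of nodes in the schedule, producing descriptive statistics, and fitting and evaluating a regression model without knowing the attributes in advance. It is a simplified sketch under assumed paths, schedule value and column layout (the last column is treated as the target and the remaining columns as numeric features); the actual logic lives in partition_dataset.py, dataset_analytics.py and the related scripts.

```python
# Simplified sketch of the partitioning, analytics and modelling steps.
# Paths, the schedule value and the column layout are assumptions made
# for illustration; see src/partition_dataset.py and src/dataset_analytics.py
# for the real implementation.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# 1. Read the landed Parquet data and repartition it to match the number
#    of nodes listed in the (dummy) schedule.
df = spark.read.parquet("data/processed/example")  # hypothetical path
num_nodes = 4                                       # would be read from the schedule in data/admin
partitioned = df.repartition(num_nodes)
partitioned.write.mode("overwrite").parquet("data/partitioned/example")

# 2. Descriptive analytics, "blind" to the number of attributes.
partitioned.describe().show()

# 3. Build a linear regression model: treat the last column as the target
#    and the remaining columns as features (assumed numeric here).
*feature_cols, label_col = partitioned.columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(partitioned).withColumnRenamed(label_col, "label")

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

# 4. Test the model's accuracy on the held-out split.
rmse = RegressionEvaluator(metricName="rmse").evaluate(model.transform(test))
print(f"RMSE: {rmse:.3f}")
```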
Graph analytics were used to:
- Identify the dataset contributors whose Tasks the most (or fewest) users are interested in analyzing
- Identify related Resources (nodes) that can be assigned analytical work units for quality assurance
A dummy dataset was created to represent the relationship between data being analyzed and the nodes completing the analysis. The data generator is located here: data/knowledge-graph/SDM_point.ipynb. This data is used to generate a TBOX (schema) and an ABOX (instances/data) that are then queried in GraphDB after uploading the data.
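As a rough sketch of what this generation step produces, the example below builds a tiny TBOX and ABOX with rdflib, serializes both to Turtle, and runs one SPARQL query of the kind described above. The namespace, class and property names (Contributor, Task, contributesTo, interestedIn) are hypothetical placeholders, not the project's actual vocabulary.

```python
# Tiny TBOX/ABOX sketch using rdflib. The vocabulary below is a
# hypothetical placeholder, not the project's actual schema.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

IC = Namespace("http://example.org/idlecompute#")

# TBOX: classes and properties (the schema).
tbox = Graph()
tbox.bind("ic", IC)
tbox.add((IC.Contributor, RDF.type, RDFS.Class))
tbox.add((IC.Task, RDF.type, RDFS.Class))
tbox.add((IC.contributesTo, RDFS.domain, IC.Contributor))
tbox.add((IC.contributesTo, RDFS.range, IC.Task))
tbox.serialize("tbox.ttl", format="turtle")

# ABOX: instances (the data).
abox = Graph()
abox.bind("ic", IC)
abox.add((IC.alice, RDF.type, IC.Contributor))
abox.add((IC.task1, RDF.type, IC.Task))
abox.add((IC.task1, RDFS.label, Literal("Example analysis task")))
abox.add((IC.alice, IC.contributesTo, IC.task1))
abox.add((IC.bob, IC.interestedIn, IC.task1))
abox.serialize("abox.ttl", format="turtle")

# The two .ttl files can be uploaded to GraphDB; the same kind of query
# (here: how many users are interested in each contributor's tasks)
# can also be run locally with rdflib.
query = """
PREFIX ic: <http://example.org/idlecompute#>
SELECT ?contributor (COUNT(?user) AS ?interested)
WHERE {
  ?contributor ic:contributesTo ?task .
  ?user ic:interestedIn ?task .
}
GROUP BY ?contributor
ORDER BY DESC(?interested)
"""
for row in (tbox + abox).query(query):
    print(row.contributor, row.interested)
```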
The queries used in GraphDB can be found in the knowledge_graph_queries.txt file.
To test the queries out:
- Run knowledge_graph_run.py. This will generate the two Turtle files that need to be uploaded to GraphDB to run the queries required for analysis.
- Download a free version of GraphDB and create a new GraphDB database. For the configuration, use the following settings:
  - Ruleset: “RDFS (Optimized)”, disable owl:sameAs
  - Indexing ID size: 32-bit, enable the predicate list index
- Open SPARQL Query and Update and copy the text from knowledge_graph_queries.txt into the editor to run the queries.
As a whole, the ability to land data and complete analytics in a distributed manner while remaining "dataset agnostic" was accomplished. The most pressing difficulty is the inability to dictate which node in a cluster will handle which partition of a dataset. It is the view of the authors that using Hadoop "out of the box" in this manner will not be possible; some manipulation of the Hadoop source code would be required to assign partitions to specific nodes.
As an example, Folding@Home makes use of Cosm software to achieve its distribution. Significant further work, far beyond the scope of this project, would be required to bring this to reality.
To ensure security and keep each data partition tamper-proof, the system would need to be virtualized, meaning a virtual partition would be created on each node's machine. To ensure the accuracy of each computation, duplicate computations would need to be completed by different machines and compared against each other.
IdleCompute Data Management Architecture is open-source software licensed under the MIT license.