- Getting Started
- The Use Case
- How the Data Pipeline Works
- Knowledge Graph Queries
- Outstanding Major Issues
- Authors
  - Adam Broniewski -> GitHub | LinkedIn | Website
  - Vlada Kylynnyk -> GitHub
- License
This project has a single branch: main
The project has the structure below:
IdleCompute-Data-Management-Architecture
├── LICENSE.md
├── Pipfile
├── Pipfile.lock
├── README.md
├── data
│   ├── admin
│   └── raw
├── docs
│   ├── BDM-Project1-Report.pdf
│   ├── BDM-Project2-Report.pdf
│   ├── DataPipeline.png
│   └── IdleComputeSchedule.png
└── src
    ├── dataset_analytics.py
    ├── idle_host_list.py
    ├── knowledge_graph_ABOX.py
    ├── knowledge_graph_TBOX.py
    ├── knowledge_graph_queries.txt
    ├── knowledge_graph_run.py
    ├── landing_zone.py
    ├── main.py
    ├── partition_dataset.py
    ├── read_partitioned_dataset.py
    └── schedule_generation.py
- Clone the project
- Install dependencies using the Pipfile
- Run main.py
This will use the sample data provided in data/raw and run through the full data pipeline.
This project is a proof of concept for a startup business called "IdleCompute", which aims to leverage the idle time of individual computers to complete distributed processing. The idea is similar to SETI@home or Folding@Home, but is agnostic to industry or academic setting.
The proof of concept is a data pipeline that takes a user-uploaded dataset in CSV or JSON format, lands it in persistent storage in Parquet format, completes a parallelized exploratory data analysis, builds a linear regression model, and tests the model's accuracy.
- The pipeline is dataset agnostic: model building and exploratory analysis are completed "blind" to the number of attributes in the dataset.
- The pipeline takes advantage of the HDFS file system and the Parquet hybrid file format for storage and distribution efficiency (see the landing sketch below).
- Knowledge graphs are implemented to take advantage of relationship-level queries for quality assurance measures.
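As a rough illustration of the landing step, the sketch below reads a CSV or JSON upload with PySpark and persists it as Parquet. It is a minimal sketch, not the actual landing_zone.py implementation: the paths and the filename convention (user, dataset name and upload date separated by underscores) are assumptions made for the example.

```python
# Minimal landing-zone sketch (assumes PySpark is available).
# The paths and filename convention are hypothetical examples,
# not the exact logic used in src/landing_zone.py.
from pathlib import Path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("landing-zone-sketch").getOrCreate()


def land_dataset(raw_path: str, processed_root: str = "data/processed") -> str:
    """Read a raw CSV/JSON upload and persist it as Parquet."""
    path = Path(raw_path)
    # Hypothetical filename convention: <user>_<dataset>_<YYYYMMDD>.csv
    user, dataset, upload_date = path.stem.split("_")

    if path.suffix == ".csv":
        df = spark.read.csv(raw_path, header=True, inferSchema=True)
    else:
        df = spark.read.json(raw_path)

    # The directory structure preserves the metadata recovered from the filename.
    target = f"{processed_root}/{user}/{dataset}/{upload_date}"
    df.write.mode("overwrite").parquet(target)
    return target
```

Storing the landed data as Parquet is what gives the later steps the flexibility to read either row-wise or column-wise.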
- A file structure is generated from the dataset filename to maintain metadata about each dataset and when it was uploaded. The datasets are landed in data/processed.
- The file is stored in an HDFS file structure in Parquet format. This allows for flexibility in reading the data horizontally (rows) or vertically (columns).
- A dummy schedule is read to determine when a particular dataset should be processed and which users on the distributed system should be used to complete the processing. Admin files like the schedule are stored in data/admin.
- The dataset is partitioned depending on the number of nodes that exist in the schedule, using RDDs and Map/Reduce-style operations (see the sketch after this list).
- Partitioned datasets are stored in data/partitioned.
- The partitioned dataset has descriptive analytics completed. The dataset description is stored in data/analyzed, PNG plots are stored in data/analyzed/plots, and serialized (pickle format) plots are stored in data/analyzed/plots_serialized.
- A predictive model is created and stored for future use in data/analyzed/model. The model accuracy is stored in data/analyzed.
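The sketch below ties these steps together: repartitioning by the number of nodes in the schedule, producing descriptive statistics, and fitting and evaluating a regression model without knowing the attributes in advance. It is a simplified sketch under assumed paths, schedule value and column layout (the last column is treated as the target and the remaining columns as numeric features); the actual logic lives in partition_dataset.py, dataset_analytics.py and the related scripts.

```python
# Simplified sketch of the partitioning, analytics and modelling steps.
# Paths, the schedule value and the column layout are assumptions made
# for illustration; see src/partition_dataset.py and src/dataset_analytics.py
# for the real implementation.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# 1. Read the landed Parquet data and repartition it to match the number
#    of nodes listed in the (dummy) schedule.
df = spark.read.parquet("data/processed/example")  # hypothetical path
num_nodes = 4                                       # would be read from the schedule in data/admin
partitioned = df.repartition(num_nodes)
partitioned.write.mode("overwrite").parquet("data/partitioned/example")

# 2. Descriptive analytics, "blind" to the number of attributes.
partitioned.describe().show()

# 3. Build a linear regression model: treat the last column as the target
#    and the remaining columns as features (assumed numeric here).
*feature_cols, label_col = partitioned.columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(partitioned).withColumnRenamed(label_col, "label")

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

# 4. Test the model's accuracy on the held-out split.
rmse = RegressionEvaluator(metricName="rmse").evaluate(model.transform(test))
print(f"RMSE: {rmse:.3f}")
```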
Graph analytics were used to:
- Identify the dataset contributors whose Tasks the most (or fewest) users are interested in analyzing
- Identify related Resources (nodes) that can be assigned analytical work units for quality assurance
A dummy dataset was created to represent the relationship between data being analyzed and the nodes completing the analysis. The data generator is located here: data/knowledge-graph/SDM_point.ipynb. This data is used to generate a TBOX (schema) and an ABOX (instances/data) that are then queried in GraphDB after uploading the data.
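As a rough sketch of what this generation step produces, the example below builds a tiny TBOX and ABOX with rdflib, serializes both to Turtle, and runs one SPARQL query of the kind described above. The namespace, class and property names (Contributor, Task, contributesTo, interestedIn) are hypothetical placeholders, not the project's actual vocabulary.

```python
# Tiny TBOX/ABOX sketch using rdflib. The vocabulary below is a
# hypothetical placeholder, not the project's actual schema.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

IC = Namespace("http://example.org/idlecompute#")

# TBOX: classes and properties (the schema).
tbox = Graph()
tbox.bind("ic", IC)
tbox.add((IC.Contributor, RDF.type, RDFS.Class))
tbox.add((IC.Task, RDF.type, RDFS.Class))
tbox.add((IC.contributesTo, RDFS.domain, IC.Contributor))
tbox.add((IC.contributesTo, RDFS.range, IC.Task))
tbox.serialize("tbox.ttl", format="turtle")

# ABOX: instances (the data).
abox = Graph()
abox.bind("ic", IC)
abox.add((IC.alice, RDF.type, IC.Contributor))
abox.add((IC.task1, RDF.type, IC.Task))
abox.add((IC.task1, RDFS.label, Literal("Example analysis task")))
abox.add((IC.alice, IC.contributesTo, IC.task1))
abox.add((IC.bob, IC.interestedIn, IC.task1))
abox.serialize("abox.ttl", format="turtle")

# The two .ttl files can be uploaded to GraphDB; the same kind of query
# (here: how many users are interested in each contributor's tasks)
# can also be run locally with rdflib.
query = """
PREFIX ic: <http://example.org/idlecompute#>
SELECT ?contributor (COUNT(?user) AS ?interested)
WHERE {
  ?contributor ic:contributesTo ?task .
  ?user ic:interestedIn ?task .
}
GROUP BY ?contributor
ORDER BY DESC(?interested)
"""
for row in (tbox + abox).query(query):
    print(row.contributor, row.interested)
```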
The queries used in GraphDB can be found in the knowledge_graph_queries.txt file.
To test the queries out:
- Run knowledge_graph_run.py. This will generate the two Turtle files that need to be uploaded to GraphDB to run the queries required for analysis.
- Download a free version of GraphDB and create a new GraphDB database. For the configuration, use the following settings:
  - Ruleset: “RDFS (Optimized)”, disable owl:sameAs
  - Indexing ID size: 32-bit, enable the predicate list index
- Open SPARQL Query and Update and copy the text from knowledge_graph_queries.txt into the editor to run the queries.
As a whole, the ability to land data and complete analytics in a distributed manner while remaining "dataset agnostic" was accomplished. The most pressing difficulty is the inability to dictate which node in a cluster will handle which partition of a dataset. It is the view of the authors that using Hadoop "out of the box" in this manner will not be possible; some manipulation of the Hadoop source code would be required to assign partitions to specific nodes.
As an example, Folding@Home makes use of Cosm software to achieve its distribution. Significant further work, far beyond the scope of this project, would be required to bring this to reality.
To ensure security and keep each data partition tamper-proof, the system would need to be virtualized, meaning a virtual partition would be created on each node's machine. To ensure the accuracy of each computation, duplicate computations would need to be completed by different machines and compared against each other.
IdleCompute Data Management Architecture is open-source software licensed under the MIT license.