
Cleansing Tool

Stamatis Pitsios edited this page Aug 26, 2018 · 8 revisions

Deployment

To set up the cleansing tool, Docker and Docker Compose must be installed. Then clone the corresponding repository, cd into it, and run docker-compose up. Once setup is complete, the application is accessible at http://IP:5000/cleaner/web, where IP is the address of the host running the containers.
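After `docker-compose up` finishes, it can be handy to confirm that the web UI is actually answering. The snippet below is a minimal sketch: the helper names `cleaner_url` and `is_reachable` are hypothetical, and only the port and path are taken from the address given above.

```python
import urllib.error
import urllib.request


def cleaner_url(host: str, port: int = 5000) -> str:
    """Build the web UI address from the host the containers run on."""
    return f"http://{host}:{port}/cleaner/web"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, DNS failures and HTTP error codes.
        return False
```

For example, `is_reachable(cleaner_url("localhost"))` should return `True` once the containers are up, assuming the tool is deployed on the local machine.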

Main Page

Usage

Step 1 (Login)

To use the features of the cleansing tool, you must log in first.

Login

The default username/password is aegis/aegis

Step 2 (Dataset Declaration)

Before setting up cleaning rules, the datasets of interest and their variables must be registered. The tool assumes the following hierarchy. First, we define the providers. A provider, or data owner, is the company, organization, or individual who possesses the data. Each provider has a set of datasets, and each dataset has a set of variables. Note that it is not necessary to register all of the datasets and variables, only the ones that you are interested in cleaning.

The tool offers an easy-to-use UI for creating this structure.
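The provider, dataset, and variable hierarchy can be pictured as nested records. The sketch below is illustrative only: the class names and the example provider are hypothetical and do not reflect the tool's internal model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variable:
    name: str


@dataclass
class Dataset:
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Provider:
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def variable_paths(self) -> List[str]:
        """Return 'provider/dataset/variable' for every registered variable."""
        return [
            f"{self.name}/{ds.name}/{var.name}"
            for ds in self.datasets
            for var in ds.variables
        ]


# Hypothetical example: only the variables we intend to clean are registered.
provider = Provider(
    name="acme",
    datasets=[
        Dataset(
            name="sensors",
            variables=[Variable("temperature"), Variable("humidity")],
        )
    ],
)
```

Calling `provider.variable_paths()` lists every variable together with the dataset and provider it belongs to.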

Dataset Registration

Step 4 (Rule Registration)

Next, we create the cleaning rules of interest.

Rules

Rules fall under 3 different categories:

  1. Validation Rules: They define constraints that should be checked (e.g. whether the values of a column fall within a desired range).

Validation Rules

  2. Cleaning Rules: They define actions that should be taken when a specific validation rule is violated (e.g. if the values of a column fall outside a desired range, replace those values with a predefined value).

Cleaning Rules

  3. Missing Values Rules: They define actions that should be taken when a column contains empty values.

Missing Rules
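How the three rule categories interact can be sketched in a few lines. This is not the tool's implementation: the function names, the row representation, and the choice of replacement values are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


def in_range(low: float, high: float) -> Callable[[float], bool]:
    """Validation rule: passes when low <= value <= high."""
    return lambda value: low <= value <= high


def clean_column(
    rows: List[Row],
    column: str,
    is_valid: Callable[[float], bool],
    replacement: float,
    missing_default: float,
) -> List[Row]:
    """Apply a missing-values rule, then a validation rule with its cleaning rule."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        value = new_row.get(column)
        if value is None:
            # Missing values rule: fill empty cells with a default.
            new_row[column] = missing_default
        elif not is_valid(value):
            # Cleaning rule: replace values that violate the validation rule.
            new_row[column] = replacement
        cleaned.append(new_row)
    return cleaned


rows = [{"temperature": 21.5}, {"temperature": 999.0}, {"temperature": None}]
result = clean_column(
    rows, "temperature", in_range(-50.0, 60.0), replacement=60.0, missing_default=0.0
)
```

Here the out-of-range reading is replaced with the predefined value `60.0` and the empty cell is filled with `0.0`, while the valid reading passes through unchanged.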

Step 5 (Clean Data)

Having registered the necessary rules, you can proceed to the cleaning process. The tool offers a simple UI for choosing the provider and dataset and then uploading a file to clean. Currently, the only supported file type is CSV, up to 50MB in size. For larger datasets, it is advised to use the tool's API instead.
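A quick local check can tell you whether a file qualifies for the UI upload or should go through the API. The sketch below is an assumption-laden helper, not part of the tool: the function name is hypothetical, and "50MB" is read here as 50 MiB.

```python
import os

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50MB" interpreted as 50 MiB


def can_upload_via_ui(path: str, size_bytes: int) -> bool:
    """Return True if the file has a .csv extension and fits the UI size limit."""
    is_csv = os.path.splitext(path)[1].lower() == ".csv"
    return is_csv and size_bytes <= MAX_UPLOAD_BYTES
```

In practice the size would come from `os.path.getsize(path)`; when the check returns `False` for a large CSV, the API is the recommended route.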

Clean

Once the process completes, a new cleaned CSV file will be returned.

Step 6 (Check Logs)

The cleansing tool offers a feature for reviewing, in detail, all the actions that took place during the cleaning process. These actions are stored as log files.

Logs

Opening a specific log file shows a detailed explanation of the actions that took place, as well as a dashboard that displays statistics about the cleaning run.

Logs Details

Logs Dashboard

API

The tool exposes several API endpoints. These endpoints are documented using Swagger and can be found here.
