Cleansing Tool
To set up the cleansing tool, Docker and docker-compose must be installed. Then clone the corresponding repository, cd into it, and run docker-compose up. Once setup is complete, the application is accessible at http://IP:5000/cleaner/web, where IP is the address of the host running the containers.
To use the features of the cleansing tool, you must log in first.
The default username/password is aegis/aegis.
Before setting up cleaning rules, register the datasets of interest and their variables. The tool assumes the following hierarchy. First, we define the providers. A provider, or data owner, is the company, organization, or individual who possesses the data. Each provider has a set of datasets, and each dataset has a set of variables. Note that it is not necessary to register all of the datasets and variables, only the ones that you want to clean.
The tool offers an easy-to-use UI for creating the aforementioned structure.
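The provider, dataset, and variable hierarchy described above can be pictured as a small nested data model. The following is only an illustrative sketch: the class names, field names, and example values are assumptions made for this example and are not taken from the tool's code or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variable:
    # A single column of a dataset that you want to clean.
    name: str


@dataclass
class Dataset:
    # A dataset owns a set of variables.
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Provider:
    # A provider (data owner) owns a set of datasets.
    name: str
    datasets: List[Dataset] = field(default_factory=list)


# Hypothetical example: register only the variables that need cleaning.
provider = Provider(
    name="Example Org",
    datasets=[
        Dataset(
            name="sensor_readings",
            variables=[Variable("temperature"), Variable("humidity")],
        )
    ],
)
```

In the tool itself this structure is created through the UI; the sketch only shows how the three levels relate to each other.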
Next, we create the cleaning rules of interest.
Rules fall into three categories:
- Validation Rules: define constraints that should be checked (e.g. whether the values of a column fall within a desired range).
- Cleaning Rules: define the actions to take when a specific validation rule is violated (e.g. if values of a column are outside a desired range, replace those values with a predefined value).
- Missing Values Rules: define the actions to take when a column contains empty values.
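To make the three rule categories concrete, here is a minimal sketch of how they might interact on a single numeric column. It does not reflect the tool's actual implementation; the function names, parameters, and sample values are all hypothetical.

```python
from typing import List, Optional


def in_range(value: float, low: float, high: float) -> bool:
    # Validation rule (hypothetical): the value must lie within [low, high].
    return low <= value <= high


def clean_column(
    values: List[Optional[str]],
    low: float,
    high: float,
    out_of_range_value: float,
    missing_value: float,
) -> List[float]:
    cleaned = []
    for raw in values:
        if raw is None or raw.strip() == "":
            # Missing values rule: fill empty cells with a predefined value.
            cleaned.append(missing_value)
            continue
        number = float(raw)
        if in_range(number, low, high):
            cleaned.append(number)
        else:
            # Cleaning rule: the validation rule was violated, so the value
            # is replaced with a predefined value.
            cleaned.append(out_of_range_value)
    return cleaned


# Hypothetical temperature column with one empty cell and one outlier.
result = clean_column(
    ["21.5", "", "999", "18"],
    low=-50.0,
    high=60.0,
    out_of_range_value=60.0,
    missing_value=0.0,
)
print(result)  # [21.5, 0.0, 60.0, 18.0]
```

The validation rule only detects a problem; the cleaning rule and the missing values rule decide what happens to the affected cells.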
Having registered the necessary rules, you can proceed to the cleaning process. The tool offers a simple UI for selecting the provider and dataset and then uploading the file to clean. Currently, CSV is the only supported file type, with a maximum size of 50MB. For larger datasets, it is advised to use the tool's API instead.
Once the process completes, a new cleaned CSV file will be returned.
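Because the UI accepts CSV files only up to 50MB, it can be handy to check a file's size before deciding between the UI and the API. The sketch below is a local helper, not part of the tool: the function names are hypothetical, and it assumes "50MB" means 50 * 1024 * 1024 bytes.

```python
import os

# Assumption: the 50MB UI limit is interpreted as 50 MiB.
MAX_UI_UPLOAD_BYTES = 50 * 1024 * 1024


def exceeds_ui_limit(size_bytes: int) -> bool:
    # True when a file of this size is too large for the UI upload.
    return size_bytes > MAX_UI_UPLOAD_BYTES


def should_use_api(csv_path: str) -> bool:
    # Look up the file size on disk and compare it against the UI limit.
    return exceeds_ui_limit(os.path.getsize(csv_path))
```

For example, should_use_api("readings.csv") returns True when the (hypothetical) file readings.csv is larger than the assumed limit, in which case the API is the advised route.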
The cleansing tool offers a feature for reviewing in detail all the actions that took place during the cleaning process. Actions are stored as log files.
By opening a specific log file, you can get a detailed explanation of those actions, along with a dashboard that displays related statistics.
The tool exposes several API endpoints. They are documented using Swagger and can be found here.