A cooperative data cleaning standalone application.
- A statement of need
- Installation instructions
- Example usage
- Community guidelines
- Tests
- Software license
DataCleaningTool is a user-friendly, free and open-source standalone application that supports cooperative data cleaning. The tool identifies potential data problems and reports the results so that users can make informed decisions and clean data effectively.
The primary ideas behind developing DataCleaningTool are the following.
- Time efficiency: Manually going through a large number of datasets to identify errors is a time-consuming task. DataCleaningTool systematically examines data for errors and cleans them automatically using algorithms.
- Cooperativeness: DataCleaningTool is not a black box, so it does not produce results that the user cannot easily understand. It motivates and illustrates its suggestions, and the user remains in control, taking the final action at every stage of the data cleaning process.
- Coverage of a reasonable number of data problems that cause erroneous conclusions and failing algorithms: DataCleaningTool aims to clean data by resolving inconsistencies, smoothing noisy data, dealing with outliers, and imputing missing observations using a model-based imputation method.
DataCleaningTool should be of interest mainly to readers in the area of data science.
DataCleaningTool has been developed and tested in MATLAB Version: R2018b and requires the following toolboxes.
- System Identification Toolbox;
- Statistics and Machine Learning Toolbox;
- Financial Toolbox;
- MATLAB Report Generator.
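Whether these toolboxes are available in a given MATLAB installation can be checked from the Command Window. The following is a minimal sketch, assuming the product names reported by ver match the names in the list above; the variable names are illustrative and not part of DataCleaningTool.

```matlab
% Product names taken from the requirements list above (assumed to match
% the names reported by ver).
required = {'System Identification Toolbox', ...
            'Statistics and Machine Learning Toolbox', ...
            'Financial Toolbox', ...
            'MATLAB Report Generator'};

installed = ver;                         % one struct per installed product
installedNames = {installed.Name};

% Required products that are not present in this installation.
notFound = required(~ismember(required, installedNames));

if isempty(notFound)
    disp('All required toolboxes are installed.');
else
    fprintf('Not installed: %s\n', strjoin(notFound, ', '));
end
```

An empty notFound list means every listed product is installed; it does not check that a license is currently available for each one.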
The DataCleaningTool Standalone Desktop App has been tested on Windows 10 and requires the MATLAB Compiler Runtime (MCR) R2018b to be installed.
Standalone Desktop App.
Step 1. Download Standalone Desktop App/for_redistribution.zip and unzip it to a preferred location.
Step 2. Run the executable file "DataCleaningTool.exe" and follow the instructions. If not already present, the MATLAB Compiler Runtime (MCR) R2018b will be downloaded from the web and installed automatically.
Step 3. Once installed, the app is added to the Start Menu in Windows.
Step 4. Click the app icon to run the program.
MATLAB App.
Step 1. Download MATLAB App/DataCleaningTool.mlappinstall.
Step 2. Add the app installer file "DataCleaningTool.mlappinstall" to the current working folder in MATLAB.
Step 3. Double-click "DataCleaningTool.mlappinstall".
Step 4. A dialog box is opened. Click 'Install'.
Step 5. Once installed, the app is added to the MATLAB Toolstrip. Locate the installed app and select 'Add to favorites'.
Step 6. Click the app icon to run the program.
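As an alternative to the dialog-based steps above, the app installer can also be run from the Command Window. This is a sketch only, assuming "DataCleaningTool.mlappinstall" has already been downloaded to the current working folder; it uses the standard matlab.apputil functions rather than anything specific to DataCleaningTool.

```matlab
% Assumes the installer file has been downloaded to the current folder.
appFile = 'DataCleaningTool.mlappinstall';

if exist(appFile, 'file') == 2
    info = matlab.apputil.install(appFile);   % installs the app, returns its info
    fprintf('Installed app "%s" (id: %s)\n', info.name, info.id);
    matlab.apputil.run(info.id);              % launches the installed app
else
    fprintf('%s was not found in %s\n', appFile, pwd);
end
```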
Functions.
Step 1. Download all the *.m files from DataCleaningTool/Functions. The folder contains a main file and ten dependent files.
Step 2. Add all the *.m files to the current working folder in MATLAB.
Step 3. Run "DataCleaningTool.m".
For the complete reference documentation, which uses the example dataset demodata.csv, refer to UserManual.pdf.
DataCleaningTool is a data cleaning application which consists of multiple widgets and buttons. Each widget illustrates its corresponding data cleaning mechanism and each button aims to deal with a specific data problem.
A simple example is demonstrated on the Imputation widget using the example dataset demodata_clean.csv. The Imputation widget displays information about missing data and the expected error of imputation for numerical and categorical features. The properties of the Imputation widget are as follows.
- The widget shows statistics about missing data, such as the percentage of missing data and the expected error of imputation for numerical and categorical features. The expected error is predicted from the performance analysis results of the missForest method for the specific ratio of data and percentage of missing data.
- The widget also presents the missing observations percentage table and the missingness plot.
- The Delete Feature button is used to delete a feature from the data, for example a feature that contains a large number of missing values.
- The Impute button is used to replace missing observations with estimated ones using the missForest algorithm. If datetime observations are missing, a message stating that datetime imputation is not possible appears in red in the lower part of the Imputation widget.
- The missing data information in the widget is updated after each activity.
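The kind of per-feature statistic shown in the missing observations percentage table can be reproduced by hand for a small table. The sketch below is not the tool's own code: the table T and its values are made up for illustration, and the output column names Feature and PercentMissing are assumptions.

```matlab
% Small illustrative table; values are invented, not taken from the demo data.
T = table([1; NaN; 3; NaN], {'a'; ''; 'c'; 'd'}, [10; 20; NaN; 40], ...
          'VariableNames', {'Longitude', 'City', 'Value'});

% ismissing flags NaN in numeric columns and empty text in cellstr columns.
missingMask = ismissing(T);                     % rows-by-features logical matrix

% Percentage of missing observations per feature.
pctMissing = 100 * sum(missingMask, 1) / height(T);

summaryTable = table(T.Properties.VariableNames', pctMissing', ...
                     'VariableNames', {'Feature', 'PercentMissing'});
disp(summaryTable)
```

Here 'Longitude' has two of four values missing (50 percent), while 'City' and 'Value' each have one (25 percent).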
Drop feature with large number of missing observations.
In the example dataset, 'Longitude' has a large number of missing values. We use the Delete Feature button to delete the 'Longitude' feature.
Step 1. Select a feature from the Feature column of the missing observations percentage table. Click the Delete Feature button.
Step 2. While in use, the Delete Feature button turns grey; it returns to its original color once the task is completed. Check that the selected feature is deleted.
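Outside the app, the same effect can be sketched in a few lines of MATLAB. In the tool the user selects the feature to delete; the 50 percent cut-off below is a hypothetical value chosen only for this illustration, as are the table contents.

```matlab
% Illustrative data: 'Longitude' is mostly missing, 'Latitude' is mostly present.
T = table([NaN; NaN; 3; NaN], [1; 2; NaN; 4], ...
          'VariableNames', {'Longitude', 'Latitude'});

threshold = 50;                                 % hypothetical cut-off, in percent

% Percentage of missing observations per feature.
pctMissing = 100 * mean(ismissing(T), 1);

% Remove every feature whose missing percentage exceeds the cut-off.
T(:, pctMissing > threshold) = [];

disp(T.Properties.VariableNames)
```

With these values only 'Longitude' (75 percent missing) is removed, and 'Latitude' (25 percent missing) is kept.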
Illustrate and impute missing observations.
We use the Impute button to impute missing values in the example data using the missForest method.
Step 1. Click the Impute button.
Step 2. While in use, the Impute button turns grey; it returns to its original color once the task is completed. Check that the missing observations are imputed.
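To give a feel for model-based imputation, the sketch below fills the missing values of one numerical feature with random forest predictions, using TreeBagger from the Statistics and Machine Learning Toolbox. It is a simplified single pass over one feature on synthetic data, not the missForest algorithm itself (which iterates over all features until the imputed values stabilize) and not the tool's implementation.

```matlab
rng(0);                                   % reproducible synthetic data

% Two fully observed predictors and one target feature with gaps.
X = rand(200, 2);
y = 3 * X(:, 1) + X(:, 2).^2;
y(1:20) = NaN;                            % introduce missing observations

observed = ~isnan(y);

% Fit a regression forest on the rows where the target is observed.
forest = TreeBagger(50, X(observed, :), y(observed), 'Method', 'regression');

% Replace the missing observations with the forest's predictions.
y(~observed) = predict(forest, X(~observed, :));

fprintf('Remaining missing values: %d\n', sum(isnan(y)));
```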
To contribute fixes, feature modifications or enhancements, create a pull request in the Pull requests tab of the project GitHub repository. When contributing to the software, the following should be included.
- A description of the change;
- Confirmation that all tests pass;
- New tests that cover the change.
Any feature request or issue can be submitted to the Issues tab of the project GitHub repository. When reporting issues with the software, the following should be included.
- A description of the problem;
- The error message;
- The MATLAB version and operating system.
If any support is needed, the author can be contacted by e-mail at [email protected].
Step 1. Download Standalone Desktop App/for_testing.zip and unzip it to a preferred location.
Step 2. Run "DataCleaningTool.exe" for testing.
DataCleaningTool is released under the GNU General Public License v3.0 (see LICENSE).