From 95657d223a0ff118bba249968eddc0e70ba30f61 Mon Sep 17 00:00:00 2001
From: George Chernishev
Date: Thu, 30 Nov 2023 02:03:41 +0300
Subject: [PATCH] Update README.md: more text clean-up, fixes for examples

---
 README.md | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index fc3f98f8cc..63b753bd3f 100644
--- a/README.md
+++ b/README.md
@@ -13,15 +13,16 @@ Desbordante is a high-performance data profiler that is capable of discovering a
 * Fuzzy algebraic constraints (discovery)
 * Association rules (discovery)
 
-The discovered patterns can have many uses, here are some examples:
+The discovered patterns can have many uses:
 * For scientific data, especially those obtained experimentally, an interesting pattern allows to formulate a hypothesis that could lead to a scientific discovery. In some cases it even allows to draw conclusions immediately, if there is enough data. At the very least, the found pattern can provide a direction for further study.
-* For business data it is also possible to obtain some kind of hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: clearing errors in data, finding and removing inexact duplicates, performing schema matching, and many more.
+* For business data it is also possible to obtain a hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: clearing errors in data, finding and removing inexact duplicates, performing schema matching, and many more.
 * For training data used in machine learning applications the found patterns can help in feature engineering and in choosing the direction for the ablation study.
+* For database data, found patterns can help with defining (recovering) primary and foreign keys, setting up (checking) all kinds of integrity constraints.
 
 Desbordante can be used via three interfaces:
-* **Console application.** This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and obtain results to the screen or into an output file.
-* **Python bindings.** Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving some particular real-life task. Relational data processing algorithms accept Pandas dataframes as input, allowing the user to conveniently preprocess data before mining patterns.
-* **Web interface.** There is a web interface that provides discovery and validation tasks with a rich interactive interface with visualization. However, currently it supports a limited number of patterns and should be considered more as an interactive demo.
+* **Console application.** This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file.
+* **Python bindings.** Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns.
+* **Web application.** There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo.
 
 A brief introduction into the tool and its use cases is presented [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian). Also, a list of various articles and guides can be found [here](https://desbordante.unidata-platform.ru/papers).
 
@@ -31,7 +32,7 @@ Usage examples:
 1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used.
 
 ```sh
---task=fd --table=../examples/datasets/university_fd.csv , True
+python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
 ```
 
 ```text
@@ -46,7 +47,7 @@ Usage examples:
 2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used.
 
 ```sh
---task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
+python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
 ```
 
 ```text
@@ -58,7 +59,7 @@ Usage examples:
 3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used.
 
 ```sh
---task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
+python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
 ```
 
 ```text
@@ -157,12 +158,12 @@ algo.execute()
 if algo.mfd_holds():
     print('MFD holds')
 else:
-    print('MFD not holds')
+    print('MFD does not hold')
 ```
 ```text
 >>> MFD holds
 ```
-4) Discover approximate functional dependencies with various error thresholds. Here, we showcase a preferred approach to configuring algorithm options. Furthermore, we are using a pandas dataframe to load data from a CSV file.
+4) Discover approximate functional dependencies with various error thresholds. Here, we showcase the preferred approach to configuring algorithm options. Furthermore, we are using a pandas DataFrame to load data from a CSV file.
 ```python-repl
 >>> import desbordante
 >>> import pandas as pd
@@ -185,7 +186,7 @@ else:
 
 ## Web interface
 
-While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically for interactive tasks. For pattern discovery/validation a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way. Furthermore, we manually select the most interesting Python scenarios and implement them inside our web version. Currently, only the typo detection scenario is implemented.
+While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically for interactive tasks. For pattern discovery/validation a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way. Furthermore, we select the most interesting Python scenarios and implement them in the web version. Currently, only the typo detection scenario is implemented.
 
 You can try the deployed web version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that due to a large demand various time and memory limits are enforced (and a task is killed if it goes outside the acceptable ranges). The source code of the web interface is kept in a separate [repo](https://github.com/vs9h/Desbordante).
 
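
Reviewer note: the MFD check in usage example 3 can be illustrated without Desbordante at all. The sketch below is plain Python and is not the library's implementation. It assumes that `--parameter` bounds the pairwise distance between right-hand-side values of rows that agree on the left-hand side, and that for a single numeric column the Euclidean distance reduces to an absolute difference. The function name `mfd_holds` is borrowed from the patch for readability only, and the rows are hypothetical toy data, not the contents of `theatres_mfd.csv`.

```python
from collections import defaultdict


def mfd_holds(rows, lhs_index, rhs_index, parameter):
    """Check a metric FD on one numeric RHS column (illustrative sketch).

    Rows that share the same LHS value must have RHS values that differ
    by at most `parameter`.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs_index]].append(float(row[rhs_index]))
    # In one dimension the Euclidean distance is |a - b|, so the largest
    # pairwise distance inside a group is simply max - min.
    return all(max(vals) - min(vals) <= parameter for vals in groups.values())


# Hypothetical toy table: (title, theatre, duration in minutes).
rows = [
    ("Don Quixote", "Theatre A", 139),
    ("Don Quixote", "Theatre B", 135),
    ("Swan Lake", "Theatre A", 170),
    ("Swan Lake", "Theatre C", 174),
]

print(mfd_holds(rows, lhs_index=0, rhs_index=2, parameter=5))  # True
print(mfd_holds(rows, lhs_index=0, rhs_index=2, parameter=3))  # False
```

With the threshold at 5 every title's durations stay within range, so the dependency holds. Tightening it to 3 makes both titles violate it, which is the kind of outcome the `MFD does not hold` branch in the patch reports.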