diff --git a/README.md b/README.md
index 568826d513..68799aefe6 100644
--- a/README.md
+++ b/README.md
@@ -5,49 +5,91 @@

-# General
+# What is Desbordante?
+
+Desbordante is a high-performance data profiler that can both *discover* and *validate* many different patterns in data. The patterns we are interested in mostly belong to the family of *functional dependencies*, but we also support many other pattern types, such as inclusion dependencies, unique column combinations, and association rules.
+
+## What is an (exact) functional dependency?
+
+In theory, a functional dependency (FD) describes a relationship between two sets of attributes in a relational database. Formally, a functional dependency A -> B, where A and B are attributes or sets of attributes of a table, holds if and only if every pair of rows that agree on A also agree on B. This means that the value of B can be unambiguously determined by the value of A.
+
+| A | B | C |
+|---|------|---|
+| 0 | *1* | 1 |
+| 1 | 2 | 1 |
+| 0 | *1* | 3 |
+
+A -> B holds: the values of B for A = 0 are the same
+
+| A | B | C |
+|---|------|---|
+| 0 | *1* | 1 |
+| 1 | 2 | 1 |
+| 0 | *3* | 3 |
+
+A -> B does not hold: the values of B for A = 0 are different
+
+This is the most basic pattern Desbordante can discover and validate, but there are many more.
+## What can Desbordante do?
+
+Desbordante does two key tasks: Discovery and Validation. **Discovery** is designed to identify all instances of a specified pattern *type* in a given dataset. For example, if you would like to know **all** exact FDs that hold in your dataset, you can run one of our algorithms and receive a list of them.
+
+In contrast, the **Validation** task is designed to check whether a specified pattern *instance* is present in a given dataset. This means that if you think that some dependency is present in your dataset, you can check whether your hypothesis is true. This task not only returns True or False, but it also explains why the instance you specified does not hold (e.g. it can list table rows with conflicting values).
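To make the functional dependency definition and the two example tables above concrete, here is a minimal sketch in plain Python. It is not Desbordante's implementation: the helper name `fd_holds` and the encoding of rows as dictionaries are assumptions made purely for illustration. The sketch groups rows by their left-hand-side values and checks that each group has a single right-hand-side value, which is what validating an exact FD amounts to.

```python
from collections import defaultdict


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows is a list of dicts; lhs and rhs are lists of column names.
    Hypothetical helper for illustration only, not part of Desbordante.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        seen[key].add(value)
    # The FD holds when every LHS value maps to exactly one RHS value.
    return all(len(values) == 1 for values in seen.values())


# The two example tables from the text above.
table_ok = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 1},
    {"A": 0, "B": 1, "C": 3},
]
table_bad = [
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 1},
    {"A": 0, "B": 3, "C": 3},
]

print(fd_holds(table_ok, ["A"], ["B"]))   # True
print(fd_holds(table_bad, ["A"], ["B"]))  # False
```

A real validator would also report which rows conflict; the sketch only returns the yes/no answer.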
+### Supported patterns
+
+| Category | Type | Discovery | Validation |
+| --- | --- | --- | --- |
+| **Functional Dependency Variants** | Exact functional dependencies [1] | ✅ | ✅ |
+| | Approximate functional dependencies with the g1 metric | ✅ | ✅ |
+| | Probabilistic functional dependencies, with `PerTuple` and `PerValue` metrics | ✅ | ❌ |
+| **Graph Functional Dependencies** | | ❌ | ✅ |
+| **Conditional Functional Dependencies** | | ✅ | ❌ |
+| **Inclusion Dependencies** | | ✅ | ❌ |
+| **Order Dependencies** | Set-based axiomatization | ✅ | ❌ |
+| | List-based axiomatization | ✅ | ❌ |
+| **Metric Functional Dependencies** | | ❌ | ✅ |
+| **Fuzzy Algebraic Constraints** | | ✅ | ❌ |
+| **Unique Column Combinations** | Exact unique column combination | ✅ | ✅ |
+| | Approximate unique column combination, with g1 metric | ✅ | ✅ |
+| **Association Rules** | | ✅ | ❌ |
+| **Differential Dependencies** | | Coming soon! | ❌ |
+
+
+### Potential uses
+- **Scientific Data**:
+    - Generate hypotheses for potential discoveries.
+    - Draw immediate conclusions with sufficient data.
+    - Guide further research directions.
+- **Business Data**:
+    - Clean data errors.
+    - Remove approximate duplicates.
+    - Match data schemas.
+    - Support various practical applications.
+- **Machine Learning**:
+    - Assist in feature engineering.
+    - Direct ablation studies.
+- **Database Management**:
+    - Identify primary and foreign keys.
+    - Establish and check integrity constraints.
+### Supported interfaces
+* **Console application:** Command-line access to essential profiling features with pattern specification, task and algorithm selection, and output to console or file.
+* **Python bindings:** Access Desbordante within Python for an expanded feature set over the console.
Create interactive programs and solve specific tasks using relational data algorithms that work with pandas DataFrames for easy data preparation. +* **Web application.** There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo. -Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. +### Documentation and guides +- [A brief introduction to the tool and its use cases (English)](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) +- [A brief introduction to the tool and its use cases (Russian)](https://habr.com/ru/company/unidata/blog/667636/) +- [List of various articles and guides](https://desbordante.unidata-platform.ru/papers) +- [Extensive list of tutorial examples for each supported pattern](https://github.com/Desbordante/desbordante-core/tree/main/examples) -The **Discovery** task is designed to identify all instances of a specified pattern *type* of a given dataset. -The **Validation** task is different: it is designed to check whether a specified pattern *instance* is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values). 
+## Console -The currently supported data patterns are: -* Functional dependency variants: - - Exact functional dependencies (discovery and validation) - - Approximate functional dependencies, with g1 metric (discovery and validation) - - Probabilistic functional dependencies, with PerTuple and PerValue metrics (discovery) -* Graph functional dependencies (validation) -* Conditional functional dependencies (discovery) -* Inclusion dependencies (discovery) -* Order dependencies: - - set-based axiomatization (discovery) - - list-based axiomatization (discovery) -* Metric functional dependencies (validation) -* Fuzzy algebraic constraints (discovery) -* Unique column combinations: - - Exact unique column combination (discovery and validation) - - Approximate unique column combination, with g1 metric (discovery and validation) -* Association rules (discovery) - -The discovered patterns can have many uses: -* For scientific data, especially those obtained experimentally, an interesting pattern allows to formulate a hypothesis that could lead to a scientific discovery. In some cases it even allows to draw conclusions immediately, if there is enough data. At the very least, the found pattern can provide a direction for further study. -* For business data it is also possible to obtain a hypothesis based on found patterns. However, there are more down-to-earth and more in-demand applications in this case: clearing errors in data, finding and removing inexact duplicates, performing schema matching, and many more. -* For training data used in machine learning applications the found patterns can help in feature engineering and in choosing the direction for the ablation study. -* For database data, found patterns can help with defining (recovering) primary and foreign keys, setting up (checking) all kinds of integrity constraints. 
- -Desbordante can be used via three interfaces: -* **Console application.** This is a classic command-line interface that aims to provide basic profiling functionality, i.e. discovery and validation of patterns. A user can specify pattern type, task type, algorithm, input file(s) and output results to the screen or into a file. -* **Python bindings.** Desbordante functionality can be accessed from within Python programs by employing the Desbordante Python library. This interface offers everything that is currently provided by the console version and allows advanced use, such as building interactive applications and designing scenarios for solving a particular real-life task. Relational data processing algorithms accept pandas DataFrames as input, allowing the user to conveniently preprocess the data before mining patterns. -* **Web application.** There is a web application that provides discovery and validation tasks with a rich interactive interface where results can be conveniently visualized. However, currently it supports a limited number of patterns and should be considered more as an interactive demo. +### Examples -A brief introduction to the tool and its use cases can be found [here](https://medium.com/@chernishev/exploratory-data-analysis-with-desbordante-4b97299cce07) (in English) and [here](https://habr.com/ru/company/unidata/blog/667636/) (in Russian). Next, a list of various articles and guides can be found [here](https://desbordante.unidata-platform.ru/papers). Finally, an extensive list of tutorial examples that cover each supported pattern is available [here](https://github.com/Desbordante/desbordante-core/tree/main/examples). +All tables in the examples are stored in .csv files and have a header row. -## Console - -Usage examples: -1) Discover all exact functional dependencies in a table stored in a comma-separated file with a header row. In this example the default FD discovery algorithm (HyFD) is used. 
+1) Discover all exact functional dependencies in a table with the default algorithm (HyFD). ```sh python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True @@ -63,8 +105,7 @@ python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True [Course Semester] -> Professor ``` -2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default AFD discovery algorithm (Pyro) is used. - +2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table with the default algorithm (Pyro). ```sh python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1 ``` @@ -75,7 +116,7 @@ python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True [ProductName] -> Price ``` -3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used. +3) Check whether a metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds with the default MFD validation algorithm (BRUTE). ```sh python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5 @@ -85,19 +126,24 @@ python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd True ``` -For more information consult documentation and help files. - +As always, you can find more information in the documentation and the help files. We are working hard to update them, but in the meanwhile, do not hesitate to ask questions! 
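To illustrate what the metric functional dependency check in example 3 means, here is a minimal sketch in plain Python. It assumes the usual reading of a metric FD: rows that agree on the left-hand side must have right-hand-side values within the given distance of each other. Treating the parameter as a bound on pairwise distance is an assumption about how the tool interprets it. The helper `mfd_holds` and the sample rows are hypothetical; this is not Desbordante's BRUTE implementation and the data is not taken from `theatres_mfd.csv`.

```python
from collections import defaultdict
from itertools import combinations


def mfd_holds(rows, lhs, rhs, radius):
    """Check a metric FD lhs -> rhs for a single numeric rhs column.

    Hypothetical helper for illustration only, not part of Desbordante.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    # For one numeric column the Euclidean distance is the absolute difference.
    return all(
        abs(a - b) <= radius
        for values in groups.values()
        for a, b in combinations(values, 2)
    )


# Hypothetical sample data.
rows = [
    {"Title": "Film A", "Duration": 120},
    {"Title": "Film A", "Duration": 123},
    {"Title": "Film B", "Duration": 95},
    {"Title": "Film B", "Duration": 104},
]

print(mfd_holds(rows, "Title", "Duration", 5))   # False: Film B durations differ by 9
print(mfd_holds(rows, "Title", "Duration", 10))  # True
```

The pairwise comparison is quadratic per group, which is fine for a sketch but not for large tables.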
## Python bindings -Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios: +Python is the go-to language of modern data science, which is why we offer Python bindings to the Desbordante C++ core library. These bindings allow you to do more than just basic tasks; they can also provide extra details, like explaining why a particular pattern does not hold. + +The library's main intended usage is to solve various data quality problems by constructing custom Python pipelines. To show the power of the library, we have implemented several demo scenarios: + 1) [Typo detection](https://colab.research.google.com/drive/1h5mQAIIxSb6Sgc_Ep8AYZlgt4BGXN6A9) 2) [Data deduplication](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing) 3) [Anomaly detection](https://colab.research.google.com/drive/1hgF8idXi1-U4ZOR0fAmdbfbhltgEJecR?usp=sharing) -[There is](https://desbordante.streamlit.app/) also an interactive demo for all of them, and all of these python scripts are [here](https://github.com/Desbordante/desbordante-core/tree/main/examples). The ideas behind them are briefly discussed in this [preprint](https://arxiv.org/abs/2307.14935) (Section 3). +[There is](https://desbordante.streamlit.app/) also an interactive Streamlit demo for all of them, and all of the scripts can be found [here](https://github.com/Desbordante/desbordante-core/tree/main/examples). 
The ideas behind them are briefly discussed in this [preprint](https://arxiv.org/abs/2307.14935) (Section 3). -Simple usage examples: -1) Discover all exact functional dependencies in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default FD discovery algorithm (HyFD) is used. +### Examples + +All tables in the examples are stored in .csv files and have a header row. + +1) Discover all exact functional dependencies in a table with the default algorithm (HyFD). ```python import desbordante @@ -112,6 +158,7 @@ print('FDs:') for fd in result: print(fd) ``` + ```text FDs: [Course Classroom] -> Professor @@ -123,7 +170,7 @@ FDs: [Course Semester] -> Professor ``` -2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the AFD discovery algorithm Pyro is used. +2) Discover all approximate functional dependencies with error less than or equal to 0.1 in a table with the default algorithm (Pyro). ```python import desbordante @@ -139,6 +186,7 @@ print('AFDs:') for fd in result: print(fd) ``` + ```text AFDs: [Id] -> Price @@ -146,7 +194,7 @@ AFDs: [ProductName] -> Price ``` -3) Check whether metric functional dependency “Title -> Duration” with radius 5 (using the Euclidean metric) holds in a table represented by a .csv file that uses a comma as the separator and has a header row. In this example the default MFD validation algorithm (BRUTE) is used. +3) Check whether a metric functional dependency `Title -> Duration` with radius 5 (using the Euclidean metric) holds with the default MFD validation algorithm (BRUTE). ```python import desbordante @@ -169,7 +217,9 @@ else: ```text MFD holds ``` -4) Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a CSV file. 
+ +4) Discover approximate functional dependencies with various error thresholds. Here, we are using a pandas DataFrame to load data from a .csv file. + ```python-repl >>> import desbordante >>> import pandas as pd @@ -192,11 +242,11 @@ MFD holds ## Web interface -While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically for interactive tasks. Such tasks typically involve multiple steps and require substantial user input on each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way. + Desbordante also offers a web interface specifically designed for interactive tasks which involve multiple steps and require substantial user input on each of them. Interactive tasks originate from full-fledged Python pipelines: we plan to identify the most interesting and useful ones and implement them in the web version. Currently, the only scenario implemented in this manner is typo detection. The web interface is also useful for pattern discovery and validation tasks: you can specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way. You can try the deployed web version [here](https://desbordante.unidata-platform.ru/). You have to register in order to process your own datasets. Keep in mind that due to high demand various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate [repo](https://github.com/Desbordante/desbordante-server-node). 
-## I still don't understand how to use Desbordante and patterns :( +# I still don't understand how to use Desbordante and patterns :( No worries! Desbordante offers a novel type of data profiling, which may require that you first familiarize yourself with its concepts and usage. The most challenging part of Desbordante are the primitives: their definitions and applications in practice. To help you get started, here’s a step-by-step guide: @@ -206,53 +256,50 @@ No worries! Desbordante offers a novel type of data profiling, which may require 4) Finally, do not hesitate to ask questions in the mailing list (link below) or create an issue. ### Papers about patterns - Here is a list of papers about patterns, organized in the recommended reading order in each item: - * Functional dependency variants: - Exact functional dependencies - - [Thorsten Papenbrock et al. 2015. Functional dependency discovery: an experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (June 2015), 1082–1093.](http://www.vldb.org/pvldb/vol8/p1082-papenbrock.pdf) - - [Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 821–833.](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2016_papenbrock_a.pdf) - - Approximate functional dependencies, with g1 metric - - [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf) + 1. [Thorsten Papenbrock et al. 2015. Functional dependency discovery: an experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (June 2015), 1082–1093.](http://www.vldb.org/pvldb/vol8/p1082-papenbrock.pdf) + 2. [Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. 
In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 821–833.](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2016_papenbrock_a.pdf) + - Approximate functional dependencies with g1 metric + 3. [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf) - Probabilistic functional dependencies, with PerTuple and PerValue metrics - - [Daisy Zhe Wang et al. Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. WebDB 2009](http://webdb09.cse.buffalo.edu/papers/Paper18/webdb09.pdf) - - [Daisy Zhe Wang et al. Discovering Functional Dependencies in Pay-As-You-Go Data Integration Systems. Tech Rep. UCB/EECS-2009-119.](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-119.pdf) + 4. [Daisy Zhe Wang et al. Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems. WebDB 2009](http://webdb09.cse.buffalo.edu/papers/Paper18/webdb09.pdf) + 5. [Daisy Zhe Wang et al. Discovering Functional Dependencies in Pay-As-You-Go Data Integration Systems. Tech Rep. UCB/EECS-2009-119.](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-119.pdf) * Graph functional dependencies - - [Wenfei Fan, Yinghui Wu, and Jingbo Xu. 2016. Functional Dependencies for Graphs. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). Association for Computing Machinery, New York, NY, USA, 1843–1857.](https://dl.acm.org/doi/pdf/10.1145/2882903.2915232) + 6. [Wenfei Fan, Yinghui Wu, and Jingbo Xu. 2016. Functional Dependencies for Graphs. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). 
Association for Computing Machinery, New York, NY, USA, 1843–1857.](https://dl.acm.org/doi/pdf/10.1145/2882903.2915232) * Conditional functional dependencies - - [Rammelaere, J., Geerts, F. (2019). Revisiting Conditional Functional Dependency Discovery: Splitting the “C” from the “FD”. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. ](https://link.springer.com/chapter/10.1007/978-3-030-10928-8_33) + 7. [Rammelaere, J., Geerts, F. (2019). Revisiting Conditional Functional Dependency Discovery: Splitting the “C” from the “FD”. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. ](https://link.springer.com/chapter/10.1007/978-3-030-10928-8_33) * Inclusion dependencies (discovery) - - [Falco Dürsch et al. 2019. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 219–228.](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2019_duersch_inclusion.pdf) - - [Sebastian Kruse, et al: Fast Approximate Discovery of Inclusion Dependencies. BTW 2017: 207-226](http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F4-10-47/paper_web.pdf) -* Order dependencies: - - [Jaroslaw Szlichta et al. 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (March 2017), 721–732.](http://www.vldb.org/pvldb/vol10/p721-szlichta.pdf) - - [Langer, P., Naumann, F. Efficient order dependency detection. The VLDB Journal 25, 223–241 (2016)](https://link.springer.com/article/10.1007/s00778-015-0412-3) + 8. [Falco Dürsch et al. 2019. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). 
Association for Computing Machinery, New York, NY, USA, 219–228.](https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/PDFs/2019_duersch_inclusion.pdf) + 9. [Sebastian Kruse, et al: Fast Approximate Discovery of Inclusion Dependencies. BTW 2017: 207-226](http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F4-10-47/paper_web.pdf) +* Order dependencies + 10. [Jaroslaw Szlichta et al. 2017. Effective and complete discovery of order dependencies via set-based axiomatization. Proc. VLDB Endow. 10, 7 (March 2017), 721–732.](http://www.vldb.org/pvldb/vol10/p721-szlichta.pdf) + 11. [Langer, P., Naumann, F. Efficient order dependency detection. The VLDB Journal 25, 223–241 (2016)](https://link.springer.com/article/10.1007/s00778-015-0412-3) * Metric functional dependencies - - [N. Koudas et al. "Metric Functional Dependencies," 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 2009, pp. 1275-1278.](https://ieeexplore.ieee.org/document/4812519) + 12. [N. Koudas et al. "Metric Functional Dependencies," 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 2009, pp. 1275-1278.](https://ieeexplore.ieee.org/document/4812519) * Fuzzy algebraic constraints - - [Paul G. Brown and Peter J. Hass. 2003. BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data. In Proceedings of the 29th international conference on Very large data bases - Volume 29 (VLDB '03), Vol. 29. VLDB Endowment, 668–679.](https://www.vldb.org/conf/2003/papers/S20P03.pdf) -* Unique column combinations: - - [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf) + 13. [Paul G. Brown and Peter J. Haas. 2003. BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data. In Proceedings of the 29th international conference on Very large data bases - Volume 29 (VLDB '03), Vol. 29. 
VLDB Endowment, 668–679.](https://www.vldb.org/conf/2003/papers/S20P03.pdf) +* Unique column combinations + 14. [Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (March 2018), 759–772.](https://www.vldb.org/pvldb/vol11/p759-kruse.pdf) * Association rules - - [Charu C. Aggarwal, Jiawei Han. 2014. Frequent Pattern Mining. Springer Cham. pp 471.](https://link.springer.com/book/10.1007/978-3-319-07821-2) + 15. [Charu C. Aggarwal, Jiawei Han. 2014. Frequent Pattern Mining. Springer Cham. pp 471.](https://link.springer.com/book/10.1007/978-3-319-07821-2) -## Installation (this is what you probably want if you are not a project maintainer) -Desbordante is [available](https://pypi.org/project/desbordante/) at the Python Package Index (PyPI). Dependencies: +## Installation -* Python >=3.7 +Desbordante is [available](https://pypi.org/project/desbordante/) at the Python Package Index (PyPI) for Python >=3.7. -To install Desbordante type: +Install Desbordante easily: ```sh $ pip install desbordante ``` -However, as Desbordante core uses C++, additional requirements on the machine are imposed. Therefore this installation option may not work for everyone. Currently, only manylinux2014 (Ubuntu 20.04+, or any other linux distribution with gcc 10+) is supported. If the above does not work for you consider building from sources. +However, as Desbordante core is implemented in C++, there are additional requirements for your machine. Currently, we only support `manylinux2014` (Ubuntu 20.04+, or any other Linux distribution with gcc 10+). If this does not work for you, consider building from source. 
## CLI installation -**NOTE**: Only Python 3.11+ is supported for CLI +**NOTE**: Only Python 3.11+ is supported for the CLI. Clone the repository, change the current directory to the project directory and run the following commands: @@ -263,17 +310,15 @@ python3 cli/cli.py --help ## Build instructions -### Ubuntu The following instructions were tested on Ubuntu 20.04+ LTS. ### Dependencies -Prior to cloning the repository and attempting to build the project, ensure that you have the following software: +Prior to cloning the repository and attempting to build the project, make sure that you have all of the dependencies installed: -- GNU g++ compiler, version 10+ -- CMake, version 3.13+ -- Boost library, version 1.74.0+ +- GNU g++ 10+ +- CMake 3.13+ +- Boost 1.74.0+ -To use test datasets you will need: -- Git Large File Storage, version 3.0.2+ +To use the test datasets, you will need Git Large File Storage 3.0.2+. ### Building the project #### Building the Python module using pip @@ -299,7 +344,7 @@ then build the tests themselves: ./build.sh -j$(nproc) ``` -The Python module can be built by providing the `--pybind` switch: +The Python module can be built with the `--pybind` switch: ```sh ./build.sh --pybind -j$(nproc) ``` @@ -332,7 +377,7 @@ We use [easyloggingpp](https://github.com/abumq/easyloggingpp) in order to log ( ## Troubleshooting ### Git LFS -If, when cloning the repo with git lfs installed, `git clone` produces the following (or similar) error: +If `git clone` produces the following (or similar) error when cloning the repo with Git LFS installed: ``` Cloning into 'Desbordante'... remote: Enumerating objects: 13440, done. 
@@ -351,14 +396,16 @@ GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:Mstrutov/Desbordante.git ``` ### No type hints in IDE -If type hints don't work for you in Visual Studio Code, for example, then install stubs using the command: +If type hints don't work for you in your editor of choice, then install stubs: ```sh -pip install desbordate-stubs +pip install desbordante-stubs ``` -**NOTE**: Stubs may not fully support current version of `desbordante` package, as they are updated independently. +**NOTE**: Stubs may not fully support the current version of the `desbordante` package, as they are updated independently. ## Cite + If you use this software for research, please cite one of our papers: + 1) George Chernishev, et al. Solving Data Quality Problems with Desbordante: a Demo. CoRR abs/2307.14935 (2023). 2) George Chernishev, et al. "Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)". CoRR abs/2301.05965. (2023). 3) M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev, "Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms," 2021 29th Conference of Open Innovations Association (FRUCT), 2021, pp. 344-354, doi: 10.23919/FRUCT52173.2021.9435469. @@ -366,4 +413,9 @@ If you use this software for research, please cite one of our papers: # Contacts and Q&A -If you have any questions regarding the tool usage you can ask it in our [google group](https://groups.google.com/g/desbordante). To contact dev team email George Chernishev, Maxim Strutovsky or Nikita Bobrov. +If you have any questions, don't hesitate to ask them in our [Google group](https://groups.google.com/g/desbordante)! If you would like to contact the dev team, you can email any of the following people using the addresses given in the papers: +- George Chernishev +- Maxim Strutovsky +- Nikita Bobrov +