docs(comparison): feature description

ydataai · Sep 25, 2022 · b5f3f40
1 parent 56f6f81

Showing 2 changed files with 57 additions and 6 deletions.
diff --git a/docsrc/source/index.rst b/docsrc/source/index.rst
@@ -3,31 +3,32 @@
 .. toctree::
    :maxdepth: 3
    :caption: Getting started
-   :hidden: 
+   :hidden:
 
    pages/getting_started/overview
    pages/getting_started/installation
    pages/getting_started/quickstart
    pages/getting_started/concepts
    pages/getting_started/examples
-   
+
 .. toctree::
    :maxdepth: 3
    :caption: Use cases
    :hidden:
 
    pages/use_cases/big_data
    pages/use_cases/sensitive_data
+   pages/use_cases/comparing_datasets
    pages/use_cases/metadata
    pages/use_cases/custom_report_appearance
-  
+
 
 .. toctree::
    :maxdepth: 3
    :caption: Integrations
    :hidden:
 
-   pages/integrations/other_dataframe_libraries  
+   pages/integrations/other_dataframe_libraries
    pages/integrations/great_expectations
    pages/integrations/data_apps
    pages/integrations/pipelines
@@ -57,7 +58,7 @@
    pages/support_contrib/help_troubleshoot
    pages/support_contrib/common_issues
    pages/support_contrib/contribution_guidelines
-   
+
 .. toctree::
    :maxdepth: 3
    :caption: Reference
@@ -68,4 +69,4 @@
    pages/reference/history
    pages/reference/announcements
    pages/reference/resources
-   
+
diff --git a/docsrc/source/pages/comparing_datasets.rst b/docsrc/source/pages/comparing_datasets.rst
@@ -0,0 +1,50 @@
+==================
+Dataset Comparison
+==================
+
+*This feature was introduced in pandas-profiling 3.4.*
+
+``pandas-profiling`` can be used to compare multiple versions of the same dataset.
+This is useful when comparing data from different time periods, such as two years.
+Another common scenario is comparing the profiles of the training, validation and test sets in machine learning.
+
+The following syntax can be used to compare two datasets:
+
+.. code-block:: python
+
+    import pandas as pd
+    from pandas_profiling import ProfileReport
+
+    train_df = pd.read_csv("train.csv")
+    train_report = ProfileReport(train_df, title="Train")
+    test_df = pd.read_csv("test.csv")
+    test_report = ProfileReport(test_df, title="Test")
+
+    comparison_report = train_report.compare(test_report)
+    comparison_report.to_file("comparison.html")
+
+The comparison report uses the ``title`` attribute from ``Settings`` to label each dataset throughout.
+The colors are configured in ``settings.html.style.primary_colors``.
+The numeric precision parameter ``settings.report.precision`` can be adjusted to free up some space in reports.
+
+
+In order to compare more than two reports, the following syntax can be used:
+
+.. code-block:: python
+
+    from pandas_profiling import ProfileReport, compare
+    # train_report, validation_report and test_report are existing ProfileReport objects
+    comparison_report = compare([train_report, validation_report, test_report])
+
+    # Obtain merged statistics
+    statistics = comparison_report.get_description()
+
+    # Save report to file
+    comparison_report.to_file("comparison.html")
+
+Note that generating reports for three or more datasets is not (yet) fully supported.
+The merged statistics can be obtained, but the rendered report may have formatting issues.
+
+.. pull-quote::
+
+    ⌛ Interested in uncovering more temporal patterns? Check out `popmon <https://github.com/ing-bank/popmon>`_.