Report comparisons #1069

sbrugman · 2022-09-25T22:21:04Z

This pull request introduces the report comparison functionality into pandas-profiling. Various other fixes/changes are included.

pandas-profiling can be used to compare multiple versions of the same dataset.
This is useful when comparing data from multiple time periods, such as two years.
Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.

The following syntax can be used to compare two datasets:

import pandas as pd
from pandas_profiling import ProfileReport

# Profile the training set
train_df = pd.read_csv("train.csv")
train_report = ProfileReport(train_df, title="Train")

# Profile the test set
test_df = pd.read_csv("test.csv")
test_report = ProfileReport(test_df, title="Test")

# Compare the two profiles and write the result to an HTML file
comparison_report = train_report.compare(test_report)
comparison_report.to_file("comparison.html")

Settings.html.style.primary_color is replaced by a sequence of colors, Settings.html.style.primary_colors. For backwards compatibility, the first element of the sequence is still accessible through the primary_color attribute, although it should be deprecated in the future. The comparison feature requires a change to the color configuration: previously there was no distinction between the parts specific to the report and the parts specific to a single dataset summary, and this distinction is now introduced. Simply using the first of the colors as the report color could be misleading (the first color, #377eb8, is close to the default report color, #337ab7). Future work could go into configuring the report color and choosing better defaults.
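For illustration, the backwards-compatible accessor described above could look roughly like the sketch below. The `Style` class is a hypothetical stand-in and not the library's actual settings class. Only `#377eb8` comes from this PR; the other two colors are placeholders. Emitting a `DeprecationWarning` is an assumption about how the desired future deprecation might be signalled.

```python
import warnings
from dataclasses import dataclass, field
from typing import List


@dataclass
class Style:
    # Hypothetical stand-in for Settings.html.style (not the library's class).
    # One color per compared dataset; only the first value is taken from the PR.
    primary_colors: List[str] = field(
        default_factory=lambda: ["#377eb8", "#e41a1c", "#4daf4a"]
    )

    @property
    def primary_color(self) -> str:
        """Backwards-compatible accessor: the first color of the sequence."""
        # Assumption: a warning nudges users towards the new attribute.
        warnings.warn(
            "primary_color is kept for compatibility; use primary_colors",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.primary_colors[0]
```

With this shape, existing code that reads `style.primary_color` keeps working, while comparison code can index `style.primary_colors` per dataset.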

Refactoring the summary data structure is partially out of scope for this PR (other design decisions for this have been considered before); see #1102.

codecov-commenter · 2022-10-02T00:33:04Z

Codecov Report

Base: 90.90% // Head: 90.44% // Decreases project coverage by -0.45% ⚠️

Coverage data is based on head (5d37756) compared to base (7b73e8a).
Patch coverage: 87.16% of modified lines in pull request are covered.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1069      +/-   ##
===========================================
- Coverage    90.90%   90.44%   -0.46%     
===========================================
  Files          178      181       +3     
  Lines         5090     5507     +417     
===========================================
+ Hits          4627     4981     +354     
- Misses         463      526      +63
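The two slightly different deltas in this report (-0.45% in the summary line, -0.46% in the diff table) are both consistent with the hit and line counts if percentages are truncated rather than rounded. That truncation rule is an assumption made here for illustration, not documented Codecov behaviour; the sketch only redoes the arithmetic from the numbers above.

```python
def pct_truncated(hits: int, lines: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals."""
    return (hits * 10_000 // lines) / 100


base = pct_truncated(4627, 5090)  # develop: 90.9
head = pct_truncated(4981, 5507)  # this PR: 90.44

# Difference of the unrounded percentages, about -0.4552 percentage points.
exact_delta = (4981 / 5507 - 4627 / 5090) * 100

# Difference of the two already-truncated percentages (diff table).
table_delta = round(head - base, 2)

# Unrounded difference truncated toward zero (summary line).
summary_delta = int(exact_delta * 100) / 100
```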

Flag	Coverage Δ
py3.8-ubuntu-latest-pandas	`90.44% <87.16%> (-0.46%)`	⬇️

Impacted Files	Coverage Δ
src/pandas_profiling/model/missing.py	`91.42% <ø> (+1.95%)`	⬆️
.../pandas_profiling/model/pandas/dataframe_pandas.py	`92.85% <ø> (-1.27%)`	⬇️
...ing/report/presentation/flavours/html/templates.py	`100.00% <ø> (ø)`
...iling/report/structure/variables/render_complex.py	`29.41% <ø> (ø)`
...ofiling/report/structure/variables/render_count.py	`33.33% <ø> (ø)`
...iling/report/structure/variables/render_generic.py	`100.00% <ø> (ø)`
...ofiling/report/structure/variables/render_image.py	`29.62% <ø> (ø)`
...rofiling/report/structure/variables/render_path.py	`31.57% <ø> (ø)`
...profiling/report/structure/variables/render_url.py	`100.00% <ø> (ø)`
tests/issues/test_issue169.py	`100.00% <ø> (ø)`
... and 36 more

akx

🙌 for small granular commits :)
There are a bunch of long if isinstance(..., tuple) / ... style branches which seem to partially duplicate code and as such will eventually cause drift between the branches. I suppose these are to adapt rendering for one report or more than one report when comparing.
- Similarly there are functions like _get_n for that, not to mention the @list_args decorator...
- This also repeats for the templates, with if alerts is mapping sort of branches that also have a slightly different formatting..?
- It would probably be nicer to not have those branches, but adapt the code everywhere to the general case of 1 or more reports.
- a7b4c11 wouldn't probably be required separately if that was the case.
Should there be a test for adfa846?
Should the (mechanical?) reindentation of the template(s) happen in a separate commit?
There are some commits that don't really look refactor: to me, such as cbfc018, 85ad2c5, d7c583c, 3c97aed, 4828c15, 6354e90, fe181ea (what does "namespace invariant type check" actually mean?)
54d24ea's commit message doesn't make much sense to me..? Should the new? behavior be documented?
To paraphrase @fabclmnt, can you follow the correct PR naming pattern?

src/pandas_profiling/compare_reports.py

src/pandas_profiling/report/structure/overview.py

src/pandas_profiling/report/structure/report.py

src/pandas_profiling/visualisation/plot.py

sbrugman · 2022-10-05T20:13:53Z

@akx Thanks for the thorough review! Addressed your comments in the last commit.

> There are a bunch of long if isinstance(..., tuple) / ... style branches which seem to partially duplicate code and as such will eventually cause drift between the branches. I suppose these are to adapt rendering for one report or more than one report when comparing.
>
> Similarly there are functions like _get_n for that, not to mention the @list_args decorator...
>
> This also repeats for the templates, with if alerts is mapping sort of branches that also have a slightly different formatting..?
>
> It would probably be nicer to not have those branches, but adapt the code everywhere to the general case of 1 or more reports.
>
> a7b4c11 wouldn't probably be required separately if that was the case.

Agreed - added a note to the PR. This is out of scope for the feature itself and requires a redesign of the data structure used for the dataset summaries.

> Should there be a test for adfa846?

Added a test case for this. One of the items on the wishlist is rewriting the test suites entirely.

Regarding the other comments: the commit history is not rewritten yet, this was pending the review of the code. Will rewrite once all feedback has been processed.

akx · 2022-10-07T10:46:33Z

> Agreed - added a note to the PR. This is out of scope for the feature itself and requires a redesign of the data structure used for the dataset summaries.

My two Euro cents (or more, accounting for inflation): I honestly think it would be better to do the work to make the data structures consistent now (or even in a separate PR this could then build upon), as opposed to adding special casing and "hacks" now and cleaning it up later.

sbrugman · 2022-10-09T21:34:34Z

@akx Agreed, the work needs to be done anyway. I've looked for a sensible split in the refactoring and am working on it - will request rereview when it's there.

To maintain backward compatibility (for now) and avoid blowing up this PR, each 'leaf' in the data structure 'tree' will be wrapped in a list. With that data structure, only one of the two branches (list/no list) is required, and most duplicate code can be removed. To not break backward compatibility for users that rely on report.description_set, we'll only convert the structure once the report is generated.
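A minimal sketch of the leaf-wrapping idea, assuming the summary is a plain nested dict. The function names and the example keys are hypothetical and do not reflect the actual layout of description_set or the code in this PR.

```python
from typing import Any


def wrap_leaves(node: Any) -> Any:
    """Return a copy of a nested dict in which every leaf is a one-element list."""
    if isinstance(node, dict):
        return {key: wrap_leaves(value) for key, value in node.items()}
    return [node]


def merge_wrapped(left: Any, right: Any) -> Any:
    """Concatenate the leaf lists of two wrapped trees.

    Assumes both trees share the keys of `left`; keys only in `right` are ignored.
    """
    if isinstance(left, dict):
        return {key: merge_wrapped(left[key], right[key]) for key in left}
    return left + right


# Hypothetical summaries for two datasets.
train = {"table": {"n": 100}, "variables": {"age": {"mean": 41.5}}}
test = {"table": {"n": 40}, "variables": {"age": {"mean": 39.0}}}

single = wrap_leaves(train)
compared = merge_wrapped(wrap_leaves(train), wrap_leaves(test))
```

Rendering code can then always iterate over a list at each leaf, whether it holds one value (a single report) or several (a comparison), so the list/no list branches collapse into one.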

The follow-up PR #1102 will take care of having a well-defined JSON schema fixing several issues.

aquemy · 2022-10-11T15:30:51Z

Hi @sbrugman,

I did a review and I did not find anything more than @akx already pointed out. It looks really neat!

However, I have a question, or rather a feature request.
One typical use case of a comparison report is to assess the data quality of two versions of the same dataset (e.g. after some data cleaning operations).
For this use case, it would be great to have a delta column to see at a glance the noticeable differences between the two versions.
So my question is: do you think this is easy to do within this PR, or would it be better to open another issue and discuss it further for a future iteration?

sbrugman · 2022-10-11T17:19:52Z

@aquemy That's a good point. Actually, in the first designs the delta was a central part. The challenge is providing a meaningful way to present the delta without discarding information on the absolute compared values. Juxtaposition is a sensible default: all values are there, and the user can easily see the comparison.

If you have concrete improvements on deltas in mind that increase clarity and are consistent with the layout, then they are of course welcome. However, this may be hard to know upfront, so we might discover them through research with users who are working with the basic version.
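To make the trade-off concrete, here is a rough sketch of juxtaposition with an optional delta: both absolute values are always kept, and a delta is computed only where it is meaningful (numeric statistics). The function and the statistic names are hypothetical and not part of this PR.

```python
from numbers import Number
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, object, object, Optional[float]]


def juxtapose(left: Dict[str, object], right: Dict[str, object]) -> List[Row]:
    """Pair each statistic of `left` with its value in `right`, plus an optional delta."""
    rows: List[Row] = []
    for name in left:  # only statistics present in the left report are shown
        a, b = left[name], right.get(name)
        both_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in (a, b)
        )
        delta = b - a if both_numeric else None
        rows.append((name, a, b, delta))
    return rows


# Hypothetical statistics before and after a cleaning step.
before = {"n_missing": 120, "mean": 41.5, "type": "Numeric"}
after = {"n_missing": 0, "mean": 40.0, "type": "Numeric"}
rows = juxtapose(before, after)
```

Non-numeric statistics get no delta, which illustrates why a delta alone cannot replace showing the compared values side by side.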

fabclmnt · 2022-10-11T22:25:14Z

> @aquemy That's a good point. Actually, in the first designs, the delta was a central part. The challenge is providing a meaningful way to present the delta without discarding information on the absolute compared values. Juxtaposition is a sensible default: all values are there, and the user can easily see the comparison.
>
> If you have concrete improvements on the delta in mind that increase clarity and are consistent with the layout, then they are of course welcome. However, this may be hard to know upfront, so we might discover them through research with users who are working with the basic version.

Although I do find this is relevant and important for the comparison report, I would prefer to keep the first iteration simpler - without the computed delta. My suggestion is to keep this in a separate feature request.

I see no changes needed. This version is already looking great! One small request, and as @akx mentioned, please update the PR message to feat: (description)

docsrc/source/pages/use_cases/comparing_datasets.rst

fabclmnt · 2022-10-11T22:43:18Z

docsrc/source/pages/use_cases/comparing_datasets.rst

+
+.. pull-quote::
+
+    ⌛ Interested in uncovering more temporal patterns? Check out `popmon <https://github.com/ing-bank/popmon>`_.


Can you please remove the reference to popmon from the comparison report documentation?

fabclmnt

Small documentation changes request.

all requested changes are done

* ci: check for flake8 comprehensions
* fix(config): configuration order is now respected
* fix: index is no longer automatically added to dataframe
* feat: correlation alerts show the name of the correlation
* fix: strip tags from the title of the web report
* feat: comparing two or more datasets (see docs)
* docs(comparison): feature description
* docs(readme): include reference to the dataset comparison use case
* refactor: config private attribute
* refactor: config update, exclude defaults
* refactor: include style attribute in timeseries code
* refactor: include style attribute in templates
* test(comparisons): add tests for report comparison
* refactor: overall correlation lowercase
* refactor: frequency table kwargs
* refactor: frequency table styling
* refactor: fixing renderable tests
* refactor: fixing renderable tests
* style: formatting
* refactor: senstive test
* refactor: pass style argument
* feat: check for empty dataframe
* refactor: namespace invariant type check
* refactor: ipywidgets fixes
* refactor: ipywidgets no comparison support yet
* refactor: process feedback
* fix: comparison bugs (#1137)
* fix: refactoring bugs
* fix: update protected var labels for comparison
* fix: add support to timeseries comparison
* fix: style changes for readability
* test: add simple run test
* fix: reword comparison report doc (#1136)
* fix: rewording
Co-authored-by: Aarni Koskela <[email protected]>
* feat: add comparison validations (#1143)
* feat: add comparison validations
* feat: replace missing plots to avoid dependencies' confilicts (#1148)
* feat: add new missing histogram plot
* feat: add new missing matrix plot
* feat: add new missing heatmap plot
* feat: remove dendrogram
* feat: ignore columns not present on the base report (#1150)
* feat: select only the left side of the comparison
* chore: pre-commit fixes
* fix: not intersection of columns
* [skip ci] Code formatting
* fix: missing plots columns order
* [skip ci] Code formatting
* fix: interactions/missing plot colors
* fix: code formatting
Co-authored-by: Aarni Koskela <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
Co-authored-by: alexbarros <[email protected]>

sbrugman force-pushed the feat/report-comparison branch 2 times, most recently from b5f3f40 to 71ac411 Compare September 25, 2022 22:31

sbrugman linked an issue Sep 25, 2022 that may be closed by this pull request

Report Comparison #292

Closed

sbrugman mentioned this pull request Sep 25, 2022

Add the Configuration Parameter include_index #870

Closed

fabclmnt self-requested a review September 28, 2022 21:16

sbrugman force-pushed the feat/report-comparison branch 3 times, most recently from 302da3a to 54d24ea Compare October 1, 2022 23:32

sbrugman changed the title ~~[WIP] Report comparisons~~ Report comparisons Oct 2, 2022

sbrugman marked this pull request as ready for review October 2, 2022 00:45

akx reviewed Oct 5, 2022

sbrugman force-pushed the feat/report-comparison branch from 74947cd to b077d34 Compare October 5, 2022 21:26

sbrugman requested a review from akx October 5, 2022 21:27

sbrugman force-pushed the feat/report-comparison branch from b077d34 to da26545 Compare October 9, 2022 22:23

aquemy requested review from aquemy and removed request for akx October 11, 2022 15:31

aquemy approved these changes Oct 11, 2022

fabclmnt reviewed Oct 11, 2022

docsrc/source/pages/use_cases/comparing_datasets.rst

fabclmnt reviewed Oct 11, 2022

docsrc/source/pages/use_cases/comparing_datasets.rst

fabclmnt reviewed Oct 11, 2022

docsrc/source/pages/use_cases/comparing_datasets.rst

fabclmnt reviewed Oct 11, 2022

fabclmnt previously requested changes Oct 11, 2022

fabclmnt requested a review from aquemy October 11, 2022 22:46

sbrugman and others added 10 commits November 18, 2022 07:48

feat: check for empty dataframe

a936df1

refactor: namespace invariant type check

b0acb2b

refactor: ipywidgets fixes

b2ccc9d

refactor: ipywidgets no comparison support yet

0fd0bb1

refactor: process feedback

51e8654

fix: comparison bugs (#1137)

f45c49d

* fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test

fix: reword comparison report doc (#1136)

7428b2e

* fix: rewording Co-authored-by: Aarni Koskela <[email protected]>

feat: add comparison validations (#1143)

e2641cf

* feat: add comparison validations

feat: replace missing plots to avoid dependencies' confilicts (#1148)

9981fc5

* feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram

feat: ignore columns not present on the base report (#1150)

b9f6665

* feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns

alexbarros force-pushed the feat/report-comparison branch from dfe0ce5 to b9f6665 Compare November 18, 2022 11:06

sbrugman and others added 6 commits November 18, 2022 11:06

Merge b9f6665 into 7b73e8a

012090e

[skip ci] Code formatting

0533d90

fix: missing plots columns order

9db761e

Merge 9db761e into 7b73e8a

66c412b

[skip ci] Code formatting

03ee7dd

fix: interactions/missing plot colors

0066658

fix: code formatting

5d37756

aquemy approved these changes Nov 18, 2022

alexbarros merged commit e551d25 into develop Nov 18, 2022

alexbarros deleted the feat/report-comparison branch November 18, 2022 15:31
