-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report comparisons #1069
Report comparisons #1069
Conversation
b5f3f40
to
71ac411
Compare
302da3a
to
54d24ea
Compare
Codecov ReportBase: 90.90% // Head: 90.44% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## develop #1069 +/- ##
===========================================
- Coverage 90.90% 90.44% -0.46%
===========================================
Files 178 181 +3
Lines 5090 5507 +417
===========================================
+ Hits 4627 4981 +354
- Misses 463 526 +63
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
🙌 for small granular commits :)
-
There are a bunch of long
if isinstance(..., tuple)
/ ... style branches which seem to partially duplicate code and as such will eventually cause drift between the branches. I suppose these are to adapt rendering for one report or more than one report when comparing.- Similarly there are functions like
_get_n
for that, not to mention the@list_args
decorator... - This also repeats for the templates, with
if alerts is mapping
sort of branches that also have a slightly different formatting..? - It would probably be nicer to not have those branches, but adapt the code everywhere to the general case of 1 or more reports.
- a7b4c11 wouldn't probably be required separately if that was the case.
- Similarly there are functions like
-
Should there be a test for adfa846?
-
Should the (mechanical?) reindentation of the template(s) happen in a separate commit?
-
There are some commits that don't really look
refactor:
to me, such as cbfc018, 85ad2c5, d7c583c, 3c97aed, 4828c15, 6354e90, fe181ea (what does "namespace invariant type check" actually mean?) -
54d24ea's commit message doesn't make much sense to me..? Should the new? behavior be documented?
-
To paraphrase @fabclmnt, can you follow the correct PR naming pattern?
@akx Thanks for the thorough review! Addressed your comments in the last commit.
Agreed - added a note to the PR. This is out of scope for the feature itself and requires a redesign of the data structure used for the dataset summaries.
Added a test case for this. One of the items on the wishlist is rewriting the test suits entirely. Regarding the other comments: the commit history is not rewritten yet, this was pending the review of the code. Will rewrite once all feedback has been processed. |
74947cd
to
b077d34
Compare
My two Euro cents (or more accounting for inflation) I honestly think it would be better to do the work to make the data structures consistent now (or even in a separate PR this could then build upon), as opposed to adding special casing and "hacks" now and cleaning it up later. |
@akx Agreed, the work needs to be done anyway. I've looked for a sensible split in the refactoring and am working on it - will request rereview when it's there. To maintain backward compatibility (for now) and not blowing up this PR, the data structure will wrap each 'leaf' in the data structure 'tree' with a list. With that data structure, only one of the two branches (list/no list) are required, and most duplicate code can be removed. To not break backward compatibility for users that rely on the The follow-up PR #1102 will take care of having a well-defined JSON schema fixing several issues. |
b077d34
to
da26545
Compare
Hi @sbrugman, I did a review and I did not find anything more than @akx already pointed out. It looks really neat! However, I had a question or like a feature request. |
@aquemy That's a good point. Actually, in the first designs the delta was a central part. The challenge is providing a meaningful way to present the delta, without discarding information on the absolute compared values. Juxtapositioning is a sensible default: all values are there, the user can easily see the comparison. In case you have concrete improvements on delta's in mind that increase clarity and are consistent with the layout then they are of course welcome. However this may be hard to know upfront, so we might discover them from user research that are working with the basic version. |
Although I do find this is relevant and important for the comparison report, I would prefer to keep the first iteration simpler - without the computed delta. My suggestion is to keep this in a separate I see no changes needed. This version is already looking great! One small request, and as @akx mentioned, please update the PR message to |
|
||
.. pull-quote:: | ||
|
||
⌛ Interested in uncovering more temporal patterns? Check out `popmon <https://github.com/ing-bank/popmon>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please remove the reference to popmon from the comparison report documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small documentation changes request.
* fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test
* fix: rewording Co-authored-by: Aarni Koskela <[email protected]>
* feat: add comparison validations
* feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram
* feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns
dfe0ce5
to
b9f6665
Compare
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
* ci: check for flake8 comprehensions * fix(config): configuration order is now respected * fix: index is no longer automatically added to dataframe * feat: correlation alerts show the name of the correlation * fix: strip tags from the title of the web report * feat: comparing two or more datasets (see docs) * docs(comparison): feature description * docs(readme): include reference to the dataset comparison use case * refactor: config private attribute * refactor: config update, exclude defaults * refactor: include style attribute in timeseries code * refactor: include style attribute in templates * test(comparisons): add tests for report comparison * refactor: overall correlation lowercase * refactor: frequency table kwargs * refactor: frequency table styling * refactor: fixing renderable tests * refactor: fixing renderable tests * style: formatting * refactor: senstive test * refactor: pass style argument * feat: check for empty dataframe * refactor: namespace invariant type check * refactor: ipywidgets fixes * refactor: ipywidgets no comparison support yet * refactor: process feedback * fix: comparison bugs (#1137) * fix: refactoring bugs * fix: update protected var labels for comparison * fix: add support to timeseries comparison * fix: style changes for readability * test: add simple run test * fix: reword comparison report doc (#1136) * fix: rewording Co-authored-by: Aarni Koskela <[email protected]> * feat: add comparison validations (#1143) * feat: add comparison validations * feat: replace missing plots to avoid dependencies' confilicts (#1148) * feat: add new missing histogram plot * feat: add new missing matrix plot * feat: add new missing heatmap plot * feat: remove dendrogram * feat: ignore columns not present on the base report (#1150) * feat: select only the left side of the comparison * chore: pre-commit fixes * fix: not intersection of columns * [skip ci] Code formatting * fix: missing plots columns order * [skip ci] Code formatting * fix: interactions/missing plot colors * fix: code formatting Co-authored-by: Aarni Koskela <[email protected]> Co-authored-by: Azory YData Bot <[email protected]> Co-authored-by: alexbarros <[email protected]>
This pull request introduces the report comparison functionality into
pandas-profiling
. Various other fixes/changes are included.pandas-profiling
can be used to compare multiple version of the same dataset.This is useful when comparing data from multiple time periods, such as two years.
Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.
The following syntax can be used to compare two datasets:
Settings.html.style.primary_color
is replaced by a sequence of colorsSettings.html.style.primary_colors
. For backwards compatibility, the first element of the sequence is still accessible through theprimary_color
attribute, however deprecation in the future is desired. The comparison feature requires to change the configuration of colors. Where previously there was no distinction between the parts specific to the report, and parts specific to a single dataset summary, this distinction is introduced now. It could be misleading to simply use the first of the colors as a report color (The first#377eb8
now is close to the default report color#337ab7
). Future work could go into configuring the report color and chosing better defaults.Refactoring the summary datastructure is partially out of scope for this PR. (For this there are other design decisions that have been considered before), see: #1102