Merge pull request #327 from BCG-Gamma/dev/1.1.2
 BUILD: release facet 1.1.2
j-ittner authored Feb 22, 2022
2 parents ad5e197 + fb5ba8f commit 095b35b
Showing 18 changed files with 187 additions and 588 deletions.
54 changes: 34 additions & 20 deletions README.rst
@@ -90,14 +90,14 @@ Enhanced Machine Learning Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To demonstrate the model inspection capability of FACET, we first create a
pipeline to fit a learner. In this simple example we use the
`diabetes dataset <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt>`__
pipeline to fit a learner. In this simple example we will use the
`diabetes dataset <https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data>`__
which contains age, sex, BMI and blood pressure along with 6 blood serum
measurements as features. A transformed version of this dataset is also available
on scikit-learn
measurements as features. This dataset was used in this
`publication <https://statweb.stanford.edu/~tibs/ftp/lars.pdf>`__.
A transformed version of this dataset is also available on scikit-learn
`here <https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset>`__.


In this quickstart we will train a Random Forest regressor using 10 repeated
5-fold CV to predict disease progression after one year. With the use of
*sklearndf* we can create a *pandas* DataFrame compatible workflow. However,
@@ -119,8 +119,22 @@ hyperparameter configurations and even multiple learners with the `LearnerRanker
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid
# load the diabetes dataset
diabetes_df = pd.read_csv('diabetes_quickstart.csv')
# declaring url with data
data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
#importing data from url
diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
# renaming columns for better readability
columns={
'S1': 'TC', # total serum cholesterol
'S2': 'LDL', # low-density lipoproteins
'S3': 'HDL', # high-density lipoproteins
'S4': 'TCH', # total cholesterol/ HDL
'S5': 'LTG', # possibly log of serum triglycerides level
'S6': 'GLU', # blood sugar level
'Y': 'Disease_progression' # measure of progress since 1yr of baseline
}
)
# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")
@@ -236,10 +250,10 @@ The key global metrics for each pair of features in a model are:

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (log of serum triglycerides)
there is relatively minimal synergy (≤1%) with other features in the model.
However, looking down the column for `LTG` (i.e., perspective of other features
in a pair with `LTG`) we find many features (the rows) are synergistic (up to 27%)
with `LTG`. We can conclude that:
there is hardly any synergy with other features in the model (≤ 1%).
However, looking down the column for `LTG` (i.e., from the perspective of other features
relative to `LTG`) we find that many features (the rows) are aided by synergy
with `LTG` (up to 27% in the case of `LDL`). We conclude that:

- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
@@ -248,7 +262,7 @@ with `LTG`. We can conclude that:

High synergy between pairs of features must be considered carefully when investigating
impact, as the values of both features jointly determine the outcome. It would not make
much sense to consider `TC` (T-Cells) without the context provided by `LDL` given close
much sense to consider `LDL` without the context provided by `LTG` given close
to 27% synergy of `LDL` with `LTG` for predicting progression after one year.

**Redundancy**
@@ -267,12 +281,12 @@ For any feature pair (A, B), the first feature (A) is the row, and the second fe
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (Low-Density Lipoproteins), then we look-up the row for `LDL`
and the column for `TC` and find 38% redundancy. This means that 38% of the information
in `LDL` is duplicated with `TC` to predict disease progression after one year. This
in `LDL` to predict disease progression is duplicated in `TC`. This
redundancy is the same when looking "from the perspective" of `TC` for (`TC`, `LDL`),
but need not be symmetrical in all cases (see `LTG` vs. `TSH`).
but need not be symmetrical in all cases (see `LTG` vs. `TCH`).

If we look at `TSH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
the same does not hold between `LTG` and `HDL` – meaning `TSH` shares different
If we look at `TCH`, it has between 22–32% redundancy each with `LTG` and `HDL`, but
the same does not hold between `LTG` and `HDL` – meaning `TCH` shares different
information with each of the two features.
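
The row/column reading of the redundancy matrix described above can be sketched in plain Python. This is only an illustration, not FACET's API: the nested dictionary, the helper names `lookup` and `is_symmetric`, and the two `TCH`/`LTG` values are assumptions made for the example. Only the 38% figure for (`LDL`, `TC`) comes from the text.

```python
# redundancy[A][B]: share of the information in feature A (the row)
# that is duplicated in feature B (the column).
redundancy = {
    "LDL": {"TC": 0.38},   # value quoted in the README text
    "TC": {"LDL": 0.38},   # the README states this pair is symmetrical
    "TCH": {"LTG": 0.30},  # ILLUSTRATIVE placeholder, not a measured value
    "LTG": {"TCH": 0.10},  # ILLUSTRATIVE placeholder, not a measured value
}


def lookup(matrix, row_feature, column_feature):
    """Return the redundancy of row_feature relative to column_feature."""
    return matrix[row_feature][column_feature]


def is_symmetric(matrix, a, b, tolerance=1e-9):
    """Check whether the pair (a, b) reads the same from both perspectives."""
    return abs(lookup(matrix, a, b) - lookup(matrix, b, a)) <= tolerance


print(lookup(redundancy, "LDL", "TC"))          # perspective of LDL
print(is_symmetric(redundancy, "LDL", "TC"))    # same from both sides
print(is_symmetric(redundancy, "LTG", "TCH"))   # differs in this sketch
```

Keeping both directions as separate entries makes the asymmetry explicit: the pair (A, B) and the pair (B, A) are distinct cells and must be looked up separately.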


@@ -302,9 +316,9 @@ Let's look at the example for redundancy.
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
and (`HDL`, `TSH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
and (`HDL`, `TCH`) each represent a cluster in the dendrogram and that `LTG` and `BMI`
have the highest importance. As potential next actions we could explore the impact of
removing `TSH`, and one of `TC` or `LDL` to further simplify the model and obtain a
removing `TCH`, and one of `TC` or `LDL` to further simplify the model and obtain a
reduced set of independent features.

Please see the
@@ -369,7 +383,7 @@ quantify the uncertainty by using bootstrap confidence intervals.
.. image:: sphinx/source/_static/simulation_output.png

We would conclude from the figure that higher values of `BMI` are associated with
an increase in disease progression after one year, and that for a `BMI` of 29
an increase in disease progression after one year, and that for a `BMI` of 28
and above, there is a significant increase in disease progression after one year
of at least 26 points.

@@ -447,7 +461,7 @@ or have a look at
.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-Gamma.facet?repoName=BCG-Gamma%2Ffacet&branchName=develop
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/develop.svg
.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/7/1.1.x
:target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=7&_a=summary

.. |python_versions| image:: https://img.shields.io/badge/python-3.6|3.7|3.8-blue.svg
13 changes: 12 additions & 1 deletion RELEASE_NOTES.rst
@@ -7,6 +7,17 @@ FACET 1.1
FACET 1.1 refines and enhances the association/synergy/redundancy calculations provided
by the :class:`.LearnerInspector`.


1.1.2
~~~~~

- DOC: use a downloadable dataset in the `getting started` notebook
- FIX: import :mod:`catboost` if present, else create a local module mockup
- FIX: correctly identify if ``sample_weights`` is undefined when re-fitting a model
on the full dataset in a :class:`.LearnerCrossfit`
- BUILD: relax package dependencies to support any `numpy` version 1.`x` from 1.16


1.1.1
~~~~~

@@ -26,7 +37,7 @@ by the :class:`.LearnerInspector`.
model in a crossfit, then returns the mean of all resulting matrices. This leads to a
slight increase in accuracy, and also allows us to calculate the standard deviation
across matrices as an indication of confidence for each calculated value.
- API: Method :meth:`.LernerInspector.shap_plot_data` now returns SHAP values for the
- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
- API: Increase efficiency of :class:`.LearnerRanker` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`
22 changes: 13 additions & 9 deletions azure-pipelines.yml
@@ -6,6 +6,9 @@ pr:
- 1.1.x
- release/*

pool:
vmImage: 'Ubuntu-latest'

# set the build name
name: $[ variables['branchName'] ]

@@ -30,8 +33,8 @@ variables:
branchName: $[ replace(variables['Build.SourceBranch'], 'refs/heads/', '') ]
${{ if startsWith(variables['Build.SourceBranch'], 'refs/pull/') }}:
branchName: $[ replace(variables['System.PullRequest.SourceBranch'], 'refs/heads/', '') ]
source_is_release_branch: $[ startsWith(variables['branchName'], 'release') ]
source_is_develop_branch: $[ or(startsWith(variables['branchName'], 'develop'), startsWith(variables['branchName'], 'dev/')) ]
source_is_release_branch: $[ startsWith(variables['branchName'], 'release/') ]
source_is_develop_branch: $[ startsWith(variables['branchName'], 'dev/') ]
is_scheduled: $[ eq(variables['Build.Reason'], 'Schedule') ]
project_name: facet
project_root: $(project_name)
@@ -97,7 +100,7 @@ stages:
cd $(System.DefaultWorkingDirectory)
files_changed=$(git diff $(Build.SourceVersion)^! --name-only)
echo "Files changed since last commit: ${files_changed}"
n_files_changed=$(git diff $(Build.SourceVersion)^! --name-only | grep -i -E 'meta.yaml|pyproject.toml|azure-pipelines.yml|tox.ini' | wc -l | xargs)
n_files_changed=$(git diff $(Build.SourceVersion)^! --name-only | grep -i -E 'meta\.yaml|pyproject\.toml|azure-pipelines\.yml|tox\.ini|make\.py' | wc -l | xargs)
if [ ${n_files_changed} -gt 0 ]
then
build_changed=1
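
The escaping in the updated `grep -E` pattern matters because an unescaped `.` matches any single character. A minimal Python sketch of the same filter follows. The helper name `count_build_files` and the list of changed paths are hypothetical and serve only to show the difference between the two patterns.

```python
import re

# Pattern from the updated pipeline step: '\.' matches only a literal dot.
ESCAPED = re.compile(
    r"meta\.yaml|pyproject\.toml|azure-pipelines\.yml|tox\.ini|make\.py",
    re.IGNORECASE,
)
# Pattern from the previous version: '.' matches any single character.
UNESCAPED = re.compile(
    r"meta.yaml|pyproject.toml|azure-pipelines.yml|tox.ini",
    re.IGNORECASE,
)


def count_build_files(paths, pattern):
    """Count paths containing a match, similar to `grep -i -E ... | wc -l`."""
    return sum(1 for path in paths if pattern.search(path))


# Hypothetical output of `git diff --name-only`.
changed = ["README.rst", "notes/tox_ini.md", "pyproject.toml", "sphinx/make.py"]

# The unescaped pattern wrongly flags 'notes/tox_ini.md' and misses 'make.py'.
print([p for p in changed if UNESCAPED.search(p)])
# The escaped pattern flags only the real build files.
print([p for p in changed if ESCAPED.search(p)])
print(count_build_files(changed, ESCAPED))
```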
@@ -210,7 +213,7 @@ stages:
- script: dir $(Build.SourcesDirectory)

- script: |
conda install -y -c anaconda conda-build~=3.20.5 conda-verify toml=0.10.* flit=3.0.*
conda install -y -c anaconda conda-build~=3.21 conda-verify toml=0.10.* flit=3.0.* packaging~=20.9
displayName: 'Install conda-build, flit, toml'
condition: eq(variables['BUILD_SYSTEM'], 'conda')
@@ -297,7 +300,7 @@ stages:
- script: dir $(Build.SourcesDirectory)

- script: |
conda install -y -c anaconda conda-build~=3.20.5 conda-verify toml=0.10.* flit=3.0.*
conda install -y -c anaconda conda-build~=3.21 conda-verify toml=0.10.* flit=3.0.* packaging~=20.9
displayName: 'Install conda-build, flit, toml'
condition: eq(variables['BUILD_SYSTEM'], 'conda')
@@ -398,7 +401,7 @@ stages:
condition: ne(variables.branchName, 'develop')
script: |
set -eux
python -m pip install "toml==0.10.*"
python -m pip install toml~=0.10.2 packaging~=20.9
cd $(System.DefaultWorkingDirectory)/pytools
python <<EOF
from os import environ
@@ -461,13 +464,14 @@ stages:
script: |
set -eux
echo "Getting version"
pip install packaging
pip install packaging~=20.9
cd $(System.DefaultWorkingDirectory)/$(project_root)/src
export PYTHONPATH=$(System.DefaultWorkingDirectory)/pytools/sphinx/base
version=$(python -c "import make_base; print(make_base.get_package_version())")
echo "Current version: $version"
echo "Detecting pre-release ('rc' in version)"
echo "Detecting pre-release ('dev' or 'rc' in version)"
prerelease=False
[[ $version == *dev* ]] && prerelease=True && echo "Development release identified"
[[ $version == *rc* ]] && prerelease=True && echo "Pre-release identified"
echo "##vso[task.setvariable variable=current_version]$version"
echo "##vso[task.setvariable variable=is_prerelease]$prerelease"
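
The pre-release detection above is a plain substring test on the version string. A minimal Python sketch of the same logic, assuming a hypothetical helper named `is_prerelease` and example version strings chosen for illustration:

```python
def is_prerelease(version: str) -> bool:
    """Mirror the shell tests `[[ $version == *dev* ]]` and `[[ $version == *rc* ]]`."""
    return "dev" in version or "rc" in version


# Example version strings (illustrative only).
for candidate in ("1.1.2", "1.1.2rc0", "1.2.0.dev1"):
    print(candidate, is_prerelease(candidate))
```

Like the shell version, this treats any occurrence of `dev` or `rc` as a marker, so it relies on version strings that contain those letters only in pre-release suffixes.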
@@ -541,7 +545,7 @@ stages:
conda install -c conda-forge -c bcg_gamma $(package_name)
isDraft: false
isPreRelease: $(is_prerelease)
isPrerelease: $(is_prerelease)
assets: |
$(System.ArtifactsDirectory)/tox_default/tox/$(package_name)-*.tar.gz
$(System.ArtifactsDirectory)/conda_default/conda/noarch/$(package_name)-*.tar.bz2
18 changes: 12 additions & 6 deletions condabuild/meta.yaml
@@ -12,22 +12,19 @@ build:
requirements:
host:
- flit>=3.0.*
- numpy {{ environ.get('FACET_V_NUMPY', '>=1.11.*') }}
- pip>=20.*
- python=3.8.*
- numpy {{ environ.get('FACET_V_NUMPY', '>=1.11') }}
- pip>=20
- python=3.8
run:
- gamma-pytools {{ environ.get('FACET_V_GAMMA_PYTOOLS') }}
- lightgbm {{ environ.get('FACET_V_LIGHTGBM') }}
- matplotlib {{ environ.get('FACET_V_MATPLOTLIB') }}
- numpy {{ environ.get('FACET_V_NUMPY') }}
- packaging {{ environ.get('FACET_V_PACKAGING') }}
- pandas {{ environ.get('FACET_V_PANDAS') }}
- python {{ environ.get('FACET_V_PYTHON') }}
- scipy {{ environ.get('FACET_V_SCIPY') }}
- shap {{ environ.get('FACET_V_SHAP') }}
- scikit-learn {{ environ.get('FACET_V_SCIKIT_LEARN') }}
- sklearndf {{ environ.get('FACET_V_SKLEARNDF') }}
- typing_inspect {{ environ.get('FACET_V_TYPING_INSPECT') }}

test:
imports:
@@ -39,6 +36,15 @@ test:
- facet.simulation
requires:
- pytest=5.2.*
# additional requirements of sklearndf
- boruta_py {{ environ.get('FACET_V_BORUTA') }}
- lightgbm {{ environ.get('FACET_V_LIGHTGBM') }}
- scikit-learn {{ environ.get('FACET_V_SCIKIT_LEARN') }}
# additional requirements of gamma-pytools
- ipython {{ environ.get('FACET_V_IPYTHON') }}
- joblib {{ environ.get('FACET_V_JOBLIB') }}
# additional requirements of shap
- typing_inspect {{ environ.get('FACET_V_TYPING_INSPECT') }}
commands:
- conda list
- python -c 'import facet;
42 changes: 20 additions & 22 deletions environment.yml
@@ -6,43 +6,41 @@ dependencies:
# run
- boruta_py ~= 0.3
- gamma-pytools ~= 1.1
- joblib = 1.*
- joblib ~= 1.1
- lightgbm ~= 3.2
- matplotlib ~= 3.3
- numpy ~= 1.16
- numpy ~= 1.22
- pandas ~= 1.2
- python ~= 3.8
- scikit-learn ~= 0.23.1
- scipy = 1.5.*
- scipy ~= 1.5.3
- shap ~= 0.39.0
- sklearndf ~= 1.1
# build/test
- black = 20.8b1
- conda-build ~= 3.20
- conda-verify ~= 3.1
- docutils = 0.16.*
- flit = 3.0.*
- conda-build ~= 3.21.8
- conda-verify ~= 3.1.1
- docutils ~= 0.16.0
- flit ~= 3.0.0
- isort ~= 5.5
- jinja2 ~= 2.11
- m2r = 0.2.*
- markupsafe < 2.1 # markupsafe 2.1 breaks support for jinja2
- m2r ~= 0.2.0
- pluggy ~= 0.13
- pre-commit ~= 2.7
- pydata-sphinx-theme = 0.4.*
- pydata-sphinx-theme ~= 0.4.0
- pytest ~= 5.2
- pytest-cov ~= 2.8
- pyyaml ~= 5.1
- sphinx = 3.4.*
- sphinx-autodoc-typehints = 1.11.*
- toml = 0.10.*
- tox = 3.20.*
- sphinx ~= 3.4.0
- sphinx-autodoc-typehints ~= 1.11.0
- toml ~= 0.10.0
- tox ~= 3.20.0
- yaml ~= 0.2
# notebooks
- jupyterlab = 3.*
- jupyterlab ~= 3.1
- nbclassic ~= 0.2.8
- nbsphinx = 0.7.*
- openpyxl = 3.*
- seaborn = 0.11.*
- tableone = 0.7.*
# pip
- pip >= 20
- pip:
- shap >=0.34,<0.40
- nbsphinx ~= 0.7.0
- openpyxl ~= 3.0
- seaborn ~= 0.11.0
- tableone ~= 0.7.0
26 changes: 12 additions & 14 deletions pyproject.toml
@@ -15,16 +15,14 @@ license = "Apache Software License v2.0"

requires = [
# direct requirements of gamma-facet
"gamma-pytools ~=1.1,>=1.1.2",
"gamma-pytools ~=1.1.2",
"matplotlib ~=3.0",
"numpy >=1.16,<1.21a",
"numpy >=1.16,<2a",
"packaging ~=20.0",
"pandas >=0.24,<2a",
"scipy ~=1.2",
"shap >=0.34,<0.40a",
"sklearndf ~=1.1",
# additional requirements of shap 0.38
"ipython >=7",
"sklearndf ~=1.1.0",
]
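
Several specifiers in this hunk use the compatible-release operator `~=`, which in PEP 440 means "at least this version, with all but the last stated component fixed". The sketch below is a simplified illustration for purely numeric versions, not a replacement for a real specifier parser. The helper names `parse` and `compatible` are made up for the example.

```python
def parse(version):
    """Split a purely numeric version such as '1.1.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible(candidate, spec):
    """Return True if `candidate` satisfies `~=spec` (numeric versions only)."""
    spec_parts = parse(spec)
    if len(spec_parts) < 2:
        raise ValueError("'~=' needs at least two version components")
    width = max(len(spec_parts), len(parse(candidate)))
    # Pad with zeros so that '1.1' and '1.1.0' compare as equal.
    cand = parse(candidate) + (0,) * (width - len(parse(candidate)))
    floor = spec_parts + (0,) * (width - len(spec_parts))
    prefix = spec_parts[:-1]  # every component except the last must match
    return cand[: len(prefix)] == prefix and cand >= floor


# '~=1.1' admits any 1.x from 1.1 upwards; '~=1.1.2' admits only 1.1.x from 1.1.2.
print(compatible("1.2.0", "1.1"))
print(compatible("1.2.0", "1.1.2"))
print(compatible("1.1.6", "1.1.2"))
```

Read this way, replacing `~=1.1,>=1.1.2` with `~=1.1.2` narrows the accepted range: a later 1.2 release would satisfy the former but not the latter.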

requires-python = ">=3.6,<4a"
@@ -71,15 +69,15 @@ Repository = "https://github.com/BCG-Gamma/facet"

[build.matrix.min]
# direct requirements of gamma-facet
gamma-pytools = "~=1.1.2"
gamma-pytools = "~=1.1.6"
matplotlib = "~=3.0.3"
numpy = ">=1.16.6,<17a"
numpy = "==1.16.6"
packaging = "~=20.9"
pandas = "~=0.24.2"
python = "~=3.6.13"
python = "~=3.6.15"
scipy = "~=1.2.1"
shap = "~=0.34.0"
sklearndf = "~=1.1.0"
sklearndf = "~=1.1.3"
# additional minimum requirements of sklearndf
boruta = "~=0.3.0"
lightgbm = "~=3.0.0"
@@ -88,19 +86,19 @@ scikit-learn = "~=0.21.3"
joblib = "~=0.14.1"
typing_inspect = "~=0.4.0"
# additional minimum requirements of shap
ipython = "~=7.0"
ipython = "==7.0"

[build.matrix.max]
# direct requirements of gamma-facet
gamma-pytools = "~=1.1,>=1.1.4"
gamma-pytools = "~=1.1.6"
matplotlib = "~=3.3"
numpy = ">=1.20,<2a"
numpy = ">=1.22,<2a"
packaging = "~=20.9"
pandas = "~=1.2"
pandas = "~=1.4"
python = "~=3.8"
scipy = "~=1.5.3"
shap = "~=0.39.0"
sklearndf = "~=1.1"
sklearndf = "~=1.1.3"
# additional maximum requirements of sklearndf
boruta = "~=0.3"
lightgbm = "~=3.2"