chore(examples): dataset compare examples (#1167)
* build(deps): bump cycjimmy/semantic-release-action from 2 to 3 (#1154)
* chore(actions): disable lint when prs come from dependabot (#1164)
* chore(actions): fix push and latest tag configs (#1166)
* docs(changelogs): fix changelog format (#1163)
* chore: move example files and add new hcc example

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Vasco Ramos <[email protected]>
1 parent 40a62b8, commit 47d3aeb
Showing 11 changed files with 234 additions and 137 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,8 +25,8 @@ jobs: | |
uses: oprypin/[email protected] | ||
with: | ||
repository: ${{ github.repository }} | ||
-   regex: '^\d+\.\d+\.\d+' | ||
-   releases-only: false | ||
+   regex: '^v\d+\.\d+\.\d+' | ||
+   releases-only: true | ||
|
||
- name: Extract semantic version | ||
id: semantic | ||
|
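Review note: the effect of the `regex` change in this hunk can be checked with a short sketch. It assumes the action's pattern matching behaves like Python's `re` for these two patterns, and the tag names are hypothetical, not taken from this repository.

```python
import re

# Patterns copied from the workflow hunk above.
OLD_PATTERN = re.compile(r"^\d+\.\d+\.\d+")
NEW_PATTERN = re.compile(r"^v\d+\.\d+\.\d+")

# Hypothetical tag names, for illustration only.
tags = ["v3.5.0", "3.5.0", "v4.0.0-rc1", "latest"]

# Keep the tags that each pattern accepts.
old_matches = [tag for tag in tags if OLD_PATTERN.search(tag)]
new_matches = [tag for tag in tags if NEW_PATTERN.search(tag)]

print(old_matches)  # ['3.5.0']
print(new_matches)  # ['v3.5.0', 'v4.0.0-rc1']
```

Neither pattern is anchored at the end, so a suffixed tag such as `v4.0.0-rc1` still matches the new pattern under these assumptions.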
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,181 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Pandas Profiling: HCC Dataset\n", | ||
"Source of data: https://www.kaggle.com/datasets/mrsantos/hcc-dataset\n", | ||
"\n", | ||
"As modifications have been introduced for the purpose of this use case, the .csv file is provided (hcc.csv)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Import libraries" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"\n", | ||
"from pandas_profiling import ProfileReport" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load the dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Read the HCC Dataset\n", | ||
"df = pd.read_csv(\"hcc.csv\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Produce and save the profiling report" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"original_report = ProfileReport(df, title=\"Original Data\")\n", | ||
"original_report.to_file(\"original_report.html\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Analysis of \"Alerts\"\n", | ||
"Pandas Profiling raises alerts for 4 potential data quality problems:\n", | ||
"\n", | ||
"- `DUPLICATES`: 4 duplicate rows in data\n", | ||
"- `CONSTANT`: Constant value “999” in ‘O2’\n", | ||
"- `HIGH CORRELATION`: Several features marked as highly correlated\n", | ||
"- `MISSING`: Missing Values in ‘Ferritin’\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Removing Duplicate Rows" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Drop duplicate rows\n", | ||
"df_transformed = df.copy()\n", | ||
"df_transformed = df_transformed.drop_duplicates()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Removing Irrelevant Features" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Remove O2\n", | ||
"df_transformed = df_transformed.drop(columns=\"O2\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Missing Data Imputation" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Impute Missing Values\n", | ||
"from sklearn.impute import SimpleImputer\n", | ||
"\n", | ||
"mean_imputer = SimpleImputer(strategy=\"mean\")\n", | ||
"df_transformed[\"Ferritin\"] = mean_imputer.fit_transform(\n", | ||
" df_transformed[\"Ferritin\"].values.reshape(-1, 1)\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Produce Comparison Report" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"transformed_report = ProfileReport(df_transformed, title=\"Transformed Data\")\n", | ||
"comparison_report = original_report.compare(transformed_report)\n", | ||
"comparison_report.to_file(\"original_vs_transformed.html\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3.10.8 ('feat-comp')", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.8" | ||
}, | ||
"vscode": { | ||
"interpreter": { | ||
"hash": "13390b9b50dde76c6c011e02183633aae7d8498993a6e6577a16e1b7cb8c7a8c" | ||
} | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
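Review note: three of the alerts listed in the notebook's "Analysis of Alerts" cell (`DUPLICATES`, `CONSTANT`, `MISSING`) can be cross-checked by hand with plain pandas. The sketch below uses a made-up toy frame standing in for `hcc.csv`. Only the column names `O2` and `Ferritin` come from the notebook; `Age` and all values are assumptions for illustration, and this is not how Pandas Profiling itself computes its alerts.

```python
import numpy as np
import pandas as pd

# Toy data standing in for hcc.csv (all values are made up).
toy = pd.DataFrame(
    {
        "Age": [60, 60, 72, 55],
        "O2": [999, 999, 999, 999],
        "Ferritin": [100.0, 100.0, np.nan, 300.0],
    }
)

# DUPLICATES: rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# CONSTANT: columns holding a single distinct value.
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# MISSING: number of missing values per column.
missing_counts = toy.isna().sum()

print(n_duplicates)
print(constant_columns)
print(missing_counts)
```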
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
""" | ||
Comparison report example for HCC dataset | ||
""" | ||
import pandas as pd | ||
from sklearn.impute import SimpleImputer | ||
|
||
from pandas_profiling import ProfileReport | ||
|
||
if __name__ == "__main__": | ||
|
||
# Load the dataset | ||
df = pd.read_csv("hcc.csv") | ||
|
||
# Produce profile report | ||
original_report = ProfileReport(df, title="Original Data") | ||
original_report.to_file("original_report.html") | ||
|
||
# Drop duplicate rows | ||
df_transformed = df.copy() | ||
df_transformed = df_transformed.drop_duplicates() | ||
|
||
# Remove O2 | ||
df_transformed = df_transformed.drop(columns="O2") | ||
|
||
# Impute Missing Values | ||
mean_imputer = SimpleImputer(strategy="mean") | ||
df_transformed["Ferritin"] = mean_imputer.fit_transform( | ||
df_transformed["Ferritin"].values.reshape(-1, 1) | ||
) | ||
|
||
# Produce comparison report | ||
transformed_report = ProfileReport(df_transformed, title="Transformed Data") | ||
comparison_report = original_report.compare(transformed_report) | ||
comparison_report.to_file("original_vs_transformed.html") |
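Review note: the three transformation steps in this script (drop duplicates, drop `O2`, mean-impute `Ferritin`) could be grouped into one function so they can be tested without generating reports. This is a minimal sketch, not the script's code: the function name `clean` and the toy data are hypothetical, and the imputation uses pandas `fillna` with the column mean in place of scikit-learn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd


def clean(df, constant_col="O2", impute_col="Ferritin"):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    # Drop exact duplicate rows (returns a new frame).
    out = df.drop_duplicates()
    # Remove the constant feature.
    out = out.drop(columns=constant_col)
    # Fill missing values with the column mean, computed after deduplication.
    column_mean = out[impute_col].mean()
    return out.assign(**{impute_col: out[impute_col].fillna(column_mean)})


# Toy data standing in for hcc.csv (all values are made up).
toy = pd.DataFrame(
    {
        "Age": [60, 60, 72, 55],
        "O2": [999, 999, 999, 999],
        "Ferritin": [100.0, 100.0, np.nan, 300.0],
    }
)

cleaned = clean(toy)
print(cleaned)
```

As in the script, the mean is computed after duplicates are removed, so repeated rows do not weight the imputed value.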