Dataset statistics #5768

Merged
merged 68 commits into from
Jan 10, 2025
7e2db93
first pass at adding a pandas statistics to gollm task-runner
YohannParis Dec 5, 2024
cfa69f0
Merge branch 'main' into dataset-statistics
YohannParis Dec 5, 2024
000a912
Merge branch 'main' into dataset-statistics
YohannParis Dec 5, 2024
315ef25
Merge branch 'main' into dataset-statistics
YohannParis Dec 16, 2024
8397e13
Merge branch 'main' into dataset-statistics
YohannParis Dec 17, 2024
88218d4
remove the unnecessary util file
YohannParis Dec 17, 2024
3a67e93
Add messages and python/pandas type to ColumnType
YohannParis Dec 17, 2024
7785c63
Merge branch 'main' into dataset-statistics
YohannParis Dec 17, 2024
72a5622
Create DatasetStatistics.java
YohannParis Dec 17, 2024
466c018
Update DatasetStatistics.java
YohannParis Dec 17, 2024
9205194
Update DatasetService.java
YohannParis Dec 17, 2024
225f737
change when this is called
YohannParis Dec 17, 2024
b2c6efe
cleaning up
YohannParis Dec 17, 2024
dc68235
Merge branch 'main' into dataset-statistics
YohannParis Dec 17, 2024
8e32a59
add task notifications
YohannParis Dec 17, 2024
0424577
chore: lint types
github-actions[bot] Dec 17, 2024
88d91b1
Setup proper types for the column statistics
YohannParis Dec 18, 2024
9ca632d
Merge branch 'main' into dataset-statistics
YohannParis Dec 18, 2024
77e9de7
add the presignedURL
YohannParis Dec 18, 2024
2774c28
Update DatasetStatistics.java
YohannParis Dec 18, 2024
e583056
Do it properly
YohannParis Dec 18, 2024
e9e1106
Merge branch 'main' into dataset-statistics
YohannParis Dec 18, 2024
ddf812c
use the preseigned URL to pass to the task
YohannParis Dec 18, 2024
61c6407
Check that the response might be null
YohannParis Dec 18, 2024
b5c701a
better error checking
YohannParis Dec 18, 2024
8133bac
improve readibility and format response properly
YohannParis Dec 18, 2024
384217f
Update DatasetStatistics.java
YohannParis Dec 18, 2024
37fa82e
fix the issue with 🐍 to 🐫 case
YohannParis Dec 18, 2024
0c03b78
clean up how we return the response
YohannParis Dec 18, 2024
3b3e058
more 🐍 to 🐫 case
YohannParis Dec 18, 2024
b057a5f
Update DatasetStatistics.java
YohannParis Dec 18, 2024
5bab911
chore: lint types
github-actions[bot] Dec 18, 2024
8bfea09
Merge branch 'main' into dataset-statistics
YohannParis Dec 19, 2024
d34bf02
Merge branch 'dataset-statistics' of github.com:DARPA-ASKEM/terarium …
YohannParis Dec 19, 2024
f79ee49
Update DatasetStatistics.java
YohannParis Dec 19, 2024
4e02c1d
Merge branch 'main' into dataset-statistics
YohannParis Dec 19, 2024
2dbfb4a
added a maximum binning
YohannParis Dec 19, 2024
3384501
fix bining issues
YohannParis Dec 20, 2024
2b8dc16
Merge branch 'main' into dataset-statistics
YohannParis Dec 20, 2024
237df32
display better info
YohannParis Dec 20, 2024
cc0f54e
Merge branch 'main' into dataset-statistics
YohannParis Dec 20, 2024
9e23c69
Merge branch 'main' into dataset-statistics
YohannParis Jan 7, 2025
89678d6
Merge branch 'main' into dataset-statistics
YohannParis Jan 7, 2025
6249e26
Merge branch 'main' into dataset-statistics
YohannParis Jan 8, 2025
9f51d8c
Merge branch 'main' into dataset-statistics
YohannParis Jan 8, 2025
75ba23b
Merge branch 'main' into dataset-statistics
YohannParis Jan 8, 2025
3120c88
Merge branch 'main' into dataset-statistics
YohannParis Jan 8, 2025
453f3ab
rotate keys
dgauldie Jan 8, 2025
e4595bf
Merge remote-tracking branch 'origin/5985-task-rotate-keys' into data…
YohannParis Jan 8, 2025
84672ac
update secrets
dgauldie Jan 8, 2025
2e2cc71
Merge remote-tracking branch 'origin/5985-task-rotate-keys' into data…
YohannParis Jan 8, 2025
d9ad4c7
Merge branch 'main' into dataset-statistics
YohannParis Jan 8, 2025
dd73529
Merge branch 'main' into dataset-statistics
YohannParis Jan 9, 2025
a9fef76
Merge branch 'main' into dataset-statistics
YohannParis Jan 9, 2025
06715e0
Remove/update logs
YohannParis Jan 9, 2025
18e0959
first pass at displaying stats
YohannParis Jan 9, 2025
80728fd
Update tera-dataset-datatable.vue
YohannParis Jan 9, 2025
fb8d7a1
make things not mandatory
YohannParis Jan 9, 2025
680cb90
disable some tests for now.
YohannParis Jan 10, 2025
d81a060
Update DatasetControllerTests.java
YohannParis Jan 10, 2025
93e25bb
Update dataset_statistics.py
YohannParis Jan 10, 2025
07aac6d
Merge branch 'main' into dataset-statistics
YohannParis Jan 10, 2025
e15fd1a
Update number.ts
YohannParis Jan 10, 2025
a5bbffc
Update number.ts
YohannParis Jan 10, 2025
d5502c9
remove all stats
YohannParis Jan 10, 2025
ae826b3
chore: lint types
github-actions[bot] Jan 10, 2025
3cf0b6d
Update tera-dataset-datatable.vue
YohannParis Jan 10, 2025
21415e2
Merge branch 'dataset-statistics' of github.com:DARPA-ASKEM/terarium …
YohannParis Jan 10, 2025
@@ -3,8 +3,8 @@
<!-- Toggle histograms & column summary charts -->
<div class="datatable-toolbar">
<span class="datatable-toolbar-item">
{{ rawContent.headers.length || 'No' }} columns | {{ rawContent.rowCount || 'No' }} rows
{{ rawContent.rowCount != rawContent.csv.length ? '(' + rawContent.csv.length + ' showing)' : '' }}
{{ columns?.length || 'No' }} columns | {{ rowCount || 'No' }} rows
{{ rowCount != rawContent.csv.length ? '(' + rawContent.csv.length + ' showing)' : '' }}
</span>
<span class="datatable-toolbar-item" style="margin-left: auto">
Show column summaries<InputSwitch v-model="showSummaries" />
@@ -57,39 +57,42 @@
:header="colName"
:style="previousHeaders && !previousHeaders.includes(colName) ? 'border-color: green' : ''"
sortable
:frozen="index == 0"
:hidden="selectedColumns.includes(colName) ? false : true"
:hidden="!selectedColumns.includes(colName)"
>
<template #header v-if="!previewMode && !isEmpty(headerStats) && showSummaries">
<template #header v-if="!previewMode && showSummaries">
<!-- column summary charts below -->
<div class="column-summary">
<div class="column-summary-row">
<div class="column-summary-row" v-if="statistics.get(colName)?.maxValue">
<span class="column-summary-label">Max:</span>
<span class="column-summary-value">{{ headerStats?.[index].maxValue }}</span>
<span class="column-summary-value">{{ statistics.get(colName)?.maxValue }}</span>
</div>
<Chart
v-if="headerStats?.[index].chartData"
v-if="statistics.get(colName)?.chartData"
class="histogram"
type="bar"
:height="480"
:data="headerStats?.[index].chartData"
:data="statistics.get(colName)?.chartData"
:options="CHART_OPTIONS"
/>
<div class="column-summary-row max">
<div class="column-summary-row max" v-if="statistics.get(colName)?.minValue">
<span class="column-summary-label">Min:</span>
<span class="column-summary-value">{{ headerStats?.[index].minValue }}</span>
<span class="column-summary-value">{{ statistics.get(colName)?.minValue }}</span>
</div>
<div class="column-summary-row">
<div class="column-summary-row" v-if="statistics.get(colName)?.mean">
<span class="column-summary-label">Mean:</span>
<span class="column-summary-value">{{ headerStats?.[index].mean }}</span>
<span class="column-summary-value">{{ statistics.get(colName)?.mean }}</span>
</div>
<div class="column-summary-row">
<div class="column-summary-row" v-if="statistics.get(colName)?.median">
<span class="column-summary-label">Median:</span>
<span class="column-summary-value">{{ headerStats?.[index].median }}</span>
<span class="column-summary-value">{{ statistics.get(colName)?.median }}</span>
</div>
<div class="column-summary-row">
<div class="column-summary-row" v-if="statistics.get(colName)?.sd">
<span class="column-summary-label">SD:</span>
<span class="column-summary-value">{{ headerStats?.[index].sd }}</span>
<span class="column-summary-value">{{ statistics.get(colName)?.sd }}</span>
</div>
<div class="column-summary-row" v-if="statistics.get(colName)?.uniqueValues">
<span class="column-summary-label">Unique Values:</span>
<span class="column-summary-value">{{ statistics.get(colName)?.uniqueValues }}</span>
</div>
</div>
</template>
@@ -99,15 +102,15 @@
</template>

<script setup lang="ts">
import { isEmpty } from 'lodash';
import { ref, watch, nextTick } from 'vue';
import { ref, watch } from 'vue';
import DataTable from 'primevue/datatable';
import Column from 'primevue/column';
import type { CsvAsset } from '@/types/Types';
import type { CsvAsset, DatasetColumn } from '@/types/Types';
import MultiSelect from 'primevue/multiselect';
import Button from 'primevue/button';
import Chart from 'primevue/chart';
import InputSwitch from 'primevue/inputswitch';
import { displayNumber } from '@/utils/number';

type ChartData = {
labels: string[];
@@ -122,13 +125,26 @@ type ChartData = {
}[];
};

type ColumnStats = {
minValue?: string;
maxValue?: string;
mean?: string;
median?: string;
sd?: string;
bins?: number[];
chartData?: ChartData;
uniqueValues?: string;
};

const props = defineProps<{
rawContent: CsvAsset; // Temporary - this is also any in ITypeModel
rows?: number;
previewMode?: boolean;
previousHeaders?: String[] | null;
paginatorPosition?: 'bottom' | 'both' | 'top' | undefined;
tableStyle?: String;
columns?: DatasetColumn[];
rowCount?: number;
}>();

const CATEGORYPERCENTAGE = 1.0;
@@ -176,16 +192,7 @@ const CHART_OPTIONS = {

const showSummaries = ref(true);
const selectedColumns = ref<string[]>(props.rawContent.headers);
const headerStats = ref<
{
minValue: number;
maxValue: number;
mean: number;
median: number;
sd: number;
chartData: ChartData;
}[]
>([]);
const statistics = ref<Map<string, ColumnStats>>(new Map());

// Given the bins for a column set up the object needed for the chart.
const setBarChartData = (bins: number[]): ChartData => {
@@ -197,7 +204,7 @@ const setBarChartData = (bins: number[]): ChartData => {
dummyLabels.push(i.toString());
}
return {
labels: ['Bin 1', 'Bin 2', 'Bin 3', 'Bin 4', 'Bin 5', 'Bin 6', 'Bin 7', 'Bin 8', 'Bin 9', 'Bin 10'].reverse(),
labels: dummyLabels,
datasets: [
{
label: 'Count',
@@ -212,26 +219,27 @@ const setBarChartData = (bins: number[]): ChartData => {
};
};

// TODO: We should be using a formatter from number.ts, not sure why we are formatting it like this (ask Yohann when he's back)
function roundStat(stat: number) {
return Math.round(stat * 1000) / 1000;
}

watch(
() => props.rawContent,
async () => {
await nextTick();
headerStats.value =
props.rawContent.stats?.map((stat) => ({
minValue: roundStat(stat.minValue),
maxValue: roundStat(stat.maxValue),
mean: roundStat(stat.mean),
median: roundStat(stat.median),
sd: roundStat(stat.sd),
chartData: setBarChartData(stat.bins)
})) ?? [];
// eslint-disable-next-line
props.rawContent.headers.sort();
() => props?.columns,
(value, oldValue) => {
if (!value || value === oldValue) return;
const stats = new Map<string, ColumnStats>();
value.forEach((column) => {
const columnStats: ColumnStats = {};
if (column.stats?.numericStats) {
columnStats.minValue = displayNumber(column.stats.numericStats.min);
columnStats.maxValue = displayNumber(column.stats.numericStats.max);
columnStats.mean = displayNumber(column.stats.numericStats.mean);
columnStats.median = displayNumber(column.stats.numericStats.median);
columnStats.sd = displayNumber(column.stats.numericStats.std_dev);
columnStats.chartData = setBarChartData(column.stats.numericStats.histogram_bins);
columnStats.uniqueValues = displayNumber(column.stats.numericStats.unique_values);
} else if (column.stats?.nonNumericStats) {
columnStats.uniqueValues = displayNumber(column.stats.nonNumericStats.unique_values);
}
stats.set(column.name, columnStats);
});
statistics.value = stats;
},
{ immediate: true }
);
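For readers following the new watcher, here is a minimal standalone sketch of the same idea: reduce a list of columns into a `Map` keyed by column name, filling the numeric fields only when numeric statistics are present. The `formatStat` helper is a hypothetical stand-in for `displayNumber` from `@/utils/number`, the trimmed-down types are assumptions that mirror only the fields used here, and the sample columns are made up for illustration.

```typescript
// Assumed, trimmed-down shapes; the real DatasetColumn carries more fields.
type NumericStats = { min: number; max: number; mean: number; unique_values: number };
type NonNumericStats = { unique_values: number };
type Column = {
  name: string;
  stats?: { numericStats?: NumericStats; nonNumericStats?: NonNumericStats };
};
type ColumnSummary = { minValue?: string; maxValue?: string; mean?: string; uniqueValues?: string };

// Hypothetical stand-in for displayNumber; it only stringifies the value.
const formatStat = (n: number): string => String(n);

function summarizeColumns(columns: Column[]): Map<string, ColumnSummary> {
  const summaries = new Map<string, ColumnSummary>();
  for (const column of columns) {
    const summary: ColumnSummary = {};
    const numeric = column.stats?.numericStats;
    const nonNumeric = column.stats?.nonNumericStats;
    if (numeric) {
      summary.minValue = formatStat(numeric.min);
      summary.maxValue = formatStat(numeric.max);
      summary.mean = formatStat(numeric.mean);
      summary.uniqueValues = formatStat(numeric.unique_values);
    } else if (nonNumeric) {
      // Non-numeric columns only expose a unique-value count in this sketch.
      summary.uniqueValues = formatStat(nonNumeric.unique_values);
    }
    // Every column gets an entry, even when no statistics are available.
    summaries.set(column.name, summary);
  }
  return summaries;
}

// Illustrative input only.
const summaries = summarizeColumns([
  { name: 'age', stats: { numericStats: { min: 0, max: 90, mean: 41.5, unique_values: 80 } } },
  { name: 'city', stats: { nonNumericStats: { unique_values: 12 } } },
  { name: 'notes' }
]);
console.log(summaries.get('age')?.maxValue);
console.log(summaries.get('city')?.uniqueValues);
```

Because every column gets an entry, a template lookup such as `statistics.get(colName)?.mean` resolves to `undefined` for columns without statistics, which is what the `v-if` guards on each summary row rely on.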
@@ -37,7 +37,13 @@
</AccordionTab>
<AccordionTab header="Data" v-if="!isEmpty(dataset?.fileNames)">
<tera-progress-spinner v-if="!rawContent" :font-size="2" is-centered />
<tera-dataset-datatable v-else :rows="100" :raw-content="rawContent" />
<tera-dataset-datatable
v-else
:rows="100"
:raw-content="rawContent"
:columns="dataset?.columns ?? []"
:row-count="dataset?.metadata?.['total_rows'] ?? 0"
/>
</AccordionTab>
</Accordion>
</tera-asset>
28 changes: 27 additions & 1 deletion packages/client/hmi-client/src/types/Types.ts
@@ -118,7 +118,6 @@ export interface ChartAnnotation extends TerariumAsset {

export interface CsvAsset {
csv: string[][];
stats?: CsvColumnStats[];
headers: string[];
rowCount: number;
}
@@ -194,6 +193,7 @@ export interface DatasetColumn extends TerariumEntity {
name: string;
fileName: string;
dataType: ColumnType;
stats?: DatasetColumnStats;
formatStr?: string;
annotations: string[];
metadata?: any;
@@ -202,6 +202,11 @@ export interface DatasetColumn extends TerariumEntity {
dataset?: Dataset;
}

export interface DatasetColumnStats {
numericStats: NumericColumnStats;
nonNumericStats: NonNumericColumnStats;
}

export interface DocumentAsset extends TerariumAsset {
userId?: string;
documentUrl?: string;
@@ -756,6 +761,26 @@ export interface Links {
self: string;
}

export interface NumericColumnStats {
mean: number;
median: number;
min: number;
max: number;
quartiles: number[];
data_type: string;
std_dev: number;
unique_values: number;
missing_values: number;
histogram_bins: number[];
}

export interface NonNumericColumnStats {
data_type: string;
unique_values: number;
most_common: { [index: string]: number };
missing_values: number;
}

export interface DocumentExtraction {
fileName: string;
assetType: ExtractionAssetType;
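A small sketch of how a consumer might read the new stats types, offered as an illustration rather than anything taken from this PR. It assumes `numericStats` and `nonNumericStats` are each possibly absent at runtime (the interface declares both as required, but the datatable watcher checks them as if either may be missing), and `mostCommonValue` is a hypothetical helper name. Only a subset of the fields is reproduced.

```typescript
// Assumed subsets of the interfaces added in Types.ts.
type NumericColumnStats = { unique_values: number; missing_values: number };
type NonNumericColumnStats = {
  unique_values: number;
  missing_values: number;
  most_common: { [index: string]: number };
};
// Assumption: both members treated as optional, unlike the declared interface.
type DatasetColumnStats = {
  numericStats?: NumericColumnStats;
  nonNumericStats?: NonNumericColumnStats;
};

// Hypothetical helper: the value with the highest count, or undefined if none.
function mostCommonValue(stats: NonNumericColumnStats): string | undefined {
  let best: string | undefined;
  let bestCount = -Infinity;
  for (const [value, count] of Object.entries(stats.most_common)) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
    }
  }
  return best;
}

// Unique-value count regardless of which kind of stats is present.
function uniqueCount(stats: DatasetColumnStats): number | undefined {
  return stats.numericStats?.unique_values ?? stats.nonNumericStats?.unique_values;
}
```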
@@ -1112,6 +1137,7 @@ export enum ClientEventType {
TaskGollmGenerateSummary = "TASK_GOLLM_GENERATE_SUMMARY",
TaskGollmInterventionsFromDocument = "TASK_GOLLM_INTERVENTIONS_FROM_DOCUMENT",
TaskGollmModelCard = "TASK_GOLLM_MODEL_CARD",
TaskGollmDatasetStatistics = "TASK_GOLLM_DATASET_STATISTICS",
TaskMiraAmrToMmt = "TASK_MIRA_AMR_TO_MMT",
TaskMiraGenerateModelLatex = "TASK_MIRA_GENERATE_MODEL_LATEX",
TaskMiraCompareModelsConcepts = "TASK_MIRA_COMPARE_MODELS_CONCEPTS",
11 changes: 8 additions & 3 deletions packages/client/hmi-client/src/utils/number.ts
@@ -14,8 +14,8 @@ export function exponentialToNumber(num: string): string {
* @param {string} num - The number string to convert.
* @returns {string} The number in NIST form.
*/
export function numberToNist(num: string) {
num = num.replace(/\s/g, '');
export function numberToNist(num: string | number) {
num = num.toString().replace(/\s/g, '');

if (Number.isNaN(Number(num))) return '';

@@ -73,7 +73,8 @@ export function nistToNumber(numStr: string): number {
* @param {string} num - The number string to display.
* @returns {string} The number in either exponential form or NIST form.
*/
export function displayNumber(num: string): string {
export function displayNumber(num: string | number): string {
num = num.toString().replace(/\s/g, '');
const number = fixPrecisionError(parseFloat(num));
if (countDigits(number) > 6) return number.toExponential(3);
return numberToNist(number.toString());
@@ -148,3 +149,7 @@ export function toScientificNotation(num: number) {

return { mantissa, exponent };
}

export function roundNumber(number: number): number {
return Math.round(number * 1000) / 1000;
}
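To make the behavior of the new `roundNumber` helper concrete, here is a tiny standalone sketch using the same arithmetic (scale by 1000, round to the nearest integer, scale back). The name `roundToThreeDecimals` is hypothetical and used only so the snippet runs on its own.

```typescript
// Same arithmetic as roundNumber in the diff, under a hypothetical name.
function roundToThreeDecimals(value: number): number {
  return Math.round(value * 1000) / 1000;
}

console.log(roundToThreeDecimals(1.23456)); // 1.235
console.log(roundToThreeDecimals(2.5)); // 2.5
console.log(roundToThreeDecimals(0.0004)); // 0
```

Since the result is a `number` rather than a string, trailing zeros are not preserved, so this kind of helper suits computation more than display formatting.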
4 changes: 4 additions & 0 deletions packages/gollm/entities.py
@@ -20,6 +20,10 @@ class ConfigureModelDataset(BaseModel):
matrix: str = None


class DatasetStatistics(BaseModel):
datasetUrl: str # expects a URL of a CSV file


class DatasetCardModel(BaseModel):
dataset: str # expects a stringified JSON object
research_paper: str = None
3 changes: 2 additions & 1 deletion packages/gollm/setup.py
@@ -31,7 +31,8 @@
"gollm:chart_annotation=tasks.chart_annotation:main",
"gollm:generate_summary=tasks.general_query:main",
"gollm:interventions_from_document=tasks.interventions_from_document:main",
"gollm:model_card=tasks.model_card:main"
"gollm:model_card=tasks.model_card:main",
"gollm:dataset_statistics=tasks.dataset_statistics:main",
],
},
python_requires=">=3.11",