[Feature] Added ML-BOM examples #50

Open
wants to merge 5 commits into
base: master
119 changes: 119 additions & 0 deletions MLBOM/Dataset/bom.json
@@ -0,0 +1,119 @@
{
"$schema": "http://cyclonedx.org/schema/bom-1.6.schema.json",
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"serialNumber": "urn:uuid:75de3b9b-9e53-4421-a259-11f18afc22bf",
"version": 1,
"metadata": {
"timestamp": "2024-11-24T13:10:49Z"
},
"components": [
{
"type": "data",
Member

This is a data component, but it does not include a data property. Data components should include the data property; without it, the consumer does not know what kind of data this is (e.g. configuration, source code, dataset, etc.). Refer to https://cyclonedx.org/docs/1.6/json/#components_items_data

Author

I added the data property. I noticed that it also includes a contents.url property. Should I put the Hugging Face link there, in externalReferences.url, or in both?

"supplier": {
"name": "Wikimedia"
},
"manufacturer": {
"name": "Wikimedia"
},
"publisher": "Hugging Face Inc",
"name": "wikipedia",
"version": "b04c8d1ceb2f5cd4588862100d08de323dccfbaa",
"data": [
{
"type": "dataset",
"name": "wikipedia",
"contents": {
"url": "https://huggingface.co/datasets/wikimedia/wikipedia"
}
}
],
"licenses": [
{
"license": {
"id": "CC-BY-SA-3.0",
"url": "https://spdx.org/licenses/CC-BY-SA-3.0.html"
}
},
{
"license": {
"id": "GFDL-1.3",
"url": "https://www.gnu.org/licenses/fdl-1.3.en.html"
}
}
],
"externalReferences": [
{
"type": "website",
"url": "https://huggingface.co/datasets/wikimedia/wikipedia"
}
],
"hashes": [
{
"alg": "SHA-1",
"content": "b04c8d1ceb2f5cd4588862100d08de323dccfbaa"
}
],
"properties": [
{
"name": "task_categories",
"value": "text-generation"
},
{
"name": "task_categories",
"value": "fill-mask"
},
{
"name": "task_ids",
"value": "language-modeling"
},
{
"name": "task_ids",
"value": "masked-language-modeling"
},
{
"name": "language",
"value": "en"
},
{
"name": "language",
"value": "es"
},
{
"name": "size_categories",
"value": "10M<n<100M"
},
{
"name": "format",
"value": "parquet"
},
{
"name": "modality",
"value": "text"
},
{
"name": "library",
"value": "datasets"
},
{
"name": "library",
"value": "dask"
},
{
"name": "library",
"value": "mlcroissant"
},
{
"name": "library",
"value": "polars"
},
{
"name": "region",
"value": "us"
}
]
}
]
}
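The review note on this file (a component of type `data` should carry a `data` property) can be checked mechanically. The following is a minimal sketch and not part of this PR: `components_missing_data` is a hypothetical helper name, and the sample BOM is a trimmed-down stand-in for the file above with one deliberately incomplete component added.

```python
import json


def components_missing_data(bom: dict) -> list[str]:
    """Return names of `data` components that have no `data` property."""
    missing = []
    stack = list(bom.get("components", []))
    while stack:
        component = stack.pop()
        if component.get("type") == "data" and not component.get("data"):
            missing.append(component.get("name", "<unnamed>"))
        # Components may nest further components, so keep walking.
        stack.extend(component.get("components", []))
    return missing


# Trimmed-down sample; "no-data-property" is a made-up component
# that reproduces the issue raised in the review.
sample = json.loads("""
{
  "components": [
    {"type": "data", "name": "wikipedia",
     "data": [{"type": "dataset", "name": "wikipedia"}]},
    {"type": "data", "name": "no-data-property"}
  ]
}
""")

print(components_missing_data(sample))
```

A check like this could run in CI over every example BOM, alongside full schema validation rather than instead of it.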
47 changes: 47 additions & 0 deletions MLBOM/Model/FoundationModels/bom.json
@@ -0,0 +1,47 @@
{
"$schema": "http://cyclonedx.org/schema/bom-1.6.schema.json",
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"serialNumber": "urn:uuid:56315ffe-c0af-4474-9c11-c94d1af986a9",
"version": 1,
"metadata": {
"timestamp": "2024-11-24T13:05:42Z",
"manufacturer": {
"name": "Noma Security Inc."
}
},
"components": [
Member

Do you want to include a data subcomponent within the machine-learning-model? If so, that would tell the consumer information about the data used in this foundation model.

This component does not include the modelCard property, which it should.

Refer to:

Author

  1. I struggled to decide which information about GPT-4o is worth adding to the modelCard, since we have no access to the datasets it was trained on, its complete ethical considerations, and so on. I added a proposal that describes the model architecture and the inputs and outputs. Let me know what you think :)

  2. Maybe I misunderstood, but as far as I understand, components that are not of type data should not define the data property. Can you please advise?
    (quote from here)

{
"type": "machine-learning-model",
"supplier": {
"name": "OpenAI Inc"
},
"manufacturer": {
"name": "OpenAI Inc"
},
"publisher": "OpenAI Inc",
"name": "gpt-4o",
"modelCard": {
"modelParameters": {
"modelArchitecture": "GPT-4",
"inputs": [
{
"format": "string"
},
{
"format": "image"
}
],
"outputs": [
{
"format": "string"
},
{
"format": "image"
}
]
}
}
}
]
}
87 changes: 87 additions & 0 deletions MLBOM/Model/OpenSource/bom.json
@@ -0,0 +1,87 @@
{
"$schema": "http://cyclonedx.org/schema/bom-1.6.schema.json",
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"serialNumber": "urn:uuid:21d0b6f8-f5b0-44df-8587-79c5d70cd1da",
"version": 1,
"metadata": {
"timestamp": "2024-11-24T13:10:49Z"
},
"components": [
{
"type": "machine-learning-model",
Member

Same issue here. This machine-learning-model component does not have a modelCard property, and ideally it should also have a data subcomponent.

Author

  1. I added a modelCard that lists the datasets this Hugging Face model was trained on.
  2. (Same as above) Maybe I misunderstood, but as far as I understand, components that are not of type data should not define the data property. Can you please advise?
    (quote from here)

"supplier": {
"name": "google-bert"
},
"manufacturer": {
"name": "google-bert"
},
"publisher": "Hugging Face Inc",
"name": "bert-base-cased",
"version": "cd5ef92a9fb2f889e972770a36d4ed042daf221e",
"licenses": [
{
"license": {
"id": "Apache-2.0",
"url": "https://www.apache.org/licenses/LICENSE-2.0"
}
}
],
"externalReferences": [
{
"type": "website",
"url": "https://huggingface.co/google-bert/bert-base-cased"
}
],
"hashes": [
{
"alg": "SHA-1",
"content": "cd5ef92a9fb2f889e972770a36d4ed042daf221e"
}
],
"modelCard": {
"modelParameters": {
"datasets": [
{
"type": "dataset",
"name": "legacy-datasets/wikipedia",
"contents": {
"url": "https://huggingface.co/datasets/legacy-datasets/wikipedia"
},
"description": "Wikipedia dataset containing cleaned articles of all languages."
},
{
"type": "dataset",
"name": "bookcorpus/bookcorpus",
"contents": {
"url": "https://huggingface.co/datasets/bookcorpus/bookcorpus"
},
"description": "A corpus of fine-grained information and high-level semantics text"
}
]
}
},
"properties": [
{
"name": "region",
"value": "us"
}
],
"tags": [
"transformers",
"pytorch",
"tf",
"jax",
"safetensors",
"bert",
"fill-mask",
"exbert",
"en",
"arxiv:1810.04805",
"autotrain_compatible",
"endpoints_compatible"
]
}
]
}
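One possible reading of the "data subcomponent" suggestion is to nest components of type `data` under the model's own `components` array, so that the `data` property stays on `data`-typed components only. The sketch below illustrates that reading in Python; it is an assumption about the reviewer's intent, `with_data_subcomponents` is a hypothetical helper, and the dataset names and URLs are the ones already listed in the `modelCard` above.

```python
import json


def with_data_subcomponents(model: dict, datasets: list[dict]) -> dict:
    """Return a copy of `model` with one nested `data` component per dataset."""
    nested = [
        {
            "type": "data",
            "name": ds["name"],
            # The `data` property lives on the data-typed child, not on the model.
            "data": [
                {"type": "dataset", "name": ds["name"], "contents": {"url": ds["url"]}}
            ],
        }
        for ds in datasets
    ]
    return {**model, "components": model.get("components", []) + nested}


model = {"type": "machine-learning-model", "name": "bert-base-cased"}
datasets = [
    {"name": "legacy-datasets/wikipedia",
     "url": "https://huggingface.co/datasets/legacy-datasets/wikipedia"},
    {"name": "bookcorpus/bookcorpus",
     "url": "https://huggingface.co/datasets/bookcorpus/bookcorpus"},
]

enriched = with_data_subcomponents(model, datasets)
print(json.dumps(enriched, indent=2))
```

The helper returns a new dict instead of mutating `model`, so the same base component can be reused for variants with and without nested datasets.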
28 changes: 28 additions & 0 deletions MLBOM/README.md
@@ -0,0 +1,28 @@
# Machine Learning Bill of Materials (ML-BOM)

Organizations often lack transparency into how their machine learning and AI models are created, used, and managed across their lifecycle. The Machine Learning Bill of Materials (ML-BOM) builds on CycloneDX to offer a detailed representation of machine learning models, datasets, and related artifacts. ML-BOM empowers organizations to document, manage, and secure their machine learning assets while enhancing visibility into model lineage and mitigating supply chain risks.

## Features of ML-BOM
- Captures machine learning models, datasets, libraries, and their interdependencies.
- Documents comprehensive metadata about models, including architecture, training datasets, performance metrics, and ethical considerations.
- Facilitates model lineage tracking, integrating it into the lifecycle management of ML components from design to decommission.
- Enhances transparency by illustrating how software incorporates ML/AI components and embedding them within the broader SBOM framework.
- Highlights critical details about model biases and ethical implications stemming from training datasets, while identifying and classifying the presence of sensitive data in datasets or trained models.

## Key Components

### 1. **Machine Learning Models**
ML-BOM can document models and their parameters, including model architecture, performance metrics, and ethical and fairness considerations.

### 2. **Datasets**
Datasets used for training, validation, and inference can be described with:
- **Data Classification**: Tags to specify sensitivity and value.
- **Data Governance**: Ownership, stewardship, and custodianship details.
- **Sensitive Data**: Annotations for datasets containing sensitive information.
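
As an illustration of the three bullets above, the sketch below builds a dataset entry that carries `classification`, `governance`, and `sensitiveData`. The field names are assumed from the CycloneDX 1.6 component `data` schema. Every value (the dataset name, the classification label, the organization names, the sensitive-data annotation) is a hypothetical placeholder and says nothing about any real dataset.

```python
import json

# All values below are hypothetical placeholders for illustration.
dataset = {
    "type": "dataset",
    "name": "example-training-set",
    "classification": "internal",           # Data Classification
    "sensitiveData": ["email addresses"],   # Sensitive Data annotations
    "governance": {                         # Data Governance
        "owners": [{"organization": {"name": "Example Org"}}],
        "stewards": [{"organization": {"name": "Example Data Team"}}],
        "custodians": [{"organization": {"name": "Example Hosting Provider"}}],
    },
}

# Wrap the entry in a component of type `data`, as the examples in this directory do.
component = {"type": "data", "name": dataset["name"], "data": [dataset]}
print(json.dumps(component, indent=2))
```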

### 3. **Libraries**
ML-BOM provides a detailed overview of the ML/AI libraries that models depend on, including their versions, licenses, and security considerations, so that library usage stays transparent and traceable.
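
One way to express such library dependencies is the top-level `dependencies` array, which links components through their `bom-ref` values. The sketch below assumes a model that depends on two libraries. The `bom-ref` identifiers are hypothetical, and versions and licenses are left out rather than guessed.

```python
import json

# bom-ref values are hypothetical identifiers chosen for this example.
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "version": 1,
    "components": [
        {"type": "machine-learning-model", "bom-ref": "model-bert", "name": "bert-base-cased"},
        {"type": "library", "bom-ref": "lib-transformers", "name": "transformers"},
        {"type": "library", "bom-ref": "lib-pytorch", "name": "pytorch"},
    ],
    "dependencies": [
        {"ref": "model-bert", "dependsOn": ["lib-transformers", "lib-pytorch"]},
    ],
}

# Every ref used in `dependencies` should resolve to a declared component.
declared = {c["bom-ref"] for c in bom["components"]}
dangling = [
    ref
    for dep in bom["dependencies"]
    for ref in [dep["ref"], *dep.get("dependsOn", [])]
    if ref not in declared
]

print(json.dumps(bom["dependencies"], indent=2))
print("dangling refs:", dangling)
```

Keeping the relationship in `dependencies` rather than nesting the libraries under the model lets several models point at the same library component.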


## High-Level Object Model
![CycloneDX Object Model Swimlane](https://cyclonedx.org/theme/assets/images/CycloneDX-Object-Model-Swimlane.svg)
1 change: 1 addition & 0 deletions README.md
@@ -15,6 +15,7 @@ are categorized by different BOM types including:
|--------------------|-----------------------------------------|
| [CBOM](CBOM) | Cryptography Bill of Materials |
| [HBOM](HBOM) | Hardware Bill of Materials |
| [MLBOM](MLBOM) | Machine Learning Bill of Materials |
| [OBOM](OBOM) | Operations Bill of Materials |
| [SaaSBOM](SaaSBOM) | Software-as-a-Service Bill of Materials |
| [SBOM](SBOM) | Software Bill of Materials |