Create notebook tutorials for distributed data classifiers #415

Merged
merged 13 commits into from
Jan 23, 2025
Collaborator Author

The link on the Hugging Face page will need to be replaced with the new one: https://huggingface.co/nvidia/domain-classifier.

@@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Distributed Data Classification with Domain and Quality Classifiers\n",
"# Distributed Data Classification with NeMo Curator's `DomainClassifier`\n",
"\n",
"The notebook demonstrates the use of two classifiers for distributed data classification, including domain and quality classifiers. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of the data, while the [quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) is used to classify the quality of the data. These classifers help with annotation which helps data blending for foundation model training.\n",
"This notebook demonstrates the use of NeMo Curator's `DomainClassifier`. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of a text. It helps with data annotation, which is useful in data blending for foundation model training.\n",
"\n",
"The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets."
"The domain classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets."
]
},
{
@@ -39,7 +39,7 @@
"outputs": [],
"source": [
"from nemo_curator import get_client\n",
"from nemo_curator.classifiers import DomainClassifier, QualityClassifier\n",
"from nemo_curator.classifiers import DomainClassifier\n",
"from nemo_curator.datasets import DocumentDataset\n",
"import cudf\n",
"import dask_cudf"
@@ -49,7 +49,15 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cuDF Spilling is enabled\n"
]
}
],
"source": [
"client = get_client(cluster_type=\"gpu\")"
]
@@ -63,7 +71,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -74,23 +82,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create a Classifier"
"# Prepare Text Data and Initialize Classifier"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"classifier_type = \"DomainClassifier\" # or \"QualityClassifier\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Create sample DataFrame\n",
"text = [\n",
@@ -119,18 +118,11 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"if classifier_type == \"DomainClassifier\":\n",
" classifier = DomainClassifier(batch_size=1024)\n",
"\n",
"elif classifier_type == \"QualityClassifier\":\n",
" classifier = QualityClassifier(batch_size=1024)\n",
"\n",
"else:\n",
" raise ValueError(\"Invalid classifier type\")"
"classifier = DomainClassifier(batch_size=1024)"
]
},
{
@@ -139,35 +131,22 @@
"source": [
"# Run the Classifier\n",
"\n",
"Dask operations are lazy, so the the classifier will not run until we call a eager operation like `to_json`, `compute` or `persist`. "
"Dask operations are lazy, so the the classifier will not run until we call an eager operation like `to_json`, `compute`, or `persist`. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting domain classifier inference\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"GPU: 0, Part: 0: 100%|██████████| 10/10 [00:04<00:00, 2.12it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing to disk complete for 1 partitions\n",
"CPU times: user 393 ms, sys: 244 ms, total: 638 ms\n",
"Wall time: 6.04 s\n"
"Starting domain classifier inference\n",
"Writing to disk complete for 1 partition(s)\n",
"CPU times: user 2.56 s, sys: 1.65 s, total: 4.21 s\n",
"Wall time: 19.5 s\n"
]
}
],
@@ -187,7 +166,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -268,20 +247,20 @@
"4 Traveling to Europe during the off-season can ... "
]
},
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output_dataset = DocumentDataset.read_json(output_file_path, backend=\"cudf\", add_filename=write_to_filename)\n",
"output_dataset.df.head()"
"output_dataset.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "NeMo-Curator-env-2",
"display_name": "nemo_curator",
"language": "python",
"name": "python3"
},
@@ -295,7 +274,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.10.15"
}
},
"nbformat": 4,
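Note on the "Run the Classifier" cell: it states that Dask operations are lazy and that the classifier only runs once an eager operation such as `to_json`, `compute`, or `persist` is called. The sketch below illustrates that idea in plain Python, without a GPU. It is a toy stand-in, not Dask's or NeMo Curator's API: `LazyPipeline`, `fake_classify`, and the labels it returns are hypothetical names made up for illustration.

```python
from typing import Callable, List


class LazyPipeline:
    """Toy stand-in for a lazy task graph (hypothetical; not Dask's API)."""

    def __init__(self, data: List[str]) -> None:
        self._data = data
        self._steps: List[Callable[[str], str]] = []
        self.calls = 0  # number of times a recorded step has actually run

    def map(self, fn: Callable[[str], str]) -> "LazyPipeline":
        # Only record the step; no work happens here.
        self._steps.append(fn)
        return self

    def compute(self) -> List[str]:
        # The eager operation: run every recorded step on every item.
        results = []
        for item in self._data:
            for fn in self._steps:
                item = fn(item)
                self.calls += 1
            results.append(item)
        return results


def fake_classify(text: str) -> str:
    # Placeholder for model inference; the labels are invented for this sketch.
    return "Travel" if "travel" in text.lower() else "Other"


pipeline = LazyPipeline(["Traveling to Europe", "Quantum computing"]).map(fake_classify)
calls_before = pipeline.calls   # nothing has run yet
labels = pipeline.compute()     # work happens only here
calls_after = pipeline.calls
```

In the notebook, calling the classifier on the dataset plays the role of `map` here, and writing the result with `to_json` plays the role of `compute`, which is why the inference log and timing appear only in the output of the write cell.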