
[GSProcessing] Enforce re-order for node label processing during classification #1136

Open · thvasilo wants to merge 3 commits into main from reorder-node-ids

Conversation

@thvasilo (Contributor) commented on Jan 17, 2025

Issue #, if available:

Fixes #1135
Fixes #1138

Description of changes:

  • We guarantee ordering for node classification labels by sorting the transformed label DataFrame on the NODE_INT_MAPPING id after processing, and we do the same for the masks.
  • Because Spark gives no ordering guarantee when writing to Parquet, even for a sorted DataFrame, we collect the labels and masks into a Pandas DataFrame on the Spark leader and write that out using pyarrow (see the sketch below this list).
  • NOTE: DistPart requires at least N files to be present for every node/edge type, where N is the number of requested partitions, so when writing back the masks and labels we still create as many files as there were incoming partitions.
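
For concreteness, here is a minimal sketch of the flow described above. This is not the PR's actual code: the NODE_INT_MAPPING column value, the helper name, and the paths are illustrative assumptions.

```python
# Illustrative sketch only -- column name, helper name, and paths are assumed.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import DataFrame

NODE_INT_MAPPING = "node_int_id"  # assumed name of the integer node id column


def write_ordered_labels(label_df: DataFrame, out_prefix: str, num_files: int) -> None:
    # Sort on the integer node id so row order matches the node id mapping,
    # then collect to the Spark leader, where a single writer controls order.
    ordered_pd = label_df.sort(NODE_INT_MAPPING).toPandas()
    table = pa.Table.from_pandas(ordered_pd, preserve_index=False)

    # DistPart needs at least num_files files per node/edge type, so split the
    # ordered table into that many consecutive row slices (possibly empty) and
    # write each slice to its own Parquet file.
    Path(out_prefix).mkdir(parents=True, exist_ok=True)
    rows_per_file = -(-table.num_rows // num_files)  # ceiling division
    for file_idx in range(num_files):
        start = file_idx * rows_per_file
        length = max(0, min(rows_per_file, table.num_rows - start))
        pq.write_table(
            table.slice(start, length),
            f"{out_prefix}/part-{file_idx:05d}.parquet",
        )
```

Collecting to the leader trades distributed writes for a hard ordering guarantee that Spark's Parquet writer does not provide on its own.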

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo added the ready (able to trigger the CI), gsprocessing (for issues and PRs related to the GSProcessing library), and 0.4.1 labels on Jan 17, 2025
@thvasilo force-pushed the reorder-node-ids branch 2 times, most recently from 4d0afcd to 865d943 on January 17, 2025 19:28
@thvasilo marked this pull request as ready for review on January 17, 2025 19:42
@thvasilo requested a review from jalencato on January 17, 2025 19:42
@thvasilo self-assigned this on Jan 17, 2025
@thvasilo added this to the 0.4.1 release milestone on Jan 17, 2025
@jalencato (Collaborator) left a comment:

Do we have any performance numbers comparing the built-in Spark method and the pandas UDF method?

@thvasilo added the bug (Something isn't working) label on Jan 22, 2025
@jalencato (Collaborator) left a comment:

Overall LGTM. Let's keep an eye on further experiments.

@@ -18,6 +18,8 @@
import logging
from typing import Any, Dict, Optional

from graphstorm_processing.constants import VALID_TASK_TYPES

@jalencato (Collaborator):
Remove the additional line.

@@ -194,8 +197,6 @@ def create_new_relative_path_from_existing(
"path/to/parquet-repartitioned-my-suffix/part-00003-filename.snappy.parquet"
"""
original_relative_path_obj = Path(original_relative_path)
# We expect files to have a path of the form /path/to/parquet/part-00001.snappy.parquet
assert original_relative_path_obj.parts[-2] == "parquet" # TODO: Remove this assumption?

@jalencato (Collaborator):
Did we run into any issue with this?

# Generate Spark-style part filename
part_filename = os.path.join(
    base_path,
    f"part-{file_idx:05d}-{unique_id}.parquet",
)

@jalencato (Collaborator):
So unlike PySpark, pyarrow does not support built-in part-file naming?

@thvasilo (Contributor, author):
Yes. We could try using pyarrow.dataset to do the writing, which would provide that, but it actually complicates things. Writing each file ourselves gives us explicit control over how rows are distributed across files.
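
For comparison, here is a hedged sketch of the pyarrow.dataset alternative mentioned above, which the PR deliberately avoids: write_dataset can auto-generate Spark-style names via basename_template, but the row-to-file distribution is then steered indirectly through row-count limits rather than by slicing rows explicitly. Data and paths are illustrative.

```python
# Sketch of the pyarrow.dataset alternative; illustrative data and paths.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"label": [0, 1, 2, 3, 4, 5]})
ds.write_dataset(
    table,
    base_dir="path/to/parquet",
    format="parquet",
    # "{i}" is auto-incremented: part-0.parquet, part-1.parquet, ...
    basename_template="part-{i}.parquet",
    # File count is controlled indirectly via row limits, not explicit slices.
    max_rows_per_file=2,
    max_rows_per_group=2,
    existing_data_behavior="overwrite_or_ignore",
)
```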

node_int_ids = np.arange(total_data_points)

rng = np.random.default_rng(42)
# Create random numerical labels with values 0-9

@jalencato (Collaborator):
Do we cover the case of None values?
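
A hypothetical way to cover that case in the test above (not code from the PR; it reuses the snippet's rng and total_data_points, and the labels variable name is assumed):

```python
# Hypothetical extension of the test fixture to include missing labels.
labels = rng.integers(0, 10, size=total_data_points).astype(object)
# Blank out ~10% of the labels to simulate nodes without a label.
none_idx = rng.choice(total_data_points, size=total_data_points // 10, replace=False)
labels[none_idx] = None
```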

Labels: 0.4.1 · bug (Something isn't working) · gsprocessing (For issues and PRs related to the GSProcessing library) · ready (able to trigger the CI)
Projects: None yet
2 participants: thvasilo, jalencato