
[GSProcessing] Enforce re-order for node label processing during classification #1136

Open · thvasilo wants to merge 3 commits into main from reorder-node-ids

Conversation

@thvasilo (Contributor) commented on Jan 17, 2025

Issue #, if available:

Fixes #1135
Fixes #1138

Description of changes:

  • We guarantee ordering for node classification labels by sorting the transformed label DataFrame on the NODE_INT_MAPPING id after processing, and we do the same for the masks.
  • Because Spark gives no ordering guarantee when writing to Parquet, even for a sorted DataFrame, we collect the labels and masks into a Pandas DataFrame on the Spark leader and write that out using pyarrow (see the sketch below this list).
  • NOTE: DistPart requires at least N files to be present for every node/edge type, where N is the number of requested partitions, so when writing back the masks and labels we still create as many files as there were incoming partitions.
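
For concreteness, here is a minimal sketch of the flow described above. This is not the PR's actual code: the NODE_INT_MAPPING column value, the helper name, and the paths are illustrative assumptions.

```python
# Illustrative sketch only -- column name, helper name, and paths are assumed.
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import DataFrame

NODE_INT_MAPPING = "node_int_id"  # assumed name of the integer node id column


def write_ordered_labels(label_df: DataFrame, out_prefix: str, num_files: int) -> None:
    # Sort on the integer node id so row order matches the node id mapping,
    # then collect to the Spark leader, where a single writer controls order.
    ordered_pd = label_df.sort(NODE_INT_MAPPING).toPandas()
    table = pa.Table.from_pandas(ordered_pd, preserve_index=False)

    # DistPart needs at least num_files files per node/edge type, so split the
    # ordered table into that many consecutive row slices (possibly empty) and
    # write each slice to its own Parquet file.
    Path(out_prefix).mkdir(parents=True, exist_ok=True)
    rows_per_file = -(-table.num_rows // num_files)  # ceiling division
    for file_idx in range(num_files):
        start = file_idx * rows_per_file
        length = max(0, min(rows_per_file, table.num_rows - start))
        pq.write_table(
            table.slice(start, length),
            f"{out_prefix}/part-{file_idx:05d}.parquet",
        )
```

Collecting to the leader trades distributed writes for a hard ordering guarantee that Spark's Parquet writer does not provide on its own.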

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo added the ready (able to trigger the CI), gsprocessing (for issues and PRs related to the GSProcessing library), and 0.4.1 labels on Jan 17, 2025
@thvasilo force-pushed the reorder-node-ids branch 2 times, most recently from 4d0afcd to 865d943 on January 17, 2025 19:28
@thvasilo marked this pull request as ready for review on January 17, 2025 19:42
@thvasilo requested a review from jalencato on January 17, 2025 19:42
@thvasilo self-assigned this on Jan 17, 2025
@thvasilo added this to the 0.4.1 release milestone on Jan 17, 2025
@jalencato (Collaborator) left a comment:

Do we have any performance numbers comparing the built-in Spark method and the pandas UDF method?

@thvasilo added the bug (Something isn't working) label on Jan 22, 2025
@jalencato (Collaborator) left a comment:

Overall LGTM. Let's keep an eye on further experiments.

@@ -18,6 +18,8 @@
import logging
from typing import Any, Dict, Optional

from graphstorm_processing.constants import VALID_TASK_TYPES

@jalencato (Collaborator):
Remove the additional line.

@@ -194,8 +197,6 @@ def create_new_relative_path_from_existing(
"path/to/parquet-repartitioned-my-suffix/part-00003-filename.snappy.parquet"
"""
original_relative_path_obj = Path(original_relative_path)
# We expect files to have a path of the form /path/to/parquet/part-00001.snappy.parquet
assert original_relative_path_obj.parts[-2] == "parquet" # TODO: Remove this assumption?

@jalencato (Collaborator):
Did we run into any issue with this?

# Generate Spark-style part filename
part_filename = os.path.join(
    base_path,
    f"part-{file_idx:05d}-{unique_id}.parquet",
)

@jalencato (Collaborator):
So unlike PySpark, pyarrow does not support built-in part-file naming?

@thvasilo (Contributor, author):
Yes. We could try using pyarrow.dataset to do the writing, which would provide that, but it actually complicates things. Writing each file ourselves gives us explicit control over how rows are distributed across files.
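
For comparison, here is a hedged sketch of the pyarrow.dataset alternative mentioned above, which the PR deliberately avoids: write_dataset can auto-generate Spark-style names via basename_template, but the row-to-file distribution is then steered indirectly through row-count limits rather than by slicing rows explicitly. Data and paths are illustrative.

```python
# Sketch of the pyarrow.dataset alternative; illustrative data and paths.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"label": [0, 1, 2, 3, 4, 5]})
ds.write_dataset(
    table,
    base_dir="path/to/parquet",
    format="parquet",
    # "{i}" is auto-incremented: part-0.parquet, part-1.parquet, ...
    basename_template="part-{i}.parquet",
    # File count is controlled indirectly via row limits, not explicit slices.
    max_rows_per_file=2,
    max_rows_per_group=2,
    existing_data_behavior="overwrite_or_ignore",
)
```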

node_int_ids = np.arange(total_data_points)

rng = np.random.default_rng(42)
# Create random numerical labels with values 0-9

@jalencato (Collaborator):
Do we cover the case of None values?
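
A hypothetical way to cover that case in the test above (not code from the PR; it reuses the snippet's rng and total_data_points, and the labels variable name is assumed):

```python
# Hypothetical extension of the test fixture to include missing labels.
labels = rng.integers(0, 10, size=total_data_points).astype(object)
# Blank out ~10% of the labels to simulate nodes without a label.
none_idx = rng.choice(total_data_points, size=total_data_points // 10, replace=False)
labels[none_idx] = None
```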

Labels: 0.4.1 · bug (Something isn't working) · gsprocessing (For issues and PRs related to the GSProcessing library) · ready (able to trigger the CI)
Projects: None yet
2 participants: thvasilo, jalencato