[GSProcessing] Enforce re-order for node label processing during classification #1136
base: main
Conversation
Force-pushed 4d0afcd to 865d943
Force-pushed 865d943 to 40b16f9
Force-pushed 40b16f9 to 811f4b1
Do we have any performance numbers comparing the built-in Spark method and the pandas UDF method?
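For context on what the transformation being compared does, here is a hedged, pandas-only sketch of a deterministic label-to-integer mapping (the kind of re-ordered mapping this PR enforces). The function name and signature are hypothetical; the actual GSProcessing implementation lives in dist_label_transformation.py and may differ.

```python
import pandas as pd


def map_labels_deterministically(labels: pd.Series) -> tuple[pd.Series, dict]:
    # Hypothetical sketch: sort the unique label values so the label-to-int
    # mapping is identical across runs and partitions, instead of depending
    # on frequency or encounter order.
    unique = sorted(labels.dropna().unique())
    mapping = {lab: idx for idx, lab in enumerate(unique)}
    return labels.map(mapping), mapping
```

Because the mapping is derived from sorted unique values, re-running the job on reshuffled data produces the same integer IDs.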
...graphstorm_processing/data_transformations/dist_transformations/dist_label_transformation.py
graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
Overall LGTM. Let's keep an eye on further experiments.
@@ -18,6 +18,8 @@
 import logging
 from typing import Any, Dict, Optional

+from graphstorm_processing.constants import VALID_TASK_TYPES
+
Remove the extra blank line.
@@ -194,8 +197,6 @@ def create_new_relative_path_from_existing(
         "path/to/parquet-repartitioned-my-suffix/part-00003-filename.snappy.parquet"
     """
     original_relative_path_obj = Path(original_relative_path)
-    # We expect files to have a path of the form /path/to/parquet/part-00001.snappy.parquet
-    assert original_relative_path_obj.parts[-2] == "parquet"  # TODO: Remove this assumption?
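For reference, a hypothetical reconstruction of the helper this diff touches, based only on the docstring example visible in the hunk; the real GSProcessing function may be implemented differently, and the assert shown here is the one the PR removes.

```python
from pathlib import Path


def create_new_relative_path_from_existing(original_relative_path: str, suffix: str) -> str:
    # Hypothetical sketch: swap the parent "parquet" directory for a
    # "parquet-repartitioned-<suffix>" directory, keeping the part filename.
    p = Path(original_relative_path)
    # We expect files to have a path of the form /path/to/parquet/part-00001.snappy.parquet
    assert p.parts[-2] == "parquet"
    new_parent = f"parquet-repartitioned-{suffix}"
    return str(Path(*p.parts[:-2], new_parent, p.name))
```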
Did we run into any issue with this assumption?
    # Generate Spark-style part filename
    part_filename = os.path.join(
        base_path,
        f"part-{file_idx:05d}-{unique_id}.parquet",
So unlike PySpark, PyArrow does not support the built-in part-file naming?
Yes. We could try using pyarrow.ds.dataset to do the writing and get that naming, but it actually complicates things. Writing the files ourselves gives us explicit control over the distribution of rows.
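A minimal, self-contained sketch of the manual Spark-style naming discussed above; the function name and `uuid`-based unique ID are assumptions for illustration, not the exact GSProcessing code.

```python
import os
import uuid


def make_part_filename(base_path: str, file_idx: int) -> str:
    # Hypothetical sketch: mimic Spark's "part-00000-<id>.parquet" naming
    # when writing Parquet files manually (e.g. with pyarrow.parquet).
    unique_id = uuid.uuid4()
    return os.path.join(base_path, f"part-{file_idx:05d}-{unique_id}.parquet")
```

Generating the names explicitly, rather than delegating to a dataset writer, lets the caller decide exactly how many rows go into each file.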
node_int_ids = np.arange(total_data_points)

rng = np.random.default_rng(42)
# Create random numerical labels with values 0-9
Do we cover the case where labels contain None values?
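A hedged sketch of how the test fixture above could be extended to cover missing labels. The variable names follow the snippet in the diff; the None-masking step is an assumption added for illustration, not code from the PR.

```python
import numpy as np

total_data_points = 1000
node_int_ids = np.arange(total_data_points)

rng = np.random.default_rng(42)
# Create random numerical labels with values 0-9, as object dtype so the
# array can also hold None entries
labels = rng.integers(0, 10, size=total_data_points).astype(object)
# Hypothetical extension: mask roughly 10% of labels as None to exercise
# the missing-label path during classification
missing = rng.random(total_data_points) < 0.1
labels[missing] = None
```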
Issue #, if available:
Fixes #1135
Fixes #1138
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.