[WIP] Apply obstore as storage backend #3033

Open
wants to merge 18 commits into
base: master
Conversation

Contributor

@machichima machichima commented Jan 4, 2025

Tracking issue

Related to flyteorg/flyte#4081

Why are the changes needed?

Use obstore, a Rust/PyO3 package, as the storage backend for cloud storage. This reduces the size of the dependencies and enables users to use their own versions of s3fs, gsfs, abfs, and so on.

What changes were proposed in this pull request?

Use obstore as the storage backend to replace s3fs, gsfs, and abfs.

How was this patch tested?

Setup process

Screenshots

Performance

  • put file to minio

put_file_runtime

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

Implementation of obstore as the new storage backend for cloud services in Flytekit, replacing direct cloud storage implementations with obstore-based filesystem classes. The implementation includes enhanced path splitting, S3 retry support, and updated Azure storage configuration. Features include lru_cache decorators for storage initialization, improved error handling, and modifications to S3, GCS, and Azure storage provider initialization. Package updated from 0.3.0b9 to 0.3.0b10 in pyproject.toml, providing robust bucket handling and async filesystem support.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 4

Contributor

flyte-bot commented Jan 4, 2025

Code Review Agent Run #39883a

Actionable Suggestions - 7
  • plugins/flytekit-spark/flytekitplugins/spark/models.py - 2
    • Missing pod parameters in with_overrides method · Line 79-80
    • Consider adding null validation checks · Line 193-194
  • flytekit/core/data_persistence.py - 5
Additional Suggestions - 3
  • flytekit/core/data_persistence.py - 2
    • Consider optimizing bucket extraction timing · Line 521-522
    • Consider combining empty dict initializations · Line 59-60
  • plugins/flytekit-spark/tests/test_spark_task.py - 1
Review Details
  • Files reviewed - 5 · Commit Range: 64c6c79..0187150
    • Dockerfile.dev
    • flytekit/core/data_persistence.py
    • plugins/flytekit-spark/flytekitplugins/spark/models.py
    • plugins/flytekit-spark/flytekitplugins/spark/task.py
    • plugins/flytekit-spark/tests/test_spark_task.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

Contributor

flyte-bot commented Jan 4, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Feature Improvement - Cloud Storage Backend Migration to obstore

Dockerfile.dev - Added obstore package dependency

pyproject.toml - Updated obstore package version to 0.3.0b10

data_persistence.py - Implemented obstore-based storage backend with enhanced path handling and configuration

obstore_filesystem.py - Added new filesystem classes for S3, GCS and Azure using obstore

test_data.py - Updated storage backend tests to use obstore implementations

test_data.py - Improved code formatting and readability in test cases

test_data_persistence.py - Updated Azure storage tests with mocking and base64 encoding

test_flyte_directory.py - Updated S3 filesystem test mocking to use new obstore implementation

Comment on lines 79 to 80
driver_pod=self.driver_pod,
executor_pod=self.executor_pod,
Contributor

Missing pod parameters in with_overrides method

Consider adding driver_pod and executor_pod to the with_overrides method to maintain consistency with the constructor parameters.

Code suggestion
Check the AI-generated fix before applying
 @@ -56,6 +56,8 @@ def with_overrides(
          new_spark_conf: Optional[Dict[str, str]] = None,
          new_hadoop_conf: Optional[Dict[str, str]] = None,
          new_databricks_conf: Optional[Dict[str, Dict]] = None,
 +        driver_pod: Optional[K8sPod] = None,
 +        executor_pod: Optional[K8sPod] = None,
      ) -> "SparkJob":
          if not new_spark_conf:
              new_spark_conf = self.spark_conf
 @@ -65,6 +67,12 @@ def with_overrides(
          if not new_databricks_conf:
              new_databricks_conf = self.databricks_conf
 
 +        if not driver_pod:
 +            driver_pod = self.driver_pod
 +
 +        if not executor_pod:
 +            executor_pod = self.executor_pod
 +
          return SparkJob(
              spark_type=self.spark_type,
              application_file=self.application_file,

Code Review Run #39883a



Comment on lines 193 to 194
driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
Contributor

Consider adding null validation checks

Consider adding null checks for to_flyte_idl() calls on driver_pod and executor_pod to avoid potential NoneType errors.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod else None,
 -executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod else None,
 +driverPod=self.driver_pod.to_flyte_idl() if self.driver_pod and hasattr(self.driver_pod, 'to_flyte_idl') else None,
 +executorPod=self.executor_pod.to_flyte_idl() if self.executor_pod and hasattr(self.executor_pod, 'to_flyte_idl') else None,

Code Review Run #39883a



Comment on lines 119 to 121
if "file" in path:
    # no bucket for file
    return "", path
Contributor

Improve file protocol detection precision

The condition if "file" in path may match paths containing 'file' anywhere in the string, not just the protocol. Consider using if get_protocol(path) == "file" for more precise protocol checking.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -if "file" in path:
 +if get_protocol(path) == "file":
      # no bucket for file
      return "", path

Code Review Run #39883a
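
To make the substring concern concrete, here is a minimal sketch. It assumes a hypothetical `protocol_of` helper built on the standard library's `urlparse` as a stand-in for `get_protocol`; it is not the helper flytekit uses, and the example paths are made up.

```python
from urllib.parse import urlparse


def protocol_of(path: str) -> str:
    """Hypothetical stand-in for get_protocol: URL scheme, or "file" for plain paths."""
    scheme = urlparse(path).scheme
    return scheme or "file"


def is_local_by_substring(path: str) -> bool:
    # The check under review: matches "file" anywhere in the string.
    return "file" in path


def is_local_by_protocol(path: str) -> bool:
    # Looks only at the protocol part of the path.
    return protocol_of(path) == "file"


remote = "s3://my-bucket/profile/data.csv"  # "profile" contains "file"
local = "file:///tmp/data.csv"

print(is_local_by_substring(remote), is_local_by_protocol(remote))
print(is_local_by_substring(local), is_local_by_protocol(local))
```

With the substring check, any remote key that happens to contain "file" (for example "profile") would be treated as a local path with no bucket.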



Comment on lines 136 to 141
support_types = ["s3", "gs", "abfs"]
if protocol in support_types:
    file_path = "/".join(path_li[1:])
    return (bucket, file_path)
else:
    return bucket, path
Contributor

Consider moving storage types to constant

The list of supported storage types support_types = ['s3', 'gs', 'abfs'] could be defined as a module-level constant since it's used for validation. Consider moving it outside the function to improve maintainability.

Code suggestion
Check the AI-generated fix before applying
 @@ -53,1 +53,2 @@
  _ANON = "anon"
 +SUPPORTED_STORAGE_TYPES = ["s3", "gs", "abfs"]
 @@ -136,2 +136,1 @@
 -        support_types = ["s3", "gs", "abfs"]
 -        if protocol in support_types:
 +        if protocol in SUPPORTED_STORAGE_TYPES:

Code Review Run #39883a



kwargs["store"] = store

if anonymous:
    kwargs[_ANON] = True
Contributor

Consider using anonymous parameter for _ANON

Consider using kwargs[_ANON] = anonymous instead of hardcoding True to maintain consistency with the input parameter value.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -kwargs[_ANON] = True
 +kwargs[_ANON] = anonymous

Code Review Run #39883a



Comment on lines +433 to +434
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Contributor

Consider validating bucket before filesystem call

Consider validating the bucket parameter before passing it to get_async_filesystem_for_path(). An empty bucket could cause issues with certain storage backends. Similar issues were also found in:

  • flytekit/core/data_persistence.py (line 318)
  • flytekit/core/data_persistence.py (line 521)
  • flytekit/core/data_persistence.py (line 308)
Code suggestion
Check the AI-generated fix before applying
Suggested change
  bucket, to_path_file_only = split_path(to_path)
 +protocol = get_protocol(to_path)
 +if protocol in ['s3', 'gs', 'abfs'] and not bucket:
 +    raise ValueError(f'Bucket cannot be empty for {protocol} protocol')
  file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Code Review Run #39883a



Contributor

flyte-bot commented Jan 4, 2025

Code Review Agent Run #8926b7

Actionable Suggestions - 0
Review Details
  • Files reviewed - 1 · Commit Range: 0187150..7c76cc6
    • pyproject.toml
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


Contributor

flyte-bot commented Jan 5, 2025

Code Review Agent Run #0b7f4d

Actionable Suggestions - 4
  • flytekit/core/data_persistence.py - 4
Additional Suggestions - 1
  • flytekit/core/data_persistence.py - 1
    • Consider combining dictionary initializations · Line 59-60
Review Details
  • Files reviewed - 3 · Commit Range: 58ba73c..353f000
    • Dockerfile.dev
    • flytekit/core/data_persistence.py
    • pyproject.toml
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


Comment on lines +433 to +434
bucket, to_path_file_only = split_path(to_path)
file_system = await self.get_async_filesystem_for_path(to_path, bucket)
Contributor

Consider extracting path splitting logic

Consider extracting the bucket and path splitting logic into a separate method to improve code reusability and maintainability. The split_path function is used in multiple places and could be encapsulated better.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -bucket, to_path_file_only = split_path(to_path)
 +bucket, path = self._split_and_get_bucket_path(to_path)
  file_system = await self.get_async_filesystem_for_path(to_path, bucket)

Code Review Run #0b7f4d



Comment on lines +391 to +392
bucket, from_path_file_only = split_path(from_path)
file_system = await self.get_async_filesystem_for_path(from_path, bucket)
Contributor

Handle empty bucket case for storage

Consider handling the case where split_path() returns empty bucket for non-file protocols. Currently passing empty bucket to get_async_filesystem_for_path() could cause issues with cloud storage access.

Code suggestion
Check the AI-generated fix before applying
Suggested change
  bucket, from_path_file_only = split_path(from_path)
 +protocol = get_protocol(from_path)
 +if protocol not in ['file'] and not bucket:
 +    raise ValueError(f'Empty bucket not allowed for protocol {protocol}')
  file_system = await self.get_async_filesystem_for_path(from_path, bucket)

Code Review Run #0b7f4d
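
A minimal, standard-library-only sketch of the kind of empty-bucket guard discussed in this comment. The names `require_bucket` and `CLOUD_PROTOCOLS` are hypothetical, and `urlparse` stands in for the `split_path` and `get_protocol` helpers from the PR, so treat it as an illustration rather than the PR's code.

```python
from urllib.parse import urlparse

# Protocols that need a bucket or container (taken from the suggestion above).
CLOUD_PROTOCOLS = ("s3", "gs", "abfs")


def require_bucket(path: str) -> str:
    """Return the bucket for a path, raising if a cloud URL has none."""
    parsed = urlparse(path)
    if parsed.scheme in CLOUD_PROTOCOLS and not parsed.netloc:
        raise ValueError(f"Bucket cannot be empty for {parsed.scheme} protocol: {path!r}")
    # Local paths legitimately have no bucket, so an empty string is returned.
    return parsed.netloc


print(require_bucket("s3://my-s3-bucket/bucket1/data.csv"))
print(repr(require_bucket("file:///tmp/data.csv")))
```

Failing early with a clear message keeps a malformed URL such as `s3:///key` from reaching the storage backend with an empty bucket name.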



Comment on lines 735 to 737
fsspec.register_implementation("s3", AsyncFsspecStore)
fsspec.register_implementation("gs", AsyncFsspecStore)
fsspec.register_implementation("abfs", AsyncFsspecStore)
Contributor

Consider relocating fsspec implementation registrations

Consider moving the fsspec implementation registrations to a more appropriate initialization location, such as a module-level __init__.py or a dedicated setup function. This would improve code organization and make the registrations more discoverable.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -fsspec.register_implementation("s3", AsyncFsspecStore)
 -fsspec.register_implementation("gs", AsyncFsspecStore)
 -fsspec.register_implementation("abfs", AsyncFsspecStore)
 +def register_fsspec_implementations():
 +    fsspec.register_implementation("s3", AsyncFsspecStore)
 +    fsspec.register_implementation("gs", AsyncFsspecStore)
 +    fsspec.register_implementation("abfs", AsyncFsspecStore)
 +
 +# Call during module initialization
 +register_fsspec_implementations()

Code Review Run #0b7f4d



Specify the class properties for each file storage

Signed-off-by: machichima <[email protected]>

codecov bot commented Jan 5, 2025

Codecov Report

Attention: Patch coverage is 43.15789% with 54 lines in your changes missing coverage. Please review.

Project coverage is 47.06%. Comparing base (0ad84f3) to head (deb9f3d).

Files with missing lines | Patch % | Lines
flytekit/core/data_persistence.py | 29.16% | 50 Missing and 1 partial ⚠️
flytekit/core/obstore_filesystem.py | 86.95% | 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3033       +/-   ##
===========================================
- Coverage   82.79%   47.06%   -35.73%     
===========================================
  Files           3      202      +199     
  Lines         186    21277    +21091     
  Branches        0     2740     +2740     
===========================================
+ Hits          154    10015     +9861     
- Misses         32    10773    +10741     
- Partials        0      489      +489     


Contributor

flyte-bot commented Jan 5, 2025

Code Review Agent Run #e101cd

Actionable Suggestions - 3
  • flytekit/core/obstore_filesystem.py - 1
    • Consider using DEFAULT_BLOCK_SIZE constant instead · Line 21-21
  • flytekit/core/data_persistence.py - 2
Review Details
  • Files reviewed - 3 · Commit Range: 353f000..deb9f3d
    • flytekit/core/data_persistence.py
    • flytekit/core/obstore_filesystem.py
    • tests/flytekit/unit/core/test_flyte_directory.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


connect_timeout = 5
retries = 5
read_timeout = 15
default_block_size = 5 * 2**20
Contributor

Consider using DEFAULT_BLOCK_SIZE constant instead

Consider using the DEFAULT_BLOCK_SIZE constant defined on line 9 instead of duplicating the value 5 * 2**20 in ObstoreS3FileSystem. This would improve maintainability and reduce the risk of inconsistencies.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -default_block_size = 5 * 2**20
 +default_block_size = DEFAULT_BLOCK_SIZE

Code Review Run #e101cd



Member

We can remove this and use the one defined at line 9, right?

Contributor Author

Sure, sorry, I forgot to use the one on line 9. I just changed it to default_block_size = DEFAULT_BLOCK_SIZE.

},
)

kwargs["retries"] = s3_cfg.retries
Contributor

Consider validating S3 retries value

Consider validating the retries value before assigning it to kwargs. A negative or extremely large value could cause issues.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -kwargs["retries"] = s3_cfg.retries
 +if s3_cfg.retries is not None and 0 <= s3_cfg.retries <= 10:
 +    kwargs["retries"] = s3_cfg.retries

Code Review Run #e101cd



Comment on lines +121 to +123
support_types = ["s3", "gs", "abfs"]
protocol = get_protocol(path)
if protocol not in support_types:
Contributor

Consider moving support types to constant

Consider moving the support_types list to a module-level constant since it represents static configuration data. This would improve maintainability and reusability.

Code suggestion
Check the AI-generated fix before applying
 @@ -1,1 +1,3 @@
 +SUPPORTED_PROTOCOLS = ["s3", "gs", "abfs"]
 +
  def split_path(path: str) -> Tuple[str, str]:
 -    support_types = ["s3", "gs", "abfs"]
 -    protocol = get_protocol(path)
 -    if protocol not in support_types:
 +    protocol = get_protocol(path)
 +    if protocol not in SUPPORTED_PROTOCOLS:

Code Review Run #e101cd
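
For readers following the `split_path` discussion, a rough sketch of what a bucket/key split with a module-level constant could look like. This is not the PR's implementation (which joins path segments after the protocol); it uses the standard library's `urlparse` as an assumption, and `SUPPORTED_PROTOCOLS` is the name proposed in the suggestion above.

```python
from typing import Tuple
from urllib.parse import urlparse

# Module-level constant, as proposed in the review suggestion.
SUPPORTED_PROTOCOLS = ("s3", "gs", "abfs")


def split_path(path: str) -> Tuple[str, str]:
    """Return (bucket, key) for supported cloud URLs, else ("", path)."""
    parsed = urlparse(path)
    if parsed.scheme not in SUPPORTED_PROTOCOLS:
        # Local and unsupported paths have no bucket; pass them through unchanged.
        return "", path
    # netloc is the bucket or container; the rest is the object key.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_path("s3://my-s3-bucket/bucket1/file.txt"))
print(split_path("file:///abc/happy/"))
```

A tuple is used for the constant here so it cannot be mutated by accident; a list as in the suggestion would behave the same for membership checks.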



@flyteorg flyteorg deleted a comment from flyte-bot Jan 6, 2025
@@ -46,47 +48,128 @@

# Refer to https://github.com/fsspec/s3fs/blob/50bafe4d8766c3b2a4e1fc09669cf02fb2d71454/s3fs/core.py#L198
Contributor

let's update this link if we're going to change the args.

Contributor Author

Sure! I updated the link in the new commit

store_kwargs["endpoint_url"] = s3_cfg.endpoint
# kwargs["client_kwargs"] = {"endpoint_url": s3_cfg.endpoint}

store = S3Store.from_env(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cache these setup args functions? I think each call to S3Store creates a new client under the hood in the object store library. Let's add lru_cache to this call? @pingsutw
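
A small sketch of the caching idea raised here, using only the standard library. `FakeStore` and `store_from_env` are hypothetical stand-ins (the real store classes come from obstore and are not used in this example); the point is only how `functools.lru_cache` behaves with keyword arguments.

```python
from functools import lru_cache


class FakeStore:
    """Hypothetical stand-in for an object-store client such as S3Store."""

    instances = 0

    def __init__(self, bucket: str, **config: str) -> None:
        FakeStore.instances += 1
        self.bucket = bucket
        self.config = config


@lru_cache
def store_from_env(bucket: str, **config: str) -> FakeStore:
    # lru_cache builds its key from the arguments, so every value must be
    # hashable (strings are fine; passing a dict or list raises TypeError).
    return FakeStore(bucket, **config)


a = store_from_env("my-bucket", endpoint_url="http://localhost:9000")
b = store_from_env("my-bucket", endpoint_url="http://localhost:9000")
c = store_from_env("other-bucket", endpoint_url="http://localhost:9000")

print(a is b, a is c, FakeStore.instances)
```

Repeated calls with the same bucket and config reuse one client, which is why the cached function takes flat keyword arguments rather than a config dictionary.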

assert if the specific function is called with provided parameters

Signed-off-by: machichima <[email protected]>
Contributor

flyte-bot commented Jan 8, 2025

Code Review Agent Run #ffca15

Actionable Suggestions - 6
  • tests/flytekit/unit/core/test_data.py - 3
    • Consider fixing assertion tuple comparison syntax · Line 69-71
    • Consider reordering mock setup and assertions · Line 248-255
    • Consider meaningful value for empty parameter · Line 377-378
  • flytekit/core/data_persistence.py - 3
    • Consider impact of changing auth constant · Line 53-53
    • Consider consistent type for anonymous flag · Line 71-72
    • Consider using boolean instead of string · Line 159-159
Additional Suggestions - 4
  • tests/flytekit/unit/core/test_data.py - 4
    • Consider single line set definition · Line 484-487
    • Consider single line function signature · Line 566-568
    • Consider single line FileAccessProvider initialization · Line 543-545
    • Consider single line initialization for readability · Line 593-595
Review Details
  • Files reviewed - 4 · Commit Range: deb9f3d..a1c99ec
    • flytekit/core/data_persistence.py
    • flytekit/core/obstore_filesystem.py
    • tests/flytekit/unit/core/test_data.py
    • tests/flytekit/unit/core/test_data_persistence.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


Comment on lines +69 to +71
assert (
"file:///abc/happy/"
), "s3://my-s3-bucket/bucket1/" == local_raw_fp.recursive_paths(
Contributor

Consider fixing assertion tuple comparison syntax

The assertion syntax appears incorrect. The tuple construction and comparison operator placement seems to be malformed. Consider restructuring the assertion to properly compare the tuple values. A similar issue was also found in tests/flytekit/unit/core/test_data.py (line 69-71).

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -assert (
 -    "file:///abc/happy/"
 -), "s3://my-s3-bucket/bucket1/" == local_raw_fp.recursive_paths(
 +assert ("file:///abc/happy/", "s3://my-s3-bucket/bucket1/") == \
 +    local_raw_fp.recursive_paths(
 +        "file:///abc/happy/", "s3://my-s3-bucket/bucket1/")

Code Review Run #ffca15
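
A short sketch of why the original assertion shape may deserve a second look. `recursive_paths` below is a hypothetical stand-in that deliberately returns wrong values; it is not the flytekit method. In Python, `assert X, Y` treats `Y` as the failure message, so a parenthesised string followed by a comma is tested on its own.

```python
def recursive_paths(src: str, dst: str):
    """Hypothetical stand-in that returns deliberately wrong values."""
    return ("wrong", "values")


def original_style_check() -> bool:
    # Parsed as: assert <non-empty string>, <message expression>.
    # The string is always truthy, so the comparison on the right is only the
    # failure message; it is never evaluated and the assertion cannot fail.
    assert (
        "file:///abc/happy/"
    ), "s3://my-s3-bucket/bucket1/" == recursive_paths("file:///abc/happy/", "s3://my-s3-bucket/bucket1/")
    return True


def tuple_style_check() -> bool:
    # Comparing two tuples makes the assertion depend on the returned values.
    try:
        assert ("file:///abc/happy/", "s3://my-s3-bucket/bucket1/") == recursive_paths(
            "file:///abc/happy/", "s3://my-s3-bucket/bucket1/"
        )
    except AssertionError:
        return False
    return True


print(original_style_check(), tuple_style_check())
```

The first form passes even though the stand-in returns wrong values, while the tuple comparison detects the mismatch.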



Comment on lines 248 to 255
mock_from_env.return_value = mock.Mock()
mock_from_env.assert_called_with(
    "",
    config={
        "aws_allow_http": "true",  # Allow HTTP connections
        "aws_virtual_hosted_style_request": "false",  # Use path-style addressing
    },
)
Contributor

Consider reordering mock setup and assertions

Consider moving the mock setup (mock_from_env.return_value) before the s3_setup_args call, since mock setup should ideally be done before exercising the code under test.

Code suggestion
Check the AI-generated fix before applying
 @@ -242,14 +242,14 @@
  def test_s3_setup_args_env_empty(mock_from_env, mock_os, mock_get_config_file):
      mock_get_config_file.return_value = None
      mock_os.get.return_value = None
 +    mock_from_env.return_value = mock.Mock()
      s3c = S3Config.auto()
      kwargs = s3_setup_args(s3c)
 -
 -    mock_from_env.return_value = mock.Mock()
      mock_from_env.assert_called_with(
          "",
          config={
              "aws_allow_http": "true",  # Allow HTTP connections
              "aws_virtual_hosted_style_request": "false",  # Use path-style addressing
          },
      )

Code Review Run #ffca15
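
A minimal arrange/act/assert sketch with `unittest.mock`, to show what changes when `return_value` is set after the call. `build_store` is a hypothetical function under test, not code from the PR.

```python
from unittest import mock


def build_store(factory, bucket: str):
    """Hypothetical code under test: calls the factory and returns its result."""
    return factory(bucket, config={"aws_allow_http": "true"})


sentinel_store = object()

# Arrange first: configure the mock before exercising the code under test.
factory = mock.Mock()
factory.return_value = sentinel_store
result = build_store(factory, "")
factory.assert_called_with("", config={"aws_allow_http": "true"})

# Setting return_value after the call: the call arguments are still recorded,
# but the object already returned to the code under test is unaffected.
late = mock.Mock()
late_result = build_store(late, "")
late.return_value = sentinel_store
late.assert_called_with("", config={"aws_allow_http": "true"})

print(result is sentinel_store, late_result is sentinel_store)
```

Both `assert_called_with` checks pass, so the ordering only matters when the test also inspects what the mocked call returned.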



Comment on lines +377 to +378
mock_from_env.assert_called_with(
"",
Contributor

Consider meaningful value for empty parameter

Consider providing a meaningful value for the empty string parameter in mock_from_env.assert_called_with(). An empty string for what appears to be a path/endpoint parameter may not properly test the intended behavior.

Code suggestion
Check the AI-generated fix before applying
Suggested change
  mock_from_env.assert_called_with(
 -    "",
 +    "s3://test-bucket",

Code Review Run #ffca15



_ANON = "anon"
_FSSPEC_S3_KEY_ID = "access_key_id"
_FSSPEC_S3_SECRET = "secret_access_key"
_ANON = "skip_signature"
Contributor

Consider impact of changing auth constant

Consider if changing _ANON constant from "anon" to "skip_signature" might affect existing code that relies on this value. This appears to be a breaking change in the S3 authentication configuration.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -_ANON = "skip_signature"
 +# TODO: Deprecate "anon" in future versions
 +_ANON = "anon"  # or support both: _ANON = ("anon", "skip_signature")

Code Review Run #ffca15



Comment on lines 71 to 72
if anonymous:
    store_kwargs[_ANON] = "true"
Contributor

Consider consistent type for anonymous flag

The _ANON value is being set to 'true' as a string in s3_setup_args() but was previously being set to True boolean. This type inconsistency could cause issues with S3 authentication.

Code suggestion
Check the AI-generated fix before applying
Suggested change
  if anonymous:
 -    store_kwargs[_ANON] = "true"
 +    store_kwargs[_ANON] = True

Code Review Run #ffca15



kwargs[_ANON] = anonymous
store_kwargs["tenant_id"] = azure_cfg.tenant_id
if anonymous:
    kwargs[_ANON] = "true"
Contributor

Consider using boolean instead of string

Consider using a boolean value directly instead of string 'true' for the _ANON parameter to maintain type consistency. Many systems interpret string 'true' differently than boolean True.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -kwargs[_ANON] = "true"
 +kwargs[_ANON] = True

Code Review Run #ffca15



Contributor

flyte-bot commented Jan 10, 2025

Code Review Agent Run #88938f

Actionable Suggestions - 0
Review Details
  • Files reviewed - 1 · Commit Range: a1c99ec..9c7e8db
    • pyproject.toml
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


Contributor

flyte-bot commented Jan 11, 2025

Code Review Agent Run #2ba550

Actionable Suggestions - 4
  • tests/flytekit/unit/core/test_data.py - 1
    • Consider retaining AWS config test parameters · Line 312-312
  • flytekit/core/data_persistence.py - 3
    • Consider consolidating store creation functions · Line 61-85
    • Consider preserving anonymous access functionality · Line 74-76
    • Consider explicit parameters over kwargs unpacking · Line 179-179
Additional Suggestions - 2
  • tests/flytekit/unit/core/test_data.py - 2
    • Consider if storage options test is complete · Line 338-338
    • Consider updating mock patch path consistently · Line 241-241
Review Details
  • Files reviewed - 4 · Commit Range: 9c7e8db..9f5daf0
    • flytekit/core/data_persistence.py
    • flytekit/core/obstore_filesystem.py
    • tests/flytekit/unit/core/test_data.py
    • tests/flytekit/unit/core/test_data_persistence.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


assert kwargs == {"cache_regions": True}

mock_from_env.return_value = mock.Mock()
mock_from_env.assert_called_with("")
Contributor

Consider retaining AWS config test parameters

The mock_from_env assertion appears to be missing configuration parameters that were previously set for AWS HTTP connections and virtual hosted style requests. This could affect test coverage of S3 configuration behavior.

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -mock_from_env.assert_called_with("")
 +mock_from_env.assert_called_with(
 +    "",
 +    config={
 +        "aws_allow_http": "true",
 +        "aws_virtual_hosted_style_request": "false",
 +    },
 +)

Code Review Run #2ba550



Comment on lines +61 to +85
def s3store_from_env(bucket: str, retries: int, **store_kwargs) -> S3Store:
    store = S3Store.from_env(
        bucket,
        config={
            **store_kwargs,
            "aws_allow_http": "true",  # Allow HTTP connections
            "aws_virtual_hosted_style_request": "false",  # Use path-style addressing
        },
    )
    return store


@lru_cache
def gcsstore_from_env(bucket: str) -> GCSStore:
    store = GCSStore.from_env(bucket)
    return store


@lru_cache
def azurestore_from_env(container: str, **store_kwargs) -> AzureStore:
    store = AzureStore.from_env(
        container,
        config=store_kwargs,
    )
    return store
Contributor

Consider consolidating store creation functions

Consider consolidating the store creation functions into a single factory function to reduce code duplication. The current implementation has similar patterns repeated across s3store_from_env, gcsstore_from_env, and azurestore_from_env.

Code suggestion
Check the AI-generated fix before applying
 -@lru_cache
 -def s3store_from_env(bucket: str, retries: int, **store_kwargs) -> S3Store:
 -    store = S3Store.from_env(
 -        bucket,
 -        config={
 -            **store_kwargs,
 -            "aws_allow_http": "true",
 -            "aws_virtual_hosted_style_request": "false",
 -        },
 -    )
 -    return store
 -
 -@lru_cache
 -def gcsstore_from_env(bucket: str) -> GCSStore:
 -    store = GCSStore.from_env(bucket)
 -    return store
 -
 -@lru_cache
 -def azurestore_from_env(container: str, **store_kwargs) -> AzureStore:
 -    store = AzureStore.from_env(
 -        container,
 -        config=store_kwargs,
 -    )
 -    return store
 +@lru_cache
 +def create_store(store_type: str, container: str, **kwargs) -> Union[S3Store, GCSStore, AzureStore]:
 +    if store_type == "s3":
 +        return S3Store.from_env(container, config={**kwargs, "aws_allow_http": "true", "aws_virtual_hosted_style_request": "false"})
 +    elif store_type == "gcs":
 +        return GCSStore.from_env(container)
 +    elif store_type == "azure":
 +        return AzureStore.from_env(container, config=kwargs)
 +    raise ValueError(f"Unsupported store type: {store_type}")

Code Review Run #2ba550



Comment on lines +74 to +76
def gcsstore_from_env(bucket: str) -> GCSStore:
    store = GCSStore.from_env(bucket)
    return store
Contributor

Consider preserving anonymous access functionality

Consider passing the anonymous parameter to gcsstore_from_env() since it was previously used to set the token to _ANON when anonymous=True

Code suggestion
Check the AI-generated fix before applying
Suggested change
 -def gcsstore_from_env(bucket: str) -> GCSStore:
 +def gcsstore_from_env(bucket: str, anonymous: bool = False) -> GCSStore:
      store = GCSStore.from_env(bucket)
 +    if anonymous:
 +        store.token = _ANON
      return store

Code Review Run #2ba550



if anonymous:
    kwargs[_ANON] = "true"

store = azurestore_from_env(container, **store_kwargs)
Contributor

Consider explicit parameters over kwargs unpacking

Consider using direct function call parameters instead of unpacking store_kwargs to improve code readability and maintainability. The function call could be more explicit about what parameters are being passed.

Code suggestion
Check the AI-generated fix before applying
 -    store_kwargs: Dict[str, Any] = {}
 -    if azure_cfg.account_name:
 -        store_kwargs["account_name"] = azure_cfg.account_name
 -    if azure_cfg.account_key:
 -        store_kwargs["account_key"] = azure_cfg.account_key
 -    if azure_cfg.client_id:
 -        store_kwargs["client_id"] = azure_cfg.client_id
 -    if azure_cfg.client_secret:
 -        store_kwargs["client_secret"] = azure_cfg.client_secret
 -    if azure_cfg.tenant_id:
 -        store_kwargs["tenant_id"] = azure_cfg.tenant_id
 -    store = azurestore_from_env(container, **store_kwargs)
 +    store = azurestore_from_env(
 +        container,
 +        account_name=azure_cfg.account_name,
 +        account_key=azure_cfg.account_key,
 +        client_id=azure_cfg.client_id,
 +        client_secret=azure_cfg.client_secret,
 +        tenant_id=azure_cfg.tenant_id
 +    )

Code Review Run #2ba550



Labels
None yet
Projects
Status: In progress
4 participants