Instantiation after pulling remote dataset is failing #566

Closed
ilongin opened this issue Nov 6, 2024 · 0 comments · Fixed by #573
Labels: bug, priority-p1

ilongin commented Nov 6, 2024

Description

This is a follow-up to #539 (see the comment on #560).

When we try to instantiate a remote dataset right after pulling it, we get errors about missing listing datasets. These listing datasets are referenced in the sources of the dataset we just pulled, but they do not exist in the local DB.
We need to adjust the logic of the cp method (used for instantiating datasets) so that it no longer requires those listing datasets to be present in the DB.

Error example:

Version not specified, pulling the latest one (v1)
Saving dataset ds://02jpg_files@v1 locally: 100%|█████████████████████████████████████████████████████████████████████| 5.00/5.00 [00:00<00:00, 10.2 rows/s]
Dataset ds://02jpg_files@v1 saved locally
_request non-retriable exception: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 126, in retry_request
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/core.py", line 440, in _request
    validate_response(status, contents, path, args)
  File ".../datachain/.venv/lib/python3.12/site-packages/gcsfs/retry.py", line 113, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist)., 401
Error: Dataset lst__gs://datachain-demo/ not found.
Traceback (most recent call last):
  File ".../datachain/src/datachain/cli.py", line 1016, in main
    catalog.pull_dataset(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1454, in pull_dataset
    _instantiate_dataset()
  File ".../datachain/src/datachain/catalog/catalog.py", line 1324, in _instantiate_dataset
    self.cp(
  File ".../datachain/src/datachain/catalog/catalog.py", line 1563, in cp
    node_groups = self.enlist_sources_grouped(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../iterative/datachain/src/datachain/catalog/catalog.py", line 703, in enlist_sources_grouped
    listing = Listing(st, client, self.get_dataset(dataset_name))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../datachain/src/datachain/catalog/catalog.py", line 1090, in get_dataset
    return self.metastore.get_dataset(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../src/datachain/data_storage/metastore.py", line 704, in get_dataset
    raise DatasetNotFoundError(f"Dataset {name} not found.")
datachain.error.DatasetNotFoundError: Dataset lst__gs://datachain-demo/ not found.
Telemetry is disabled by environment variable.
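
The traceback above shows cp failing inside enlist_sources_grouped, where get_dataset cannot find the listing dataset lst__gs://datachain-demo/. Below is a minimal, self-contained sketch of that failure mode and of one possible guard. It is an illustration only: FakeMetastore, split_sources, and the LISTING_PREFIX constant are hypothetical stand-ins (the prefix is simply copied from the error message), and this is not datachain's actual code or the fix that landed in #573.

```python
class DatasetNotFoundError(Exception):
    """Stand-in for datachain.error.DatasetNotFoundError (illustration only)."""


class FakeMetastore:
    """Hypothetical in-memory metastore that only tracks dataset names."""

    def __init__(self, names):
        self._names = set(names)

    def get_dataset(self, name):
        # Mirrors the behaviour seen in the traceback: unknown names raise.
        if name not in self._names:
            raise DatasetNotFoundError(f"Dataset {name} not found.")
        return name


# Assumption: listing datasets are named "lst__" + source URI,
# as suggested by "Dataset lst__gs://datachain-demo/ not found."
LISTING_PREFIX = "lst__"


def split_sources(metastore, sources):
    """Partition sources into those with a local listing dataset and those without."""
    listed, missing = [], []
    for src in sources:
        try:
            metastore.get_dataset(LISTING_PREFIX + src)
        except DatasetNotFoundError:
            missing.append(src)
        else:
            listed.append(src)
    return listed, missing


# After a pull, only the pulled dataset exists locally, not its source listing.
metastore = FakeMetastore(["02jpg_files"])
listed, missing = split_sources(metastore, ["gs://datachain-demo/"])
print(listed, missing)  # [] ['gs://datachain-demo/']
```

With a guard of this shape, the instantiation step could decide what to do with sources that have no local listing instead of raising part-way through.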

To reproduce:

from datachain import DataChain, C

# All datasets below are derived from the same bucket
ds = DataChain.from_storage("gs://datachain-demo")
ds1 = ds.filter(C("file.path").glob("*.jpg")).save("jpg_files")
ds2 = ds.filter(C("file.path").glob("*.png")).save("png_files")
# Union of the two saved datasets
ds4 = ds1.union(ds2)
ds5 = ds4.save("jpg_png_files")

# This is the dataset pulled in the error example (ds://02jpg_files@v1)
dsf1 = ds.filter(C("file.path").glob("*02.jpg")).limit(5).save("02jpg_files")

ds6 = ds5.union(dsf1)
ds6.save("all_files")

Version Info

0.3.11.dev99+g0eabe20
Python 3.12.4

ilongin added the bug and priority-p1 labels on Nov 6, 2024
ilongin self-assigned this on Nov 6, 2024
ilongin linked a pull request on Nov 7, 2024 that will close this issue