
an option to download only added files for a given dataset version #1100

Open
wants to merge 2 commits into master

Conversation

kirillfish

So far, calling the get_local_copy method on a clearml.Dataset object downloads all the files from the dataset AND all of its parents recursively, or creates soft links for files that have been downloaded previously. But there was no way to get only the files added in this particular version of the dataset, ignoring all the parents. This little PR implements exactly that feature.

Testing Instructions

  1. Register a parent dataset and add some files
  2. Register a child dataset, inherited from the first one, and add some more files
  3. Use the only_added argument (False by default):
from clearml import Dataset
dataset = Dataset.get(dataset_name='child')
data_base_dir = dataset.get_local_copy(only_added=True)

data_base_dir will contain only files returned by list_added_files
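For context, the intended semantics can be sketched in plain Python (a hypothetical in-memory model for illustration only; Version, all_files, and added_only are made-up names, not ClearML API):

```python
# Toy model of dataset versioning: each version stores only the files it
# added, and versions chain to a parent like ClearML dataset lineage.
class Version:
    def __init__(self, added_files, parent=None):
        self.added_files = dict(added_files)  # path -> content
        self.parent = parent

    def all_files(self):
        """Full local copy: this version's files plus everything inherited."""
        inherited = self.parent.all_files() if self.parent else {}
        return {**inherited, **self.added_files}

    def added_only(self):
        """Diff-only copy (what this PR adds): just this version's files."""
        return dict(self.added_files)


parent = Version({"train/0.png": "parent"})
child = Version({"test/0.png": "child"}, parent=parent)

print(sorted(child.all_files()))   # ['test/0.png', 'train/0.png']
print(sorted(child.added_only()))  # ['test/0.png']
```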

Other Information

Potential issue:

Soft links are still being created for files in the diff which have already been downloaded - this is ok

But if you first call child.get_local_copy(only_added=True) and then call child.get_local_copy() again, it will not create soft links for the existing files and will download the diff once more -- probably not ok... The same applies to "grandchildren" datasets. Still figuring out why. On the other hand, this could be ok if we assume the only_added=True flag is meant only for debugging or for quickly inspecting datasets.

@eugen-ajechiloae-clearml
Collaborator

I believe that this will also download the modified files (which is good), but maybe the name only_added is not appropriate. How about ignore_parent_datasets?

@kirillfish kirillfish marked this pull request as ready for review August 28, 2023 16:14
@kirillfish
Author

@eugen-ajechiloae-clearml done, I renamed it

unified_list = set()
for ds_id in datasets:
dataset = self.get(dataset_id=ds_id)
unified_list |= set(dataset._dataset_file_entries.values())
Collaborator

@kirillfish This line seems to break:

    unified_list |= set(dataset._dataset_file_entries.values())
TypeError: unhashable type: 'FileEntry'

Did you mean to construct the set based on the ids? Like:

unified_list |= {entry.id for entry in dataset._dataset_file_entries.values()}
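The failure is easy to reproduce without ClearML. A minimal sketch, using a mock FileEntry (defining __eq__ without __hash__ makes a class unhashable in Python, which matches the error above):

```python
# Mock FileEntry for illustration only -- NOT clearml's class. Defining
# __eq__ without __hash__ sets __hash__ to None, so instances cannot be
# placed in a set, reproducing the TypeError from the review comment.
class FileEntry:
    def __init__(self, entry_id, relative_path):
        self.id = entry_id
        self.relative_path = relative_path

    def __eq__(self, other):
        return self.id == other.id


entries = {
    "a.txt": FileEntry("1", "a.txt"),
    "b.txt": FileEntry("2", "b.txt"),
}

unified_list = set()
try:
    unified_list |= set(entries.values())
except TypeError as e:
    print(e)  # unhashable type: 'FileEntry'

# The suggested fix works because the ids are hashable strings:
unified_list |= {entry.id for entry in entries.values()}
print(sorted(unified_list))  # ['1', '2']
```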


@kirillfish btw, here's a simple test snippet (you'll need to adapt it a little) to check that it works properly.

from pathlib import Path
import os

from clearml import Dataset, StorageManager



def main():
    manager = StorageManager()

    print("STEP1 : Downloading mnist dataset")
    mnist_dataset = Path(manager.get_local_copy(
        remote_url="https://allegro-datasets.s3.amazonaws.com/datasets/MNIST.zip", name="MNIST"))
    mnist_dataset_train = mnist_dataset / "TRAIN"
    mnist_dataset_test = mnist_dataset / "TEST"

    print("STEP2 : Creating the training dataset")
    mnist_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Training Dataset")
    mnist_dataset.add_files(path=mnist_dataset_train, dataset_path="TRAIN")
    mnist_dataset.upload()
    mnist_dataset.finalize()

    print("STEP3 : Create a child dataset of mnist dataset using TEST Dataset")
    child_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Complete Dataset", parent_datasets=[mnist_dataset.id])
    child_dataset.add_files(path=mnist_dataset_test, dataset_path="TEST")
    child_dataset.upload()
    child_dataset.finalize()

    print("We are done, have a great day :)")
    
main()
dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy()
print(list(Path(dataset_path).glob("*")))

import shutil  # os.rmtree does not exist; use shutil.rmtree instead
shutil.rmtree(dataset_path)

dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy(ignore_parent_datasets=True)
print(list(Path(dataset_path).glob("*")))

Member

@kirillfish any news on this?

Member

Hi @kirillfish , any update?
