Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1140 - List Files #1626

Draft
wants to merge 10 commits into
base: dev
Choose a base branch
from
Draft

#1140 - List Files #1626

wants to merge 10 commits into from

Conversation

devclinton
Copy link
Member

@devclinton devclinton commented Sep 7, 2021

#1140
#1625

TODO add pytest-lazxfixture

TODO add ticket for support PathLink on download to path

TODO Add support for exclude filters to filter assets/files

TODO add ticket to add to local platform. Can be tackled outside of this PR

Depends on https://github.com/InstituteforDiseaseModeling/pyCOMPS/issues/39 for tests

@devclinton
Copy link
Member Author

devclinton commented Sep 7, 2021

@shchen-idmod, @ZDu-IDM this is core functionality to mainly replace the download command. It adds new APIs to the platform to allow fetching files. This should be useful in many other situations as well since we get nice pythonic code to iterate files like this

for asset in experiment.list_assets(filters=["*.txt"])
     asset.download_to_path(output_path)
  
for simulation in experiment.simulations:
    for asset in simulation.list_files(filters=["*.txt"])
       asset.download_to_path(output_path)

and

files = experiment.list_children(filters=[".txt"]
for simulation, files in files.items():
    for file in files:
       file.download_to_path(f"{experiment.id}/{simulation.id}/")

Another thing we could do is have users have alternates to analyzers. For example

# get csvs
files = experiment.list_children(filters=[".csv"]
content = []
for simulation, files in files.items():
    for file in files:
       #create data frames(select data)
       content.append(pd.read_csv(file.download_stream().decode('utf-8')))
# aggregate into one file
df = pd.concat(content)
df.to_csv('all.csv')

       

@devclinton devclinton added the Core label Sep 7, 2021
@shchen-idmod
Copy link
Collaborator

idmtools_platform_comps/tests/test_experiment_operations.py, test_list_assets used to return 5 files for the experiment, but now only return 1 file. is this new?

@devclinton
Copy link
Member Author

idmtools_platform_comps/tests/test_experiment_operations.py, test_list_assets used to return 5 files for the experiment, but now only return 1 file. is this new?

Yes. Before, the assets would be duplicates. For example, an experiment with 5 simulations containing using the asset model1.py would contain 5 copies of model1.py

@devclinton
Copy link
Member Author

devclinton commented Sep 7, 2021

I still need to update tests for this. Also, a note for you sharon. I tried using fixtures here to see how we could leverage these on COMPS specifically to reuse test data more often. This has some advantages of making some tests cleaner when we want to verify on multiple COMPS environments. I am not tied to it for this, but thought it would be worth exploring as I developed this

@shchen-idmod
Copy link
Collaborator

idmtools_platform_comps/tests/test_experiment_operations.py, test_list_assets used to return 5 files for the experiment, but now only return 1 file. is this new?

Yes. Before, the assets would be duplicates. For example, an experiment with 5 simulations containing using the asset model1.py would contain 5 copies of model1.py

before it was not 5 copies of model1.py. they are one model1.py and 4 different config.json from 4 simulations.

@devclinton
Copy link
Member Author

idmtools_platform_comps/tests/test_experiment_operations.py, test_list_assets used to return 5 files for the experiment, but now only return 1 file. is this new?

Yes. Before, the assets would be duplicates. For example, an experiment with 5 simulations containing using the asset model1.py would contain 5 copies of model1.py

before it was not 5 copies of model1.py. they are one model1.py and 4 different config.json from 4 simulations.

Ah yes, those were incorrectly being listed under list_assets since they are not technically assets, but instead input files. They now show under list_files instead.

assert EXPECTED_ASSETS_PER_MODEL1_EXPERIMENT == len(experiment_assets)
assert experiment_assets[0].filename == "model1.py"
assert experiment_assets[0].absolute_path is None
assert experiment_assets[0].checksum == UUID('926ee7aa-4f29-bc94-ff99-8c5f83a6d85a')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got different UUID:
test_file_listings.py:91 (test_list_files_experiment[python_files_and_assets_slurm2])
UUID('8e02e44c-5754-c79c-2b5c-9ffc5278c59c') != UUID('926ee7aa-4f29-bc94-ff99-8c5f83a6d85a')
I am using windows

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With files, there is no true UUID. Uuid is a checksum of the object

@shchen-idmod
Copy link
Collaborator

idmtools_platform_comps/tests/test_experiment_operations.py, test_list_assets used to return 5 files for the experiment, but now only return 1 file. is this new?

Yes. Before, the assets would be duplicates. For example, an experiment with 5 simulations containing using the asset model1.py would contain 5 copies of model1.py

before it was not 5 copies of model1.py. they are one model1.py and 4 different config.json from 4 simulations.

Ah yes, those were incorrectly being listed under list_assets since they are not technically assets, but instead input files. They now show under list_files instead.

So we need to update this testcase

@shchen-idmod
Copy link
Collaborator

examples/cookbook/download_files_from_comps.py download output.zip file now(used to unziped file) which is OK. but I can not unzip output.zip file(always throw some error)

@shchen-idmod
Copy link
Collaborator

idmtools_platform_comps/tests/test_download_workitem.py has 2 tests failed:
test_comps_download()
test_inputs_as_id_files()
we can add extract_after_download=True, to DownloadWorkItem call to fix these tests

@shchen-idmod
Copy link
Collaborator

shchen-idmod commented Sep 8, 2021

simulation_operations.py list_files with filters fail as
linked_files = self.platform._simulations.list_files(sim, filters=['*.json']):

        eid = "e046967b-d110-ec11-92df-f0921c167864"
        e_p: Experiment = Experiment.from_id(eid)
        with self.subTest("test_list_files_simulations"):
            for sim in e_p.simulations:
                linked_files = self.platform._simulations.list_files(sim, filters=['*.json'])

with error:


..\idmtools_platform_comps\comps_operations\simulation_operations.py:653: in list_files
    return apply_file_filters(simulation._comps_metadata['output_files'], filters)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

assets = [<Asset: output\result.json from None>, <Asset: config.json from None>, <Asset: stderr.txt from None>, <Asset: stdout.txt from None>]
filters = ['*.json']

    def apply_file_filters(assets: List['Asset'], filters: TFILE_FILTER_TYPE) -> List['Asset']:
        """
        Apply a list of filter functions to a list of assets.
    
        Args:
            assets: Assets to apply functions to
            filters: List of functions
    
        Returns:
            List of filtered assets
        """
        if filters:
            result = []
            for file in assets:
                for file_filter in filters:
>                   if file_filter(file.short_remote_path()):
E                   TypeError: 'str' object is not callable

..\..\idmtools_core\idmtools\utils\filters\asset_filters.py:151: TypeError

@shchen-idmod
Copy link
Collaborator

experiment_operations.py list_files return wrong files. following 2 cases should return same number of files: but second case return 0 file for all_linked_files = self.platform._experiments.list_files(e_p)

            eid = "e046967b-d110-ec11-92df-f0921c167864"
            e_p: Experiment = Experiment.from_id(eid)       
            with self.subTest("test_list_files_and_filters_experiment"):
                all_linked_files = self.platform._experiments.list_files(e_p, children=True)
                self.assertEqual(16, len(all_linked_files))

                all_linked_files = self.platform._experiments.list_files(e_p)
               self.assertEqual(16, len(all_linked_files))

@devclinton
Copy link
Member Author

experiment_operations.py list_files return wrong files. following 2 cases should return same number of files: but second case return 0 file for all_linked_files = self.platform._experiments.list_files(e_p)

            eid = "e046967b-d110-ec11-92df-f0921c167864"
            e_p: Experiment = Experiment.from_id(eid)       
            with self.subTest("test_list_files_and_filters_experiment"):
                all_linked_files = self.platform._experiments.list_files(e_p, children=True)
                self.assertEqual(16, len(all_linked_files))

                all_linked_files = self.platform._experiments.list_files(e_p)
               self.assertEqual(16, len(all_linked_files))

I see a difference in the two calls. One is called to list files for experiment and the other is listing files for experiment with children. The second(without children) will always return 0 files for COMPs since it does not support files(non shared assets) on experiments.

@ZDu-IDM
Copy link
Collaborator

ZDu-IDM commented Sep 9, 2021

@devclinton
Some comments for your reference.

1. docstr: simulation or experiment?

File: idmtools_core/idmtools/entities/experiment.py

    def __hash__(self):
        """
        Hash of simulation(id).

        Returns:
            Hash of simulation
        """
        return id(self.uid)

simulation --> experiment?

2. docstr: true --> True

File: idmtools_core/idmtools/entities/experiment.py

    def list_assets(self, children: bool = False, platform: 'IPlatform' = None, filters: TFILE_FILTER_TYPE = None, **kwargs) -> List[Asset]:
        """
        List assets(shared files) for the experiment.

        Args:
            children: When set to true, simulation assets will be loaded as well
            platform: Optional platform to load assets list from
            filters: Filters to apply. These should be a function that takes a str and return true or false
            **kwargs:
        Returns:
            List of assets
        """
children: When set to true, simulation assets will be loaded as well

--->

children: When set to True, simulation assets will be loaded as well

3. DownloadCommand is not used in anywhere

Better to have a test for it.

File: idmtools_core/idmtools/utils/download.py

@dataclass
class DownloadCommand:
    """
    DownloadCommand defines an interface to make downloading and replicating remote experiments locally easy.
    """
    #: List of glob patterns. See https://docs.python.org/3.7/library/glob.html for details on the patterns
    file_patterns: List[str] = field(default_factory=list)
    ......

4. Need to call normalize_filters(...)?

File: idmtools_core/idmtools/utils/filters/asset_filters.py

By definition:

TFILE_FILTER_TYPE = Union[str, List[str], List[Callable[[str], bool]], Callable[[str], bool]]

If filter is type of str or List[str], the following line will fail:

file_filter(file.short_remote_path())

in

def apply_file_filters(assets: List['Asset'], filters: TFILE_FILTER_TYPE) -> List['Asset']:
    if filters:
        result = []
        for file in assets:
            for file_filter in filters:
                if file_filter(file.short_remote_path()):
                    result.append(file)
        return result
    else:
        return assets

Q: maybe we need to do the following first?

filters = normalize_filters(filters)

5. asset pass on all filters or just any one?

File: idmtools_core/idmtools/utils/filters/asset_filters.py

def apply_file_filters(assets: List['Asset'], filters: TFILE_FILTER_TYPE) -> List['Asset']:
    if filters:
        result = []
        for file in assets:
            for file_filter in filters:
                if file_filter(file.short_remote_path()):
                    result.append(file)
        return result
    else:
        return assets

Q: Pass all filters or just one?

If we follow 'any' logic, then

            for file_filter in filters:
                if file_filter(file.short_remote_path()):
                    result.append(file)

-->

            for file_filter in filters:
                if file_filter(file.short_remote_path()):
                    result.append(file)
                    break

Note: adding break to avoid duplicated files added to result.

If we follow 'all' logic, then

            for file_filter in filters:
                if file_filter(file.short_remote_path()):
                    result.append(file)

-->

            if all([file_filter(file.short_remote_path()) for file_filter in filters]):
                result.append(file)

6. What about the case: a.relative_path is None?

File: idmtools_platform_comps/idmtools_platform_comps/comps_operations/asset_collection_operations.py

    def to_entity(self, asset_collection: Union[COMPSAssetCollection, SimulationFile, List[SimulationFile], OutputFileMetadata, List[WorkItemFile]], **kwargs) \
            -> AssetCollection:
         
                ......
                if a.relative_path and a.relative_path == ".":
                    a.relative_path = ""

For the case a.relative_path is None, should we do the same?

           a.relative_path = ""

7. Need to update comment

There is a left over comment: # Send the assets for the experiment

File: idmtools_platform_comps/idmtools_platform_comps/comps_operations/experiment_operations.py

        # Send the assets for the experiment
        ac_id = self.send_assets(experiment)
        e.configuration._asset_collection_id = ac_id
        return e
    # Send the assets for the experiment ---> ? (or just remove the comment)

8. Need to update docstr: Returns

File: idmtools_platform_comps/idmtools_platform_comps/comps_operations/experiment_operations.py

    def send_assets(self, experiment: Experiment, **kwargs) -> UUID:
        """
        Send assets related to the experiment.

        Args:
            experiment: Experiment to send assets for

            **kwargs:

        Returns:
            None
        """

Need to update

        Returns:
            None

9. Missing parent in docstr Args

File: idmtools_platform_comps/idmtools_platform_comps/comps_operations/simulation_operations.py

    def get_asset_collection_from_comps_simulation(self, simulation: COMPSSimulation, parent: Experiment) -> Optional[AssetCollection]:
        """
        Get assets for a comps simulation.

        Args:
            simulation: Experiment to get asset collection for.

        Returns:
            AssetCollection if configuration is set and configuration.asset_collection_id is set.
        """

10. get_experiment_link is not called anywhere

Better to have a usage/test case.

File: idmtools_platform_comps/idmtools_platform_comps/comps_platform.py

    def get_experiment_link(self, experiment: Experiment):
        """
        Get a link to an experiment.

        Args:
            experiment: Experiment to generate link for

        Returns:
            Link to experiment
        """
        return f"{self.endpoint}/#explore/Experiments?filters=Id={experiment.uid}"

@shchen-idmod
Copy link
Collaborator

shchen-idmod commented Sep 10, 2021

this line always fail in debug mode or run 2 platforms
https://github.com/shchen-idmod/idmtools/blob/2e5db55b645d28b06f48d7b4c1670e8effd16209/idmtools_platform_comps/tests/test_experiment_operations.py#L37


..\idmtools_core\idmtools\assets\asset.py:388: in download_to_path
    self.__write_download_generator_to_stream(out)
C:\idmtools_filelist_37\lib\site-packages\backoff\_sync.py:94: in retry
    ret = target(*args, **kwargs)
..\idmtools_core\idmtools\assets\asset.py:350: in __write_download_generator_to_stream
    gen = self.download_generator()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Asset: model1.py from c:\github\idmtools\idmtools_test\idmtools_test\inputs\python\model1.py>

    def download_generator(self) -> Generator[bytearray, None, None]:
        """
        A Download Generator that returns chunks of bytes from the file.
    
        Returns:
            Generator of bytearray
    
        Raises:
            ValueError - When there is not a download generator hook defined
    
        Notes:
            TODO - Add a custom error with doclink.
        """
        if not self.download_generator_hook:
>           raise ValueError("To be able to download, the Asset needs to be fetched from a platform object")
E           ValueError: To be able to download, the Asset needs to be fetched from a platform object

..\idmtools_core\idmtools\assets\asset.py:321: ValueError

@shchen-idmod
Copy link
Collaborator

pycomps 2.5.0 fix suite id issue and it is already in prod artifactory. please also update pycomps version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants