
Preload data method added #2

Merged — 4 commits merged into main on Dec 16, 2024
Conversation

@konstntokas (Collaborator) commented Dec 10, 2024

So far, only an in-built monitoring table is shown when running preload_data; it is updated as soon as the status, progress, or message changes. See the notebook examples/zenodo_data_store_preload.ipynb.

Tests for the preload still need to be added.

@b-yogesh left a comment


Please have a look at my suggestions and questions.

    ext = data_id.split(".")[-1]
    format_id = MAP_FILE_EXTENSION_FORMAT.get(ext.lower())
    return format_id

def estimate_file_format(data_id: str) -> Union[str, None]:


Since estimate sounds a bit like guessing, a better name could be extract, identify, or something along those lines.

Collaborator Author

changed it to identify

self.data_id = data_id
self.status = "Not started"
self.progress = 0.0
self.message = "Preloading not started jet."


I think you meant yet here

Collaborator Author

Thanks for reading carefully! :)

LOG = logging.getLogger(__name__)


class Event:


A better name could be Task instead. Could you explain why you chose Event? Then we could decide whether it makes sense to rename it.

Comment on lines +193 to +200
if list_data_ids_mod:
    LOG.info(
        f"{data_id} is already pre-loaded. The datasets can be "
        f"opened with the following data IDs: "
        f"\n{'\n'.join(str(item) for item in list_data_ids_mod)}"
    )
elif self.cache_store.has_data(data_id):
    LOG.info(f"{data_id} is already pre-loaded.")


What is the difference between the if and elif condition here? They both tell the user that the data_id is already pre-loaded.

Collaborator Author

It depends on the compressed file. If only one dataset is in the compressed file, or all datasets are merged (this is a preload parameter), then the data_id stays. If multiple files are in the compressed file and they cannot be merged, then the data_id is extended by the file name. Hence the two cases in the if and elif.

Collaborator Author

In the first case, I want to tell the user the extended data IDs.
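Taken together, the two cases could be sketched like this (a hypothetical helper; the name, signature, and the exact ID-extension scheme are illustrative assumptions, not the merged implementation):

```python
def expand_data_ids(data_id: str, member_names: list[str], merge: bool) -> list[str]:
    """Sketch: derive the cached data IDs for one compressed file.

    If the compressed file holds a single dataset, or all datasets are
    merged (a preload parameter), the original data_id stays. Otherwise
    the data_id is extended by each member file name.
    """
    if merge or len(member_names) == 1:
        return [data_id]
    # Drop the archive extension and append each member file name.
    stem = data_id.rsplit(".", 1)[0]
    return [f"{stem}/{name}" for name in member_names]
```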

Comment on lines +156 to +158
self._download_data(*data_ids)
self._decompress_data(*data_ids)
self._prepare_data(*data_ids, **preload_params)


QQ: We only have 3 threads in total, one for each of these tasks, right? So each thread will do its task for all the data_ids. If that is the case, I do not see how these threads are useful here, since they download the data one after the other, decompress after the download for each data_id one after the other, and so on. If you want to use concurrency to send multiple download requests, the approach needs to be updated so that each data_id has its own thread and is not dependent on the other data_ids' status.
If this is not the case, I would be happy to discuss how this works.

Collaborator Author

I did it in three threads so as not to open up too many threads. What happens when the user preloads 100 data IDs?

But I agree. I think it would be better to give each data ID its own thread, and maybe limit it to a maximum of 10 threads or so.
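The per-data-ID approach with a capped pool could look roughly like this (a sketch; preload_one stands in for the real download/decompress/prepare pipeline and is not the actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # cap so that preloading 100 data IDs does not spawn 100 threads

def preload_one(data_id: str) -> str:
    # Placeholder for the real per-ID pipeline:
    # download -> decompress -> prepare, independent of all other data IDs.
    return f"{data_id}: prepared"

def preload(data_ids: list[str]) -> list[str]:
    # Each data ID gets its own task; the pool limits concurrency to MAX_WORKERS,
    # so slow downloads of one ID no longer block the others.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(preload_one, data_ids))
```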

@forman self-requested a review on December 11, 2024 10:36

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 29.61538% with 183 lines in your changes missing coverage. Please review.

Project coverage is 48.45%. Comparing base (021eda3) to head (a405658).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
xcube_zenodo/preload.py 18.51% 154 Missing ⚠️
xcube_zenodo/store.py 50.00% 27 Missing ⚠️
xcube_zenodo/_utils.py 84.61% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##              main       #2       +/-   ##
============================================
- Coverage   100.00%   48.45%   -51.55%     
============================================
  Files            6        7        +1     
  Lines          118      355      +237     
============================================
+ Hits           118      172       +54     
- Misses           0      183      +183     


@@ -116,7 +116,7 @@
],
"source": [
"%%time\n",
"access_token = \"ZsZVfyPCmLYRZQtfSYWruNwXYBykonv0pXZYnrQYNNL0gGMJipYsx0CYvOSB\"\n",
"access_token = \"fill in you Zenodo access token here\"\n",
Member

Suggested change
"access_token = \"fill in you Zenodo access token here\"\n",
"access_token = \"fill in your Zenodo access token here\"\n",

pyproject.toml Outdated
@@ -18,7 +18,10 @@ readme = {file = "README.md", content-type = "text/markdown"}
license = {text = "MIT"}
requires-python = ">=3.10"
dependencies = [
"IPython",
Member

This should be an optional dependency. Check dynamically for the existence of IPython at runtime.

environment.yml Outdated
@@ -4,7 +4,10 @@ channels:
dependencies:
# Required
- python>=3.10
- IPython
Member

Note, don't include this in the conda recipe. It should be an optional dependency.

from xcube.core.store import MULTI_LEVEL_DATASET_TYPE
from xcube.core.store import DataTypeLike

from typing import Any, Container, Union
Member

Because we are already at Python 3.10, don't use Union; use the | operator instead.

    ext = data_id.split(".")[-1]
    format_id = MAP_FILE_EXTENSION_FORMAT.get(ext.lower())
    return format_id

def identify_file_format(data_id: str) -> Union[str, None]:
Member

Suggested change
def identify_file_format(data_id: str) -> Union[str, None]:
def identify_file_format(data_id: str) -> str | None:

Or use Optional[str].



_LOG = logging.getLogger("xcube")
LOG = logging.getLogger(__name__)
Member

This will evaluate to logging.getLogger("store"). Too unspecific; use the logger name "xcube-zenodo" instead.
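The fix is a one-liner; an explicit project-specific name also keeps the logger stable no matter how the module ends up being imported:

```python
import logging

# An explicit name avoids an unspecific logger such as "store",
# which __name__ can collapse to depending on how the module is loaded.
LOG = logging.getLogger("xcube-zenodo")
```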



class ZenodoDataStore(DataStore):
"""Implementation of the Zenodo data store defined in the ``xcube_zenodo``
plugin."""

def __init__(self, access_token: str):
def __init__(self, access_token: str, preload_cache_folder: str = None):
Member

Add parameters that configure the cache data store to be used: cache_store_id: str, cache_store_params: dict[str, Any].

self._is_cancelled = False
self._is_closed = False
self._cache_store = cache_store
self._cache_fs: fsspec.AbstractFileSystem = cache_store.fs
Collaborator Author

@forman can give me feedback on this. I removed os completely, but I read out root and fs from the cache data store, which I doubt is correct.

Member

No, that's fine, root and fs belong to the FsDataStore's API.

@konstntokas merged commit aa6ef13 into main on Dec 16, 2024
1 of 3 checks passed
@konstntokas deleted the konstntokas-xxx-add_preload_for_zip branch on December 19, 2024 12:02