Preload data method added #2
Conversation
Please have a look at my suggestions and questions.
xcube_zenodo/_utils.py
Outdated
```python
def estimate_file_format(data_id: str) -> Union[str, None]:
    ext = data_id.split(".")[-1]
    format_id = MAP_FILE_EXTENSION_FORMAT.get(ext.lower())
    return format_id
```
Since `estimate` sounds a bit like guessing, a better name could be `extract`, `identify`, or something along those lines.
Changed it to `identify`.
xcube_zenodo/preload.py
Outdated
```python
self.data_id = data_id
self.status = "Not started"
self.progress = 0.0
self.message = "Preloading not started jet."
```
I think you meant `yet` here.
Thanks for reading carefully! :)
```python
LOG = logging.getLogger(__name__)


class Event:
```
A better name could be `Task` instead. Could you explain why you chose `Event`? Maybe then we could decide if it makes sense to rename it.
```python
if list_data_ids_mod:
    LOG.info(
        f"{data_id} is already pre-loaded. The datasets can be "
        f"opened with the following data IDs: "
        f"\n{'\n'.join(str(item) for item in list_data_ids_mod)}"
    )
elif self.cache_store.has_data(data_id):
    LOG.info(f"{data_id} is already pre-loaded.")
```
What is the difference between the `if` and `elif` conditions here? They both tell the user that the data_id is already pre-loaded.
It depends on the compressed file. If only one dataset is in the compressed file, or all datasets are merged (this is a preload parameter), then the data_id stays. If multiple files that cannot be merged are in the compressed file, then the data_id is extended by the file name. Hence the two cases in the `if` and `elif`.
In the first case, I want to tell the user the extended data IDs.
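To make the two cases concrete, here is a minimal, purely illustrative sketch (the helper name and the `/`-joined ID scheme are my assumptions, not the plugin's actual code):

```python
def cache_data_ids(data_id: str, members: list[str], merge: bool) -> list[str]:
    # Hypothetical illustration of the two cases described above:
    # a single dataset (or all datasets merged via the preload
    # parameter) keeps the original data_id; several un-merged files
    # each get the member file name appended to the data_id.
    if merge or len(members) == 1:
        return [data_id]
    return [f"{data_id}/{member}" for member in members]
```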
```python
self._download_data(*data_ids)
self._decompress_data(*data_ids)
self._prepare_data(*data_ids, **preload_params)
```
QQ: We only have 3 threads in total, one for each of these tasks, right? So each thread will do its task for all the data_ids. If that is the case, I do not see how these threads are useful here, since they download the data one after the other, decompress after the download for each data_id one after the other, and so on. If you want to use concurrency to send multiple download requests, the approach needs to be updated so that each data_id has its own thread and is not dependent on the other data_ids' status.
If this is not the case, I would be happy to discuss how this works.
I did it in three threads so as not to open up too many threads. What happens when the user preloads 100 data_ids?
But I agree. I think it would be better to give each data ID its own thread, and maybe limit it to a maximum of 10 threads or so.
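The per-data_id approach with a capped thread count can be sketched with `concurrent.futures.ThreadPoolExecutor`; the function and the pipeline-as-stages shape are assumptions for illustration, not the PR's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def preload_all(data_ids, stages, max_workers=10):
    # Hypothetical sketch: one worker per data_id, capped at
    # max_workers, so preloading 100 data_ids does not spawn
    # 100 threads. Each worker runs the whole pipeline (e.g.
    # download -> decompress -> prepare) for its own data_id,
    # independent of the other data_ids' status.
    def run_pipeline(data_id):
        for stage in stages:
            stage(data_id)
        return data_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in the results
        return list(pool.map(run_pipeline, data_ids))
```

With this shape, a slow download for one data_id no longer blocks the decompress/prepare steps of the others.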
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##              main       #2       +/-  ##
============================================
- Coverage   100.00%   48.45%   -51.55%
============================================
  Files            6        7        +1
  Lines          118      355      +237
============================================
+ Hits           118      172       +54
- Misses           0      183      +183
```

☔ View full report in Codecov by Sentry.
examples/zenodo_data_store.ipynb
Outdated
```diff
@@ -116,7 +116,7 @@
     ],
     "source": [
         "%%time\n",
-        "access_token = \"ZsZVfyPCmLYRZQtfSYWruNwXYBykonv0pXZYnrQYNNL0gGMJipYsx0CYvOSB\"\n",
+        "access_token = \"fill in you Zenodo access token here\"\n",
```
"access_token = \"fill in you Zenodo access token here\"\n", | |
"access_token = \"fill in your Zenodo access token here\"\n", |
pyproject.toml
Outdated
```diff
@@ -18,7 +18,10 @@
 readme = {file = "README.md", content-type = "text/markdown"}
 license = {text = "MIT"}
 requires-python = ">=3.10"
 dependencies = [
+    "IPython",
```
This should be an optional dependency. Check dynamically for the existence of `IPython` at runtime.
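Such a runtime check could look like the following minimal sketch (the helper name is hypothetical, not part of the plugin's API):

```python
import importlib.util


def has_ipython() -> bool:
    # Detect whether IPython is installed without importing it,
    # so the package works when the optional dependency is absent.
    return importlib.util.find_spec("IPython") is not None
```

Callers can then import `IPython.display` lazily inside an `if has_ipython():` branch and fall back to plain `print` otherwise.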
environment.yml
Outdated
```diff
@@ -4,7 +4,10 @@ channels:
 dependencies:
   # Required
   - python>=3.10
+  - IPython
```
Note, don't include this in the conda recipe. It should be an optional dependency.
xcube_zenodo/_utils.py
Outdated
```python
from xcube.core.store import MULTI_LEVEL_DATASET_TYPE
from xcube.core.store import DataTypeLike

from typing import Any, Container, Union
```
Because we are already at Python 3.10, don't use `Union`; use the `|` operator instead.
xcube_zenodo/_utils.py
Outdated
```python
def identify_file_format(data_id: str) -> Union[str, None]:
    ext = data_id.split(".")[-1]
    format_id = MAP_FILE_EXTENSION_FORMAT.get(ext.lower())
    return format_id
```
Suggested change:

```diff
-def identify_file_format(data_id: str) -> Union[str, None]:
+def identify_file_format(data_id: str) -> str | None:
```
Or use `Optional[str]`.
xcube_zenodo/store.py
Outdated
```python
_LOG = logging.getLogger("xcube")
LOG = logging.getLogger(__name__)
```
This will evaluate to `logging.getLogger("store")`. Too unspecific; use the logger name "xcube-zenodo" instead.
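The explicitly named module-level logger would then look like this (a minimal sketch of the suggestion):

```python
import logging

# An explicit, plugin-wide logger name instead of __name__, which
# can resolve to just the bare module name (e.g. "store") depending
# on how the module is loaded.
LOG = logging.getLogger("xcube-zenodo")
```

All modules of the plugin can share this one logger, so users can configure "xcube-zenodo" in one place.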
xcube_zenodo/store.py
Outdated
```diff
 class ZenodoDataStore(DataStore):
     """Implementation of the Zenodo data store defined in the ``xcube_zenodo``
     plugin."""

-    def __init__(self, access_token: str):
+    def __init__(self, access_token: str, preload_cache_folder: str = None):
```
Add parameters that configure the cache data store to be used: `cache_store_id: str`, `cache_store_params: dict[str, Any]`.
```python
self._is_cancelled = False
self._is_closed = False
self._cache_store = cache_store
self._cache_fs: fsspec.AbstractFileSystem = cache_store.fs
```
@forman can give me feedback on this. I removed `os` completely, but I read out `root` and `fs` from the cache data store, which I doubt is correct.
No, that's fine; `root` and `fs` belong to the `FsDataStore`'s API.
So far, only an in-built monitoring table is shown when running `preload_data`, which is updated as soon as the status, progress, or message changes. See the notebook `examples/zenodo_data_store_preload.ipynb`. Tests for the preload still need to be added.