Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api.get_url: use index storage for getting remote URL #9676

Merged
merged 4 commits into from
Jul 4, 2023

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jun 30, 2023

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Fixes #9176

Previously, dvc.api.get_url just tried to get the default or specified remote and then manually generated a URL based on hard-coded 'md5' dict lookup. As a result, it did not support per-output remote: settings and did not support cases where a legacy 2.x output would return md5-dos2unix info instead of md5.

  • DVCFileSystem.info() now returns a properly generated dvc_info['remote_url'] field for DVC outs that gets populated using the actual remote storage for the given data index entry (meaning per-output remote: is used when set, and default or --remote flag otherwise)
  • api.get_url and dvc get --show-url now use the remote_url field from dvcfs.info()

cli:

$ dvc get --show-url . test.txt                      ⏎
s3://peter-test-unversioned-ap/legacy-test/files/md5/a2/ead3516dd1be4a3c7f45716c0a0eb7
$ dvc get --show-url . test-legacy.txt
s3://peter-test-unversioned-ap/legacy-test/26/8a5059001855fef30b4f95f82044ed

api:

>>> from dvc.api import DVCFileSystem, get_url
>>> get_url("test-legacy.txt", repo=".")
's3://peter-test-unversioned-ap/legacy-test/26/8a5059001855fef30b4f95f82044ed'
>>> fs = DVCFileSystem(".")
>>> fs.info("test-legacy.txt")["dvc_info"]["remote_url"]
's3://peter-test-unversioned-ap/legacy-test/26/8a5059001855fef30b4f95f82044ed'

@pmrowla pmrowla added A: api Related to the dvc.api A: get Related to dvc get labels Jun 30, 2023
@pmrowla pmrowla self-assigned this Jun 30, 2023
@pmrowla pmrowla requested a review from a team June 30, 2023 05:26
@codecov
Copy link

codecov bot commented Jun 30, 2023

Codecov Report

Patch coverage: 87.87% and project coverage change: -0.02 ⚠️

Comparison is base (8661abf) 90.55% compared to head (edc303b) 90.53%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9676      +/-   ##
==========================================
- Coverage   90.55%   90.53%   -0.02%     
==========================================
  Files         480      480              
  Lines       36399    36422      +23     
  Branches     5230     5235       +5     
==========================================
+ Hits        32960    32976      +16     
- Misses       2850     2855       +5     
- Partials      589      591       +2     
Impacted Files Coverage Ξ”
dvc/fs/dvc.py 95.18% <81.81%> (-0.59%) ⬇️
dvc/api/data.py 83.33% <86.66%> (-1.39%) ⬇️
dvc/repo/__init__.py 95.02% <100.00%> (+0.05%) ⬆️
dvc/repo/open_repo.py 82.50% <100.00%> (+0.11%) ⬆️
tests/func/api/test_data.py 100.00% <100.00%> (ΓΈ)

... and 1 file with indirect coverage changes

β˜” View full report in Codecov by Sentry.
πŸ“’ Do you have feedback about the report comment? Let us know in this issue.

@efiop
Copy link
Contributor

efiop commented Jun 30, 2023

Should it just use repo.index.data directly instead? Feels like this is a bit out of scope for dvcfs

@pmrowla
Copy link
Contributor Author

pmrowla commented Jun 30, 2023

Should it just use repo.index.data directly instead? Feels like this is a bit out of scope for dvcfs

I'd be fine with making this change.

I went with dvcfs initially because the dvc_info field is already somewhat internal-only (we return things like raw DataIndexEntry()'s in it) but we may want to revisit whether or not that belongs in dvcfs as well

@efiop
Copy link
Contributor

efiop commented Jun 30, 2023

@pmrowla Yeah, those are there just historically since we didn't have index before and dvcfs was the most convenient way to access those. We've removed some over the past months (e.g. isdvc isout from datafs, but not dvcfs yet). I think this PR is alright, but maybe take a look at doing it with index, it might even be easier and cleaner.

@pmrowla pmrowla changed the title dvcfs: return remote_url in dvc_info api.get_url: use index storage for getting remote URL Jul 3, 2023
Comment on lines +57 to +58
if not fs_info and "md5-dos2unix" in dvc_info:
ret["md5-dos2unix"] = dvc_info["md5-dos2unix"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated to the get_url after the index change but legacy hashes still need to be handled in dvcfs

if repo is None:
repo = self._make_repo(url=url, rev=rev, subrepos=subrepos, **repo_kwargs)
assert repo is not None
# pylint: disable=protected-access
repo_factory = repo._fs_conf["repo_factory"]
self._repo_stack.enter_context(repo)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time that dvcfs constructs a new repo (or subrepo) instance we need to ensure that it gets closed cleanly, otherwise the index context for that subrepo could end up being opened/locked and never closed/unlocked

@@ -67,6 +67,7 @@ def open_repo(url, *args, **kwargs):
url = os.getcwd()

if os.path.exists(url):
url = os.path.abspath(url)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously broken in the case where you passed a relative local path like repo='.' into dvc.api calls.

@efiop efiop merged commit eb1ed40 into iterative:main Jul 4, 2023
20 checks passed
@pmrowla pmrowla deleted the dvcfs-info-url branch July 4, 2023 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: api Related to the dvc.api A: get Related to dvc get
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc get --show-url fails with missing md5 key.
2 participants